VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab
University of Oxford

ArXiv 2025

Figure: input image; generated video (without VMem); generated video (with VMem).

We propose a novel plug-and-play memory mechanism for video models to enable consistent autoregressive scene generation conditioned on camera input. Existing methods either rely on inpainting with explicit geometry estimation, which is prone to inaccuracies, or adopt video-based approaches with limited context windows, resulting in poor long-term coherence. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), which anchors past views to the surface elements (surfels) of the scene they observed. This allows novel view generation to be conditioned on the most relevant past views, rather than solely on the most recent ones, improving long-term scene consistency while reducing computational cost.

Quick Explanation

Approach

Our method retrieves the most relevant past views from the surfel-indexed memory using the target camera as a query.

For each reference view, the camera is represented by a Plücker embedding and the image is encoded into a latent using a VAE. These embeddings and latents are combined with the target Plücker embedding and input noise to generate the novel view.
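To make the camera representation concrete, here is a minimal sketch of a per-pixel Plücker ray embedding in NumPy. It assumes a pinhole camera with intrinsics `K` and a camera-to-world pose matrix, and uses the common (direction, moment) form with unit-length directions. The function name, the pixel-centre convention, and the normalization are illustrative assumptions, not details confirmed by this page.

```python
import numpy as np

def plucker_embedding(K, cam_to_world, height, width):
    """Per-pixel Plücker coordinates (d, o x d) for a pinhole camera.

    Hypothetical helper: K is a 3x3 intrinsics matrix, cam_to_world a 4x4 pose.
    Returns an array of shape (height, width, 6).
    """
    # Pixel centres in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project to camera-space ray directions, then rotate into world space.
    dirs_cam = pixels @ np.linalg.inv(K).T
    R, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Moment of each ray about the world origin: o x d.
    moment = np.cross(origin, dirs)
    return np.concatenate([dirs, moment], axis=-1)
```

Because the moment is a cross product with the direction, the two halves of each 6-vector are orthogonal, which is a quick sanity check on any implementation.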

The generated view is then written back into the memory by updating the surfels or adding new ones based on the predicted geometry. This process is repeated autoregressively for each new view.
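The write-back step can be sketched as follows. This is only an illustration under stated assumptions: the `Surfel` fields, the brute-force nearest match, and the distance and normal-agreement thresholds are all hypothetical, and the page does not specify how the actual merge criterion is defined.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Surfel:
    # Hypothetical surfel record: a small oriented disc plus the views that saw it.
    position: np.ndarray
    normal: np.ndarray
    radius: float
    view_ids: set = field(default_factory=set)

def write_view(memory, points, normals, radii, view_id,
               dist_thresh=0.05, cos_thresh=0.9):
    """Merge surfels estimated from a new view into the memory (a list of Surfel).

    A new surfel that lies close to an existing one with a similar normal
    updates that surfel's view set; otherwise it is appended as a new surfel.
    Thresholds are illustrative assumptions.
    """
    for p, n, r in zip(points, normals, radii):
        p = np.asarray(p, dtype=float)
        n = np.asarray(n, dtype=float)
        match = None
        for s in memory:
            close = np.linalg.norm(s.position - p) < dist_thresh
            aligned = float(np.dot(s.normal, n)) > cos_thresh
            if close and aligned:
                match = s
                break
        if match is not None:
            match.view_ids.add(view_id)   # update an existing surfel
        else:
            memory.append(Surfel(p, n, float(r), {view_id}))  # add a new one
    return memory
```

The linear scan keeps the sketch short; a spatial index would be the natural replacement for larger scenes.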

Surfel-Indexed View Memory

We use a surfel-indexed memory to retrieve and update relevant past views. The reading module renders surfels to find the most frequently represented past timestamps and retrieves the top-K views as references. The writing module estimates the geometry of the newly generated view, converts it into surfels, and merges them into the memory. The view is then stored along with its timestamp and camera pose for future use.
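The reading step amounts to a vote over past views. The sketch below assumes the surfels have already been rendered from the target camera, so that each pixel carries the list of view indices attached to the surfel visible there; `pixel_view_ids` and the tie-breaking rule (prefer more recent views) are assumptions made for illustration.

```python
from collections import Counter

def retrieve_top_k(pixel_view_ids, k):
    """Return the k past view indices that appear most often across rendered pixels.

    pixel_view_ids: iterable with one entry per pixel, each a collection of the
    view indices stored on the surfel visible at that pixel (empty if none).
    """
    votes = Counter()
    for view_ids in pixel_view_ids:
        votes.update(view_ids)
    # Rank by vote count; break ties in favour of more recent (larger) indices.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], -item[0]))
    return [view for view, _ in ranked[:k]]
```

For example, `retrieve_top_k([[0, 1], [1], [1, 2], [2], []], k=2)` selects views 1 and 2, since they cover three and two pixels respectively while view 0 covers only one.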

BibTeX

@misc{li2025vmemconsistentinteractivevideo,
  title={VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory}, 
  author={Runjia Li and Philip Torr and Andrea Vedaldi and Tomas Jakab},
  year={2025},
  eprint={2506.18903},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.18903}, 
}