VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Runjia Li, Philip Torr, Andrea Vedaldi, Tomas Jakab
University of Oxford

In ICCV 2025 ⭐highlight⭐

[Figure: input image, generated video without VMem, and generated video with VMem]

We propose a novel plug-and-play memory module for video models that enables consistent autoregressive scene generation conditioned on camera input. Existing methods either rely on inpainting with explicit geometry estimation, which is prone to inaccuracies, or adopt video-based approaches with limited context windows, which results in poor long-term coherence. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), which anchors past views to the surface elements (surfels) of the scene they observed. Novel view generation can then be conditioned on the most relevant past views rather than only on the most recent ones, improving long-term scene consistency while reducing computational cost.



Approach

[Figure: VMem method overview, showing surfel-indexed view memory for consistent video scene generation]

Our method retrieves the most relevant past views from the surfel-indexed memory using the target camera as a query.

For each reference view, the camera is represented by a Plücker embedding and the image is encoded into a latent using a VAE. These embeddings and latents are combined with the target Plücker embedding and input noise to generate the novel view.
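To make the camera conditioning concrete, here is a minimal sketch of a per-pixel Plücker ray embedding in NumPy. It is not taken from our released code: the function name `plucker_embedding`, the channel ordering (direction first, then moment), the unit-length normalization, and the half-pixel offset are all assumptions chosen for illustration. The only fixed ingredient is the standard Plücker parameterization of a ray with origin o and direction d as the pair (d, o × d).

```python
import numpy as np

def plucker_embedding(K, cam_to_world, height, width):
    """Return per-pixel Plücker ray coordinates of shape (height, width, 6).

    K            : (3, 3) pinhole intrinsics (assumed convention).
    cam_to_world : (4, 4) camera-to-world pose (assumed convention).
    Channels 0..2 hold the unit ray direction d, channels 3..5 the moment o x d.
    """
    # Pixel centers in homogeneous image coordinates (half-pixel offset is an assumption).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project pixels to camera-space ray directions, then rotate to world space.
    dirs_cam = pixels @ np.linalg.inv(K).T
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # All rays share the camera center as origin; the moment is o x d.
    origin = np.broadcast_to(t, dirs_world.shape)
    moment = np.cross(origin, dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)
```

The resulting six-channel map has the same spatial layout as the image, so it can be resized to the latent resolution and concatenated with the VAE latent of the corresponding view.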

The generated view is then written back into the memory by updating the surfels or adding new ones based on the predicted geometry. This process is repeated autoregressively for each new view.
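The control flow of this read, generate, write cycle can be sketched as follows. This is an illustrative skeleton only: `generate_sequence` and the three callables `read_memory`, `generate_view`, and `write_memory` are hypothetical placeholders for the memory reader, the video model, and the memory writer, and writing the input image into the memory before the first step is an assumption.

```python
def generate_sequence(first_view, first_pose, target_poses,
                      read_memory, generate_view, write_memory):
    """Autoregressively generate one view per target pose.

    read_memory(pose)         -> list of indices of past views to use as references
    generate_view(refs, pose) -> new view, given (view, pose) reference pairs
    write_memory(view, pose, timestamp) -> stores the view in the memory
    All three callables are hypothetical stand-ins, not a real API.
    """
    views, poses = [first_view], [first_pose]
    # Assumption: the input image seeds the memory at timestamp 0.
    write_memory(first_view, first_pose, 0)

    for timestamp, pose in enumerate(target_poses, start=1):
        # Read: query the memory with the target camera.
        ref_ids = read_memory(pose)
        refs = [(views[i], poses[i]) for i in ref_ids]
        # Generate: condition the model on the retrieved references.
        new_view = generate_view(refs, pose)
        # Write: store the new view so later steps can retrieve it.
        write_memory(new_view, pose, timestamp)
        views.append(new_view)
        poses.append(pose)
    return views
```

Because references are chosen by the memory rather than by recency, a view generated many steps earlier can still condition the current step when the camera returns to the same region.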

Surfel-Indexed View Memory

We use a surfel-indexed memory to retrieve and update relevant past views. The reading module renders surfels to find the most frequently represented past timestamps and retrieves the top-K views as references. The writing module estimates the geometry of the newly generated view, converts it into surfels, and merges them into the memory. The view is then stored along with its timestamp and camera pose for future use.
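The reading and writing steps can be illustrated with a deliberately simplified sketch. Several parts are assumptions made for brevity and are not the implementation described in the paper: the `Surfel` fields, the function names, and the merge criterion (distance and normal-agreement thresholds) are hypothetical, and the reader projects surfel centers as points without occlusion handling instead of rendering surfel disks.

```python
from collections import Counter
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Surfel:
    position: np.ndarray  # (3,) world-space center
    normal: np.ndarray    # (3,) unit normal
    radius: float
    view_ids: set = field(default_factory=set)  # timestamps of views that observed it

def read_memory(surfels, K, world_to_cam, height, width, top_k):
    """Return the top_k view ids most frequently attached to visible surfels."""
    R, t = world_to_cam[:3, :3], world_to_cam[:3, 3]
    cam_center = -R.T @ t
    votes = Counter()
    for s in surfels:
        p_cam = R @ s.position + t
        if p_cam[2] <= 0:  # behind the camera
            continue
        u, v, w = K @ p_cam
        u, v = u / w, v / w
        if not (0 <= u < width and 0 <= v < height):  # outside the image
            continue
        if np.dot(s.normal, cam_center - s.position) <= 0:  # back-facing
            continue
        votes.update(s.view_ids)  # one vote per stored view id
    return [view_id for view_id, _ in votes.most_common(top_k)]

def write_memory(surfels, points, normals, radius, view_id,
                 merge_dist, min_normal_dot=0.9):
    """Merge new points into existing surfels, or add new surfels."""
    for p, n in zip(points, normals):
        for s in surfels:
            close = np.linalg.norm(s.position - p) < merge_dist
            aligned = np.dot(s.normal, n) > min_normal_dot
            if close and aligned:
                s.view_ids.add(view_id)  # existing surfel now also indexes this view
                break
        else:
            surfels.append(Surfel(np.array(p, dtype=float),
                                  np.array(n, dtype=float),
                                  radius, {view_id}))
```

In this sketch, `points` and `normals` stand for the geometry estimated from the newly generated view. The linear search in `write_memory` is only for readability; a spatial index would be needed at realistic scene sizes.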

Paper

arXiv Preprint

BibTeX

  @article{li2025vmem,
    title={VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory},
    author={Li, Runjia and Torr, Philip and Vedaldi, Andrea and Jakab, Tomas},
    journal={arXiv preprint arXiv:2506.18903},
    year={2025}
  }