1UC Irvine 2UC San Diego 3City University of Hong Kong 4University of Pennsylvania 5Adobe Research
Modeling scenes using video generation models has garnered growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, resulting in incomplete coverage and poor global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of the panoramic representation, enabling long-horizon scene wandering. Our framework begins with a preview stage, in which a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. In the subsequent refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets comprising both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of the framework, including real-time video generation and 3D reconstruction.
Given an input image or video along with a camera trajectory, the preview stage generates an 81-frame panoramic video at 480 × 960 resolution.
The refine stage temporally extends and spatially upsamples the preview output, producing 641-frame panoramic videos at 720 × 1440 resolution for high-fidelity scene wandering.
OmniRoam uses a two-stage, global-to-local pipeline for long-horizon panoramic video generation. The preview stage rapidly constructs a mid-resolution, accelerated overview by decoupling trajectory conditioning into orthogonal flow (direction) and scale (speed) components. The refine stage then transforms this preview into a high-fidelity, normal-speed video, using scale alignment and visibility masks as structural guidance to perform temporal extension and spatial upsampling segment by segment.
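The flow/scale decoupling of trajectory conditioning can be sketched for an explicit camera path. Below is a minimal illustration, assuming the trajectory is given as per-frame camera positions; the function name and representation are ours, not the paper's actual conditioning interface.

```python
import numpy as np

def decouple_trajectory(positions):
    """Split a camera trajectory into per-step direction (unit flow vectors)
    and scale (speed magnitudes).

    positions: (N, 3) array of camera positions, one per frame.
    Returns (directions, scales): (N-1, 3) and (N-1,) arrays.
    """
    deltas = np.diff(positions, axis=0)          # per-step displacement
    scales = np.linalg.norm(deltas, axis=1)      # speed component
    safe = np.where(scales > 1e-8, scales, 1.0)  # guard stationary steps
    directions = deltas / safe[:, None]          # direction component
    return directions, scales

# Toy trajectory: a straight line traversed with increasing speed.
pos = np.array([[0, 0, 0], [1, 0, 0], [3, 0, 0], [6, 0, 0]], dtype=float)
dirs, spd = decouple_trajectory(pos)
```

Separating the two components lets direction and speed be conditioned on (and edited) independently, which is the point of the decoupling.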
When a camera traverses a closed-loop trajectory and returns to its starting point, the final frames of the generated panoramic video must seamlessly match the initial frame. This principle of loop consistency is essential for evaluating long-horizon video generation, as it explicitly ensures strict spatial and temporal coherence across the entire sequence and prevents structural drift during extended scene exploration.
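One simple way to turn this principle into a number is to compare the final frames of a closed-loop video against the first frame, e.g. with PSNR. The sketch below is an illustrative metric of our own construction, not the paper's exact evaluation protocol.

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, peak]."""
    mse = np.mean((a - b) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def loop_consistency(frames, tail=5):
    """Average PSNR between the first frame and the last `tail` frames.
    High values mean the closed-loop trajectory returns to a view that
    matches the starting frame; low values indicate structural drift."""
    first = frames[0]
    return float(np.mean([psnr(first, f) for f in frames[-tail:]]))

rng = np.random.default_rng(0)
first = rng.random((8, 16, 3))
mid = [rng.random((8, 16, 3)) for _ in range(6)]
# A video that closes the loop: the last frames nearly repeat the first.
closed = [first] + mid + [first + 0.01 * rng.standard_normal((8, 16, 3))
                          for _ in range(5)]
# A video that has drifted: the last frames are unrelated to the first.
drifted = [first] + mid + [rng.random((8, 16, 3)) for _ in range(5)]

score_closed = loop_consistency(closed)
score_drifted = loop_consistency(drifted)
```

On this toy data the loop-closing video scores far higher (around 40 dB) than the drifted one, matching the intuition that drift shows up as a low end-of-loop score.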
CLIP Similarity over Loop Trajectories. We plot per-frame CLIP similarity to the first frame under loop trajectories for long-horizon generation (641 frames). Our method shows the expected trend: similarity decreases as the camera moves away and recovers as the trajectory closes the loop. In contrast, (a) the autoregressive variant exhibits a largely monotonic decline (drift from the initial view), and (b) the perspective-video variant shows weaker recovery with noticeable structural degradation. Representative frames are shown alongside the curves.
[Plot legends: (a) Autoregressive vs. Ours; (b) Perspective vs. Ours]
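Curves of this kind can be computed from per-frame image embeddings. The sketch below assumes the CLIP embeddings have already been computed (the model itself is omitted) and measures cosine similarity to the first frame; the synthetic circular walk merely stands in for a loop trajectory.

```python
import numpy as np

def similarity_to_first(embeddings):
    """Cosine similarity of each frame's embedding to the first frame's.
    embeddings: (N, D) array of per-frame image embeddings (e.g. CLIP)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return e @ e[0]

# Synthetic embeddings mimicking a loop trajectory: the embedding walks
# away from the start and returns as the camera closes the loop.
t = np.linspace(0.0, 2.0 * np.pi, 100)
emb = np.stack([np.cos(t), np.sin(t)], axis=1)   # walk on the unit circle
sim = similarity_to_first(emb)
# sim dips mid-trajectory and recovers toward 1 at loop closure.
```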
Quantitative Comparison. Our method consistently outperforms prior approaches across all evaluated metrics, achieving superior visual quality, stronger trajectory controllability, and higher loop consistency. FAED, SSIM, and LPIPS are evaluated on 81 frames, while loop consistency is reported on the full sequences.
Design Analysis. We analyze key design choices, including video representation (ours: panoramic vs. perspective) and generation strategy (ours: global-to-local vs. direct autoregressive). We report FAED, SSIM, LPIPS, and loop consistency over the full video sequences. For long videos (641 frames), we additionally report the average PSNR over extended temporal windows, specifically frames 610–615 (PSNR'615) and 630–635 (PSNR'635).
To achieve real-time panoramic video generation, OmniRoam employs a self-forcing distillation technique that compresses the full model into a lightweight autoregressive previewer. By conditioning each subsequent frame on the model's own predictions and matching the teacher model's output distribution, this approach drastically reduces generation time, producing an 81-frame sequence in seconds while preserving the overall scene structure.
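The core idea, conditioning on the model's own rollout while matching a teacher, can be shown with a deliberately tiny toy: a linear "next-frame" student distilled from a linear teacher along the student's own autoregressive trajectory. This is only an analogy of our own making; the actual method distills a diffusion video model, which this sketch does not attempt to reproduce.

```python
import numpy as np

rng = np.random.default_rng(42)

# Teacher: a fixed linear "next-frame" operator we want to distill.
W_teacher = np.array([[0.9, 0.1], [-0.1, 0.9]])
# Student: same architecture, randomly initialized weights.
W_student = 0.1 * rng.standard_normal((2, 2))

def rollout(W, x0, steps):
    """Autoregressive rollout: every step is conditioned on the model's
    OWN previous output (self-forcing), not on ground-truth frames."""
    xs = [x0]
    for _ in range(steps):
        xs.append(W @ xs[-1])
    return xs

lr, steps = 0.05, 8
for _ in range(2000):
    x0 = rng.standard_normal(2)
    traj = rollout(W_student, x0, steps)
    # Match the teacher's one-step prediction at each of the student's
    # own states (gradients through the history are truncated).
    grad = np.zeros_like(W_student)
    for x, y in zip(traj[:-1], traj[1:]):
        err = y - W_teacher @ x      # student output vs. teacher target
        grad += np.outer(err, x)     # dMSE/dW for y = W @ x
    W_student -= lr * grad / steps

err_final = float(np.max(np.abs(W_student - W_teacher)))
```

Because the loss is evaluated on the student's own states rather than teacher rollouts, the student learns to correct the errors it will actually encounter at inference time, which is what keeps long self-conditioned rollouts from drifting.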
The top row shows the input first frame, followed by generated frames along the specified camera trajectories.
Beyond video synthesis, OmniRoam's globally consistent panoramic outputs serve as a robust foundation for 3D scene reconstruction. By extracting multi-view perspective crops from the generated long-horizon sequences, we can apply 3D Gaussian Splatting (3DGS) to reconstruct 3D environments. Thanks to the inherent long-range spatial coherence of our generated videos, the resulting 3D scenes remain structurally consistent across viewpoints.
Left: reconstructed 3DGS; right: sampled panoramic video frames used for reconstruction.
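Extracting perspective crops from an equirectangular panorama, as needed to feed 3DGS, amounts to projecting pinhole-camera rays onto longitude/latitude. Below is a minimal nearest-neighbor sampler; the conventions (y-down, z-forward camera, longitude in [-pi, pi)) are our assumptions, not necessarily the paper's.

```python
import numpy as np

def perspective_crop(pano, fov_deg=90.0, yaw_deg=0.0, pitch_deg=0.0,
                     out_hw=(64, 64)):
    """Sample a pinhole-camera view from an equirectangular panorama
    (H, W, 3) via nearest-neighbor lookup."""
    H, W, _ = pano.shape
    h, w = out_hw
    f = 0.5 * w / np.tan(0.5 * np.radians(fov_deg))  # focal length in px
    # Pixel grid -> camera rays (x right, y down, z forward).
    ys, xs = np.mgrid[0:h, 0:w]
    rays = np.stack([xs - 0.5 * w, ys - 0.5 * h,
                     np.full_like(xs, f, dtype=float)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate rays by pitch (around x), then yaw (around y).
    p, yw = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(yw), 0, np.sin(yw)],
                   [0, 1, 0],
                   [-np.sin(yw), 0, np.cos(yw)]])
    d = rays @ (Ry @ Rx).T
    lon = np.arctan2(d[..., 0], d[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))    # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[v, u]

# Toy panorama: the red channel encodes longitude, so each yaw sees a
# different crop; yaw=0 looks at the panorama's center column.
pano = np.zeros((128, 256, 3))
pano[..., 0] = np.linspace(0, 1, 256)[None, :]
front = perspective_crop(pano, yaw_deg=0.0)
back = perspective_crop(pano, yaw_deg=180.0)
```

Sampling several yaw/pitch combinations per panoramic frame yields the multi-view perspective image set that a standard 3DGS optimizer consumes.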
@article{omniroam2026,
title = {OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation},
author = {Yuheng Liu and Xin Lin and Xinke Li and Baihan Yang and Chen Wang and Kalyan Sunkavalli and Yannick Hold-Geoffroy and Hao Tan and Kai Zhang and Xiaohui Xie and Zifan Shi and Yiwei Hu},
journal = {SIGGRAPH},
year = {2026},
}