1UC Irvine 2UC San Diego 3City University of Hong Kong 4University of Pennsylvania 5Adobe Research
Modeling scenes using video generation models has garnered growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, resulting in incomplete coverage and poor global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of the panoramic representation, enabling long-horizon scene wandering. Our framework begins with a preview stage, in which a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. In the subsequent refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets comprising both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of the framework, including real-time video generation and 3D reconstruction.
Given an input image or video along with a camera trajectory, the preview stage generates an 81-frame panoramic video at 480 × 960 resolution.
The refine stage temporally extends and spatially upsamples the preview output, producing 641-frame panoramic videos at 720 × 1440 resolution for high-fidelity scene wandering.
OmniRoam uses a two-stage, global-to-local pipeline for long-horizon panoramic video generation. The preview stage rapidly constructs a mid-resolution, accelerated overview by decoupling trajectory conditioning into orthogonal flow (direction) and scale (speed) components. The refine stage then transforms this preview into a high-fidelity, normal-speed video, using scale alignment and visibility masks as structural guidance to perform temporal extension and spatial upsampling segment by segment.
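The flow/scale decoupling of trajectory conditioning can be sketched for an explicit camera path. Below is a minimal illustration, assuming the trajectory is given as per-frame camera positions; the function name and representation are ours, not the paper's actual conditioning interface.

```python
import numpy as np

def decouple_trajectory(positions):
    """Split a camera trajectory into per-step direction (unit flow vectors)
    and scale (speed magnitudes).

    positions: (N, 3) array of camera positions, one per frame.
    Returns (directions, scales): (N-1, 3) and (N-1,) arrays.
    """
    deltas = np.diff(positions, axis=0)          # per-step displacement
    scales = np.linalg.norm(deltas, axis=1)      # speed component
    safe = np.where(scales > 1e-8, scales, 1.0)  # guard stationary steps
    directions = deltas / safe[:, None]          # direction component
    return directions, scales

# Toy trajectory: a straight line traversed with increasing speed.
pos = np.array([[0, 0, 0], [1, 0, 0], [3, 0, 0], [6, 0, 0]], dtype=float)
dirs, spd = decouple_trajectory(pos)
```

Separating the two components lets direction and speed be conditioned on (and edited) independently, which is the point of the decoupling.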
When a camera traverses a closed-loop trajectory and returns to its starting point, the final frames of the generated panoramic video must seamlessly match the initial frame. This principle of loop consistency is essential for evaluating long-horizon video generation, as it explicitly ensures strict spatial and temporal coherence across the entire sequence and prevents structural drift during extended scene exploration.
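One simple way to turn this principle into a number is to compare the final frames of a closed-loop video against the first frame, e.g. with PSNR. The sketch below is an illustrative metric of our own construction, not the paper's exact evaluation protocol.

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, peak]."""
    mse = np.mean((a - b) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def loop_consistency(frames, tail=5):
    """Average PSNR between the first frame and the last `tail` frames.
    High values mean the closed-loop trajectory returns to a view that
    matches the starting frame; low values indicate structural drift."""
    first = frames[0]
    return float(np.mean([psnr(first, f) for f in frames[-tail:]]))

rng = np.random.default_rng(0)
first = rng.random((8, 16, 3))
mid = [rng.random((8, 16, 3)) for _ in range(6)]
# A video that closes the loop: the last frames nearly repeat the first.
closed = [first] + mid + [first + 0.01 * rng.standard_normal((8, 16, 3))
                          for _ in range(5)]
# A video that has drifted: the last frames are unrelated to the first.
drifted = [first] + mid + [rng.random((8, 16, 3)) for _ in range(5)]

score_closed = loop_consistency(closed)
score_drifted = loop_consistency(drifted)
```

On this toy data the loop-closing video scores far higher (around 40 dB) than the drifted one, matching the intuition that drift shows up as a low end-of-loop score.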
CLIP Similarity over Loop Trajectories. We plot per-frame CLIP similarity to the first frame under loop trajectories for long-horizon generation (641 frames). Our method shows the expected trend: similarity decreases as the camera moves away and recovers as the trajectory closes the loop. In contrast, (a) the autoregressive variant exhibits a largely monotonic decline (drift from the initial view), and (b) the perspective-video variant shows weaker recovery with noticeable structural degradation. Representative frames are shown alongside the curves.
[Plot legends: (a) Autoregressive vs. Ours; (b) Perspective vs. Ours]
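Curves of this kind can be computed from per-frame image embeddings. The sketch below assumes the CLIP embeddings have already been computed (the model itself is omitted) and measures cosine similarity to the first frame; the synthetic circular walk merely stands in for a loop trajectory.

```python
import numpy as np

def similarity_to_first(embeddings):
    """Cosine similarity of each frame's embedding to the first frame's.
    embeddings: (N, D) array of per-frame image embeddings (e.g. CLIP)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return e @ e[0]

# Synthetic embeddings mimicking a loop trajectory: the embedding walks
# away from the start and returns as the camera closes the loop.
t = np.linspace(0.0, 2.0 * np.pi, 100)
emb = np.stack([np.cos(t), np.sin(t)], axis=1)   # walk on the unit circle
sim = similarity_to_first(emb)
# sim dips mid-trajectory and recovers toward 1 at loop closure.
```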
Quantitative Comparison. Our method consistently outperforms prior approaches across all evaluated metrics, achieving superior visual quality, stronger trajectory controllability, and higher loop consistency. FAED, SSIM, and LPIPS are evaluated on 81 frames, while loop consistency is reported on the full sequences.
Design Analysis. We analyze key design choices, including video representation (ours: panoramic vs. perspective) and generation strategy (ours: global-to-local vs. direct autoregressive). We report FAED, SSIM, LPIPS, and loop consistency over the full video sequences. For long videos (641 frames), we additionally report the average PSNR over extended temporal windows, specifically frames 610–615 (PSNR'615) and 630–635 (PSNR'635).
To achieve real-time panoramic video generation, OmniRoam employs a self-forcing distillation technique that compresses the full model into a lightweight autoregressive previewer. By conditioning each subsequent frame on the model's own predictions and matching the teacher model's output distribution, this approach drastically reduces generation time, producing an 81-frame sequence in seconds while preserving the overall scene structure.
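The core idea, conditioning on the model's own rollout while matching a teacher, can be shown with a deliberately tiny toy: a linear "next-frame" student distilled from a linear teacher along the student's own autoregressive trajectory. This is only an analogy of our own making; the actual method distills a diffusion video model, which this sketch does not attempt to reproduce.

```python
import numpy as np

rng = np.random.default_rng(42)

# Teacher: a fixed linear "next-frame" operator we want to distill.
W_teacher = np.array([[0.9, 0.1], [-0.1, 0.9]])
# Student: same architecture, randomly initialized weights.
W_student = 0.1 * rng.standard_normal((2, 2))

def rollout(W, x0, steps):
    """Autoregressive rollout: every step is conditioned on the model's
    OWN previous output (self-forcing), not on ground-truth frames."""
    xs = [x0]
    for _ in range(steps):
        xs.append(W @ xs[-1])
    return xs

lr, steps = 0.05, 8
for _ in range(2000):
    x0 = rng.standard_normal(2)
    traj = rollout(W_student, x0, steps)
    # Match the teacher's one-step prediction at each of the student's
    # own states (gradients through the history are truncated).
    grad = np.zeros_like(W_student)
    for x, y in zip(traj[:-1], traj[1:]):
        err = y - W_teacher @ x      # student output vs. teacher target
        grad += np.outer(err, x)     # dMSE/dW for y = W @ x
    W_student -= lr * grad / steps

err_final = float(np.max(np.abs(W_student - W_teacher)))
```

Because the loss is evaluated on the student's own states rather than teacher rollouts, the student learns to correct the errors it will actually encounter at inference time, which is what keeps long self-conditioned rollouts from drifting.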
The top row shows the input first frame, followed by generated frames along the specified camera trajectories.
Beyond video synthesis, OmniRoam's globally consistent panoramic outputs serve as a robust foundation for 3D scene reconstruction. By extracting multi-view perspective crops from the generated long-horizon sequences, we can apply 3D Gaussian Splatting (3DGS) to reconstruct 3D environments. Thanks to the inherent long-range spatial coherence of our generated videos, the resulting 3D scenes remain structurally consistent across viewpoints.
Left: reconstructed 3DGS; right: sampled panoramic video frames used for reconstruction.
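Extracting perspective crops from an equirectangular panorama, as needed to feed 3DGS, amounts to projecting pinhole-camera rays onto longitude/latitude. Below is a minimal nearest-neighbor sampler; the conventions (y-down, z-forward camera, longitude in [-pi, pi)) are our assumptions, not necessarily the paper's.

```python
import numpy as np

def perspective_crop(pano, fov_deg=90.0, yaw_deg=0.0, pitch_deg=0.0,
                     out_hw=(64, 64)):
    """Sample a pinhole-camera view from an equirectangular panorama
    (H, W, 3) via nearest-neighbor lookup."""
    H, W, _ = pano.shape
    h, w = out_hw
    f = 0.5 * w / np.tan(0.5 * np.radians(fov_deg))  # focal length in px
    # Pixel grid -> camera rays (x right, y down, z forward).
    ys, xs = np.mgrid[0:h, 0:w]
    rays = np.stack([xs - 0.5 * w, ys - 0.5 * h,
                     np.full_like(xs, f, dtype=float)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate rays by pitch (around x), then yaw (around y).
    p, yw = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(yw), 0, np.sin(yw)],
                   [0, 1, 0],
                   [-np.sin(yw), 0, np.cos(yw)]])
    d = rays @ (Ry @ Rx).T
    lon = np.arctan2(d[..., 0], d[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))    # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[v, u]

# Toy panorama: the red channel encodes longitude, so each yaw sees a
# different crop; yaw=0 looks at the panorama's center column.
pano = np.zeros((128, 256, 3))
pano[..., 0] = np.linspace(0, 1, 256)[None, :]
front = perspective_crop(pano, yaw_deg=0.0)
back = perspective_crop(pano, yaw_deg=180.0)
```

Sampling several yaw/pitch combinations per panoramic frame yields the multi-view perspective image set that a standard 3DGS optimizer consumes.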
@article{omniroam2026,
title = {OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation},
author = {Yuheng Liu and Xin Lin and Xinke Li and Baihan Yang and Chen Wang and Kalyan Sunkavalli and Yannick Hold-Geoffroy and Hao Tan and Kai Zhang and Xiaohui Xie and Zifan Shi and Yiwei Hu},
journal = {SIGGRAPH},
year = {2026},
}