Pyramid Diffusion for Fine 3D Large Scene Generation
Yuheng Liu1,2
Xinke Li3
Xueting Li4
Lu Qi5
Chongshou Li1
Ming-Hsuan Yang5,6
1Southwest Jiaotong University, 
2University of Leeds, 
3National University of Singapore, 
5University of California, Merced, 
6Google Research

Generative Models


Directly transferring 2D techniques to 3D scene generation is challenging due to the significant reduction in achievable resolution and the scarcity of comprehensive real-world 3D scene datasets. To address these issues, our work introduces the Pyramid Discrete Diffusion (PDD) model for 3D scene generation. This approach employs a multi-scale model that progressively generates high-quality 3D scenes from coarse to fine. In this way, PDD can generate high-quality scenes under limited resource constraints without requiring additional data sources. To the best of our knowledge, we are the first to adopt this simple but effective coarse-to-fine strategy for large 3D scene generation. Our experiments, covering both unconditional and conditional generation, demonstrate the model's effectiveness and robustness in generating realistic and detailed 3D scenes. Our code will be released publicly soon.


Our framework operates over three scales. A scene generated at the previous scale, after processing through our scale adaptive function, serves as the condition for the current scale. At the final scale, the scene from the previous scale is subdivided into four sub-scenes; after refinement, these are reassembled into a single large scene by our Scene Subdivision module.
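The two mechanical pieces of this pipeline, conditioning a finer scale on an upsampled coarse scene and splitting/reassembling sub-scenes, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and nearest-neighbor label repetition stands in for the actual scale adaptive function.

```python
import numpy as np

def scale_adaptive_upsample(coarse, factor=2):
    """Hypothetical stand-in for the scale adaptive function: each coarse
    semantic voxel label is repeated `factor` times along every axis, so the
    result can condition the diffusion model at the next, finer scale."""
    return coarse.repeat(factor, axis=0).repeat(factor, axis=1).repeat(factor, axis=2)

def subdivide(scene):
    """Split a scene into four sub-scenes along the two horizontal axes."""
    x, y = scene.shape[0] // 2, scene.shape[1] // 2
    return [scene[:x, :y], scene[:x, y:], scene[x:, :y], scene[x:, y:]]

def reassemble(subs):
    """Inverse of `subdivide`: stitch the four sub-scenes back together."""
    top = np.concatenate([subs[0], subs[1]], axis=1)
    bottom = np.concatenate([subs[2], subs[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)
```

In the actual model, each sub-scene would be refined by a conditional diffusion step before reassembly; here the round trip simply reproduces the input.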

Unconditional Generation on CarlaSC

We compare against two baseline models, DiscreteDiff and LatentDiff, and show samples from our models at different scales. Our method produces more diverse scenes than the baselines, and with more levels it synthesizes scenes with more intricate details.

Conditional Generation on CarlaSC

We also compare conditional 3D scene generation, benchmarking our method against discrete diffusion conditioned on unlabeled point clouds and on the same coarse scenes. As the figure shows, even against the more informative point-cloud condition, our method still performs better.

Computational Efficiency

The figure depicts GPU training time and memory requirements for our PDD under identical configurations; a logarithmic scale for training time emphasizes the efficiency gains of our method. The first training stage, PDD (s1), requires up to 100 times less time than training the full DD model, and it also minimizes GPU memory usage, broadening deployment to hardware with lower specifications. This efficiency extends to subsequent scales, with the final scale, PDD (s2), only requiring retraining at the smaller scales. Such an approach significantly cuts total training time and memory usage, highlighting the practical benefits of our pyramid training architecture.


The Pyramid Discrete Diffusion model shows improved scene generation quality after fine-tuning on SemanticKITTI data. Fine-tuning effectively adapts the model to the dataset's complex object distributions and scene dynamics, improving results in both generation scenarios. We also highlight that, despite the higher training cost of the Discrete Diffusion (DD) approach, our method outperforms DD even without fine-tuning, simply by using coarse scenes from SemanticKITTI. This demonstrates the strong cross-dataset transfer capability of our approach.


This figure demonstrates our model's ability to generate large-scale, coarse-grained scenes beyond standard dataset dimensions. This initial scale precedes a refinement process that adds detail to these expansive outdoor scenes. Our model produces continuous cityscapes without requiring additional inputs, making it possible, in principle, to generate unbounded scenes. The figure shows the generation process across scales: beginning with a coarse scene, a segment is progressively refined into a detailed 3D scene.
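The coarse-then-refine loop for scenes larger than a single model pass can be sketched as a tiling procedure: generate one large coarse grid, then refine it tile by tile. This is an illustrative sketch only; `refine_tiles` and its default upsampler are hypothetical stand-ins for a learned diffusion sampler conditioned on each coarse tile.

```python
import numpy as np

def refine_tiles(coarse_scene, tile=32, upscale=2, refine=None):
    """Refine a large coarse semantic voxel grid one tile at a time.

    `refine` stands in for a conditional diffusion sampler that maps a coarse
    tile to a finer one; the default is plain nearest-neighbor upsampling so
    the sketch runs end to end.
    """
    if refine is None:
        refine = lambda t: t.repeat(upscale, 0).repeat(upscale, 1).repeat(upscale, 2)
    X, Y, Z = coarse_scene.shape
    fine = np.zeros((X * upscale, Y * upscale, Z * upscale), coarse_scene.dtype)
    # Sweep non-overlapping tiles over the horizontal extent of the scene.
    for i in range(0, X, tile):
        for j in range(0, Y, tile):
            patch = refine(coarse_scene[i:i + tile, j:j + tile])
            fine[i * upscale:(i + tile) * upscale,
                 j * upscale:(j + tile) * upscale] = patch
    return fine
```

Because each tile is refined independently given its coarse condition, the horizontal extent of the scene is limited only by how large a coarse grid one is willing to generate.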

@article{liu2023pyramid,
  title   = {Pyramid Diffusion for Fine 3D Large Scene Generation},
  author  = {Yuheng Liu and Xinke Li and Xueting Li and Lu Qi and Chongshou Li and Ming-Hsuan Yang},
  journal = {arXiv preprint arXiv:2311.12085},
  year    = {2023}
}