Controllable 3D Outdoor Scene Generation via Scene Graphs

VIDEO

ABSTRACT

Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses, scene graphs, an accessible, user friendly control format to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs.

METHOD

The Scene Graph Guided 3D Generation structure consists of three main components: the interactive system (red), BEM processing (blue), and diffusion generation (bottom). Through the interactive system, users can construct their own Scene Graphs using either an interactive interface or text interaction. The constructed scene graph is processed by a GNN, which is jointly trained with the diffusion model using auxiliary tasks to enhance control. Each node in the Scene Graph is then positioned by the Allocation Module to form the BEM. This BEM serves as a conditioning input to the 3D Pyramid Discrete Diffusion Model, which generates the final 3D outdoor scene. Note that "Recon", "Classification", and "CANE" denote "Edge Reconstruction", "Node Classification", and "Context-aware Node Embedding", respectively.

RESULTS

Qualitative Results

Figure 1. Controlling 3D Outdoor Scene Generation with Scene Graphs. We compare baseline methods. Results show that our method generates scenes consistent with the provided scene graph, whereas the SG2Im and LLM approaches exhibit inconsistencies in object quantities and road types.

Figure 1 shows the 3D outdoor scenes generated separately using our method and baseline methods, based on three different scene graphs. The results demonstrate that our method effectively captures the object quantities specified in the scene graph and the road type information. In contrast, the scenes generated by the LLM and SG2Im methods show significant discrepancies in object counts across most categories, and the generated road types differ substantially from the intended configurations.

Quantitative Results

Table 1. Comparison of Different Conditioning Methods on 3D Outdoor Scene Generation. Uncon-Gen, SG2Im, and LLM represent Unconditional Generation, Scene Graph to Image, and Large Language Model, while M-Pole, M-Pede, and M-Vech represent the MAE calculated individually for Pole, Pedestrian, and Vehicle categories. In the Scene Quality Evaluation, higher mIoU and MA scores indicate better semantic consistency, while a lower F3D score signifies closer feature alignment with the original dataset. In the Control Capacity Evaluation, a lower MAE reflects a smaller discrepancy between the generated scene and the object quantities defined in the scene graph for conditioning. A higher Jaccard Index indicates greater alignment in the object categories between the generated scenes and the specified scene graph.

Table 1 compares our method with baselines. In Scene Quality, Uncon-Gen, LLM, and our method perform comparably, while SG2Im lags behind. Meanwhile, in Control Capacity, our method outperforms all baselines across metrics, achieving low MAE values below 1.0, demonstrating precise control over object quantities. In contrast, SG2Im has a higher MAE (0.97), and the LLM baseline yields 1.44, over twice our method’s 0.63, indicating a significant accuracy gap. Additionally, our method achieves a higher Jaccard Index, reflecting its effectiveness in capturing object categories from scene graphs across diverse scenes.

Figure 2. Diversity in Scene Generation. Comparison of three scenes generated by our method under the same scene graph. This demonstrates our method’s ability to produce varied yet consistent scenes based on identical input.

To validate that our method produces diverse outputs rather than strictly memorizing scenes based on the scene graph, we generate scenes three times using the same scene graph. The results are shown in Figure 2. The outcomes demonstrate that our method can generate varied scenes even when conditioned on the same scene graph, yet each generated scene remains consistent with the structural and categorical information provided in the scene graph. This confirms that our method introduces randomness in the generation process while maintaining alignment with the input scene graph.

ABLATION STUDY

Unconditional Proportion

Figure 3. Unconditional Proportion v.s. Generation Quality and Control. Evaluation mIoU, MA, Jaccard Index, and MAE as the unconditional proportion varies during diffusion training. Considering the trade-off between scene quality and control, we choose 0.1 as the balance point.

We examine the effect of the unconditional proportion in diffusion training. Results indicate that scene quality (mIoU and MA) improves as the unconditional proportion increases, with a noticeable bottleneck at 0.1. While further increases lead to marginal improvements in scene quality, they come at the cost of reduced control capacity, as reflected by worsening Jaccard Index and MAE. To balance scene quality and control capacity, we set the unconditional proportion to 0.1 in our experiments.

Effect of Auxiliary Tasks

Table 2. Impact of Auxiliary Tasks on Generation Performance. Comparison of MAE and Jaccard Index w/ and w/o edge reconstruction and node classification tasks in the GNN. Including both tasks yields the best alignment with the scene graph.

We evaluate the impact of adding edge reconstruction and node classification as auxiliary tasks to the GNN during joint training with the diffusion model. As shown in Table 2, both tasks yield the best performance, with a low MAE of 0.63 and a high Jaccard Index of 0.93. Removing either task leads to notable drops in performance, particularly in the Jaccard Index. Omitting both results in further declines. This show that both tasks contribute to improved alignment in scene generation.

Different Training Strategies

Figure 4. Impact of Different Training Strategies. Models trained with the second and last strategies exhibit issues such as vehicles positioned on sidewalks, overlapping objects, and inconsistencies in capturing object quantities. The third strategy generates semantically reasonable scenes but struggles with accurately matching object quantities and road types to the scene graph. In contrast, the first strategy produces high-quality scenes with good alignment to the input scene graph, thus we choose the first strategy to train our networks.

We explore alternative training strategies for our method: (a) pre-train the diffusion model, GNN, and localization head (LOC), then freeze GNN and LOC while fine-tuning the diffusion model; (b) end-to-end training of all components from scratch; (c) pre-train GNN and LOC, freeze their parameters, and train the diffusion model from scratch; and (d) jointly train the diffusion model and GNN from scratch, then freeze GNN and post-train LOC. As shown in Figure 4, strategy (d) achieves the best performance. Strategies (a) and (c) show semantic inconsistencies, while (b) generates scenes of reasonable quality but struggles with object quantity and road type alignment. Joint training of the diffusion model and GNN (d) allows the diffusion model to learn scene structure in sync with encoded features, while post-training LOC assigns precise object positions without disrupting learned structural relationships, achieving a balance between semantic coherence and quantity control.

User Study

Figure 5. DMOS Comparison of Scene Generation Methods. Our method aligns well with scene graph specifications.

We generate 100 pairs of scenes and conduct user studies with 20 subjects. Each user scores paired scenes based on object quantity, positioning, and road type accuracy relative to their scene graphs. The resulting Differential Mean Opinion Score (DMOS), shown in Figure 5, indicates that our method outperforms the baselines. Additionally, we conduct a one-tailed paired t-test on the MOS score difference among three methods. In this test, the null hypothesis is our generation method does not possess higher score than baseline methods. The results support the rejection of null hypothesis at a significance level of p < 10^-3, indicating that our method statistically performs better than both baselines with high confidence.

BibTex


    @article{liu2025controllable3doutdoorscene,
             title={Controllable 3D Outdoor Scene Generation via Scene Graphs},
             author={Yuheng Liu and Xinke Li and Yuning Zhang and Lu Qi and Xin Li and Wenping Wang and Chongshou Li and Xueting Li and Ming-Hsuan Yang},
             year={2025},
             booktitle={arXiv preprint arXiv:2503.07152}
    }