Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses scene graphs—an accessible, user-friendly control format—to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Layout, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs.
The Scene Graph Guided 3D Generation architecture consists of three main components: the interactive system (red), BEL (BEV Embedding Layout) processing (blue), and diffusion generation (bottom). Through the interactive system, users can construct their own scene graphs using either an interactive interface or text interaction. The constructed scene graph is processed by a GNN, which is jointly trained with the diffusion model using auxiliary tasks to enhance control. Each node in the scene graph is then positioned by the Allocation Module to form the BEL. This BEL serves as the conditioning input to the 3D Pyramid Discrete Diffusion Model, which generates the final 3D outdoor scene. Note that "Recon", "Classification", and "CANE" denote "Edge Reconstruction", "Node Classification", and "Context-aware Node Embedding", respectively.
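A minimal sketch of how a sparse scene graph could be turned into a dense BEL that conditions the diffusion model, under simplifying assumptions: the class names `GraphEncoder` and `AllocationModule`, the one-round mean message passing, and the 256x256 grid size are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Toy GNN: one round of mean message passing over scene-graph nodes."""
    def __init__(self, node_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Linear(node_dim, hidden_dim)
        self.update = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, node_dim), adj: dense (N, N) float adjacency matrix
        h = torch.relu(self.embed(node_feats))
        msg = adj @ h / adj.sum(-1, keepdim=True).clamp(min=1)  # mean over neighbors
        return torch.relu(self.update(torch.cat([h, msg], dim=-1)))

class AllocationModule(nn.Module):
    """Predicts a BEV position for each node and scatters its embedding there."""
    def __init__(self, hidden_dim, grid_size=256):
        super().__init__()
        self.loc_head = nn.Linear(hidden_dim, 2)   # normalized (x, y) in [0, 1]
        self.grid_size = grid_size

    def forward(self, node_emb):
        xy = torch.sigmoid(self.loc_head(node_emb))             # (N, 2)
        ij = (xy * (self.grid_size - 1)).long()
        bel = node_emb.new_zeros(self.grid_size, self.grid_size, node_emb.size(-1))
        bel[ij[:, 1], ij[:, 0]] = node_emb                      # sparse graph -> dense layout
        return bel  # (grid, grid, C): the BEL used to condition the diffusion model
```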
Figure 1. Controlling 3D Outdoor Scene Generation with Scene Graphs. We compare our method with the baseline GLM generation approach. Results show that our method generates scenes consistent with the provided scene graph, while the GLM approach exhibits inconsistencies in object quantities and road types. Objects missed by the baseline GLM are marked with red boxes.
Figure 1 shows the 3D outdoor scenes generated by our method and the baseline GLM method from three different scene graphs. The results demonstrate that our method faithfully captures both the object quantities and the road type information specified in the scene graph. In contrast, the scenes generated by the GLM method show significant discrepancies in object counts across most categories, and the generated road types differ substantially from the intended configurations.
Table 1. Comparison of Different Conditioning Methods on 3D Outdoor Scene Generation. Uncon-Gen and GLM denote Unconditional Generation and Graph Language Model, respectively, while M-Pole, M-Pede, and M-Vech denote the MAE computed individually for the Pole, Pedestrian, and Vehicle categories. In the Scene Quality evaluation, higher mIoU and MA scores indicate better semantic consistency, while a lower F3D score signifies closer feature alignment with the original dataset. In the Quantity Matching evaluation, a lower MAE reflects a smaller discrepancy between the object quantities in the generated scene and those specified by the conditioning scene graph, and a higher Jaccard Index indicates greater overlap between the object categories in the generated scenes and those in the scene graph.
Table 1 compares our method with the baseline methods. In Scene Quality, all three methods produce high-quality scenes with largely comparable performance. In the Quantity Matching evaluation, however, our method consistently outperforms the baselines across all metrics. In particular, it achieves MAE values below 1.0 for the three representative categories (Pole, Pedestrian, and Vehicle), indicating precise control over the quantities of specific objects in the generated scenes. In contrast, the GLM baseline yields an overall MAE of 1.44, more than twice our method's 0.63, a substantial gap in accuracy. Our method also achieves a higher Jaccard Index, reflecting its ability to capture most of the object categories specified in the scene graph across a wide range of scenes.
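For concreteness, a sketch of the two quantity-matching metrics as they are most naturally computed, assuming MAE is taken over per-category object counts and the Jaccard Index over the sets of categories present; the authors' exact evaluation protocol may differ.

```python
def quantity_mae(graph_counts: dict, scene_counts: dict) -> float:
    """Mean absolute error between object counts requested by the scene graph
    and counts found in the generated scene, averaged over graph categories."""
    cats = graph_counts.keys()
    return sum(abs(graph_counts[c] - scene_counts.get(c, 0)) for c in cats) / max(len(cats), 1)

def category_jaccard(graph_counts: dict, scene_counts: dict) -> float:
    """Intersection-over-union of the category sets in the graph and the scene."""
    g = {c for c, n in graph_counts.items() if n > 0}
    s = {c for c, n in scene_counts.items() if n > 0}
    return len(g & s) / max(len(g | s), 1)

# Toy example: the graph asks for 3 vehicles, 2 poles, 1 pedestrian.
print(quantity_mae({"vehicle": 3, "pole": 2, "pedestrian": 1},
                   {"vehicle": 2, "pole": 2}))        # -> 0.666...
print(category_jaccard({"vehicle": 3, "pole": 2, "pedestrian": 1},
                       {"vehicle": 2, "pole": 2}))    # -> 0.666...
```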
Figure 2. Diversity in Scene Generation. Comparison of three scenes generated by our method under the same scene graph. This demonstrates our method’s ability to produce varied yet consistent scenes based on identical input.
To verify that our method produces diverse outputs rather than simply memorizing scenes associated with a given scene graph, we generate scenes three times using the same scene graph. The results are shown in Figure 2. Our method generates varied scenes even when conditioned on the same scene graph, yet each generated scene remains consistent with the structural and categorical information the graph provides. This confirms that the generation process introduces randomness while maintaining alignment with the input scene graph.
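A small sketch of this diversity check: sampling a conditional diffusion model several times with the same BEL condition but different random seeds should yield distinct yet graph-consistent scenes. The `diffusion.sample` interface here is a placeholder, not the paper's API.

```python
import torch

def sample_variants(diffusion, bel, num_samples=3):
    scenes = []
    for seed in range(num_samples):
        torch.manual_seed(seed)                       # different noise init per sample
        scenes.append(diffusion.sample(condition=bel))  # same scene-graph condition
    return scenes
```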
Figure 3. Unconditional Proportion vs. Evaluation Metrics. mIoU, MA, Jaccard Index, and MAE as the unconditional proportion varies during diffusion training.
We examine the effect of the unconditional proportion in diffusion training, as shown in Figure 3. Increasing the unconditional proportion improves mIoU and MA, but the gains plateau once the proportion reaches 0.1, suggesting that while a higher proportion helps generate semantically accurate scenes, further increases yield little additional quality. Beyond 0.1, however, both the Jaccard Index and MAE worsen, because excessive unconditional training weakens adherence to the scene graph's structure and thus control over object quantities and categories. We therefore set the unconditional proportion to 0.1 in all experiments.
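A sketch of how an unconditional proportion can be applied during diffusion training, in the style of classifier-free-guidance conditioning dropout. The 0.1 value follows the paper; the training-step interface and the learned null condition are assumptions for illustration.

```python
import torch

UNCOND_PROP = 0.1  # value used in the experiments; larger values hurt Jaccard Index / MAE

def training_step(diffusion, scene_voxels, bel, null_bel):
    # With probability UNCOND_PROP, replace the scene-graph condition with a
    # "null" condition so the model also learns unconditional generation.
    cond = null_bel if torch.rand(()) < UNCOND_PROP else bel
    return diffusion.loss(scene_voxels, condition=cond)
```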
Table 2. Impact of Auxiliary Tasks on Generation Performance. Comparison of MAE and Jaccard Index w/ and w/o edge reconstruction and node classification tasks in the GNN. Including both tasks yields the best alignment with the scene graph.
We evaluate the impact of adding edge reconstruction and node classification as auxiliary tasks to the GNN during joint training with the diffusion model. As shown in Table 2, including both tasks yields the best performance, with a low MAE of 0.63 and a high Jaccard Index of 0.93. Removing either task leads to a notable drop in performance, particularly in the Jaccard Index, and omitting both degrades results further. This shows that both tasks contribute to improved alignment in scene generation.
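A minimal sketch of how the two auxiliary objectives could be attached to the GNN embeddings, assuming dot-product edge reconstruction and a linear node classifier; the loss weights `w_edge` and `w_node` are illustrative, not the paper's values.

```python
import torch
import torch.nn.functional as F

def auxiliary_losses(node_emb, adj, node_labels, node_classifier, w_edge=1.0, w_node=1.0):
    # Edge reconstruction: predict adjacency from node-embedding similarity.
    edge_logits = node_emb @ node_emb.t()                        # (N, N)
    loss_edge = F.binary_cross_entropy_with_logits(edge_logits, adj.float())
    # Node classification: recover each node's semantic category.
    loss_node = F.cross_entropy(node_classifier(node_emb), node_labels)
    return w_edge * loss_edge + w_node * loss_node

# Joint objective (schematically): total_loss = diffusion_loss + auxiliary_losses(...)
```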
Figure 4. Impact of Different Training Strategies. Models trained with the second and last strategies exhibit issues such as vehicles positioned on sidewalks, overlapping objects, and inconsistencies in capturing object quantities. The third strategy generates semantically reasonable scenes but struggles to match object quantities and road types to the scene graph. In contrast, the first strategy produces high-quality scenes that align well with the input scene graph; we therefore adopt it to train our networks.
We explore alternative training strategies for our method: (a) pre-train the diffusion model, GNN, and localization head (LOC), then freeze the GNN and LOC while fine-tuning the diffusion model; (b) train all components end-to-end from scratch; (c) pre-train the GNN and LOC, freeze their parameters, and train the diffusion model from scratch; and (d) jointly train the diffusion model and GNN from scratch, then freeze the GNN and post-train the LOC. As shown in Figure 4, strategy (d) achieves the best performance. Strategies (a) and (c) show semantic inconsistencies, while (b) generates scenes of reasonable quality but struggles with object quantity and road type alignment. Joint training of the diffusion model and GNN in (d) lets the diffusion model learn scene structure in step with the encoded graph features, while post-training the LOC assigns precise object positions without disrupting the learned structural relationships, striking a balance between semantic coherence and quantity control.
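A schematic sketch of strategy (d) as a two-stage loop: joint optimization of the diffusion model and GNN, followed by freezing the GNN and post-training the LOC. The module methods (`loss_with_graph`, `localization_loss`) and step counts are placeholders for the components described above, not the authors' implementation.

```python
import itertools
import torch

def train_strategy_d(diffusion, gnn, loc, loader, steps_joint, steps_loc, lr=1e-4):
    # Stage 1: jointly train diffusion model + GNN (with the auxiliary tasks).
    opt = torch.optim.Adam(itertools.chain(diffusion.parameters(), gnn.parameters()), lr=lr)
    for _, batch in zip(range(steps_joint), loader):
        loss = diffusion.loss_with_graph(batch, gnn)   # diffusion + auxiliary terms
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: freeze the GNN, post-train the LOC to assign precise object positions.
    for p in gnn.parameters():
        p.requires_grad_(False)
    opt_loc = torch.optim.Adam(loc.parameters(), lr=lr)
    for _, batch in zip(range(steps_loc), loader):
        loss = loc.localization_loss(batch, gnn)       # positions only; structure stays fixed
        opt_loc.zero_grad()
        loss.backward()
        opt_loc.step()
```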
Figure 5. DMOS Comparison of Scene Generation Methods. Our method aligns well with scene graph specifications.
We generate 100 pairs of scenes and conduct a user study with 20 subjects. Each subject scores the paired scenes on object quantity, positioning, and road type accuracy relative to the corresponding scene graphs. The resulting Differential Mean Opinion Score (DMOS), shown in Figure 5, indicates that our method outperforms the baselines. In addition, we conduct a one-tailed paired t-test on the MOS differences among the three methods, with the null hypothesis that our method does not achieve a higher score than the baseline methods. The results support rejecting the null hypothesis at a significance level of $p < 10^{-3}$, indicating with high confidence that our method statistically outperforms both baselines.
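For reference, a sketch of such a significance test: a one-tailed paired t-test on per-scene MOS differences using SciPy. The arrays `ours` and `base` below are synthetic placeholders standing in for the 100 paired user-study scores of one baseline comparison, not the study's data.

```python
import numpy as np
from scipy import stats

def one_tailed_paired_ttest(mos_ours, mos_baseline):
    # H0: our method does not score higher than the baseline (mean difference <= 0).
    t_stat, p_two_sided = stats.ttest_rel(mos_ours, mos_baseline)
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return t_stat, p_one_sided

rng = np.random.default_rng(0)
ours = rng.normal(4.2, 0.5, 100)   # toy scores for illustration only
base = rng.normal(3.6, 0.5, 100)
print(one_tailed_paired_ttest(ours, base))   # reject H0 when the one-sided p < 1e-3
```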