GenAssets: Generating in-the-wild 3D Assets in Latent Space

CVPR 2025

Ze Yang, Jingkang Wang, Haowei Zhang, Sivabalan Manivasagam, Yun Chen, Raquel Urtasun


High-quality 3D assets for traffic participants are critical for multi-sensor simulation, which is essential for the safe end-to-end development of autonomy. Building assets from in-the-wild data is key for diversity and realism, but existing neural-rendering-based reconstruction methods are slow and generate assets that render well only from viewpoints close to the original observations, limiting their usefulness in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Our method first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a diffusion model that operates within the latent space. We show that our method outperforms existing reconstruction- and generation-based methods, unlocking diverse and scalable content creation for simulation.

Video


Motivation

Simulation enables safe, controlled, efficient and scalable autonomy development in long-tail and safety-critical scenarios. Realistic sensor simulation (e.g., camera, LiDAR) requires high-quality 3D assets that accurately represent the geometry and appearance of real-world objects. Off-the-shelf simulation environments rely on manually created 3D assets, a slow and costly process that lacks real-world diversity. Reconstructing assets from real data is an alternative, but it suffers from artifacts at novel views and limited scalability, since each asset must be observed before it can be reconstructed. Recent 3D generative models improve diversity but rely on ground-truth 3D supervision or synthetic datasets to learn shape and appearance priors, and they struggle with real-world data, which is typically sparse, occluded, and noisy.

To enable more scalable asset creation, we propose GenAssets, a latent diffusion 3D generative model that learns directly from in-the-wild data. We tackle the challenges of asset completion and generation via a two-stage “reconstruct-then-generate” approach. In the first stage, we learn a low-dimensional object latent space that yields complete assets by training across multiple scenes via occlusion-aware neural rendering. In the second stage, we train a diffusion model on this 3D latent space to generate realistic assets, optionally conditioned on individual views, time of day, or fine-grained actor class. The resulting approach can reconstruct or generate 360° assets from in-the-wild data, producing diverse, high-quality 3D assets for realistic, scalable sensor simulation.


Method

We now describe how GenAssets works. GenAssets is a two-stage framework: in the first stage, we jointly learn a set of latent codes through occlusion-aware neural rendering across diverse scenes; in the second stage, we train a diffusion model to learn generative priors in this latent space, enabling the generation of realistic and diverse neural assets.

Learning Latent Asset Representation

We begin by learning a latent representation for each asset. Per-scene reconstruction methods train each scene separately, which limits their ability to resolve ambiguities in unseen regions and leads to poor generalization to novel viewpoints. While jointly training on all 3D scene representations in a dataset with tens of thousands of assets would improve generalization, it is computationally expensive and memory-intensive. To address these limitations, we propose learning a latent code for each actor, combined with a shared asset decoder that maps the latent code into the 3D representation. This latent bottleneck also encourages the model to learn shape and appearance priors, enabling it to infer occluded or unobserved regions from sparse observations.
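As a rough illustration of this idea, the sketch below pairs a library of per-actor latent codes with a shared decoder in PyTorch. The class name, layer sizes, and the NeRF-style (density, RGB) decoder output are illustrative assumptions; the paper's actual asset representation, decoder architecture, and occlusion-aware renderer may differ.

import torch
import torch.nn as nn

class LatentAssetLibrary(nn.Module):
    """Per-actor latent codes plus a shared decoder (illustrative sketch only)."""

    def __init__(self, num_actors: int, latent_dim: int = 256, hidden: int = 256):
        super().__init__()
        # One learnable latent code per actor observed in the dataset.
        self.codes = nn.Embedding(num_actors, latent_dim)
        # Shared decoder: (latent code, 3D query point) -> (density, RGB).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density channel + 3 color channels
        )

    def forward(self, actor_ids: torch.Tensor, points: torch.Tensor):
        # actor_ids: (B,) integer ids; points: (B, N, 3) queries in each actor's frame.
        z = self.codes(actor_ids)                              # (B, D)
        z = z.unsqueeze(1).expand(-1, points.shape[1], -1)     # (B, N, D)
        out = self.decoder(torch.cat([z, points], dim=-1))     # (B, N, 4)
        density = torch.relu(out[..., :1])
        rgb = torch.sigmoid(out[..., 1:])
        return density, rgb

In this sketch, the per-actor codes and the shared decoder would be optimized jointly with rendering losses across all scenes, so unobserved regions of one actor can be filled in using priors learned from the others.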

Learning Latent Asset Diffusion Model

With the learned asset latent library from the dataset, we aim to learn generative priors using a diffusion model. Diffusion models are probabilistic frameworks that model the data distribution by progressively denoising a Gaussian noise variable. In our approach, the diffusion model operates directly in the latent space and begins with Gaussian noise, progressively denoising it to recover the underlying latent distribution. Learning the diffusion process in the latent space offers key advantages for likelihood-based generative modeling by: (i) focusing on essential contents of the data and (ii) operating in a computationally efficient, compact space. After we learn the diffusion model, we can then sample from the learned diffusion priors for conditional or unconditional generation of the latent codes, which can then be decoded into neural assets. In the figure below, the left side shows training the diffusion model in latent space while the right shows sampling the diffusion model for (un)conditional asset generation.
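Below is a minimal DDPM-style sketch of training and sampling in the latent space. The linear noise schedule, the ancestral sampler, and the noise-prediction network eps_model(z_t, t) are illustrative placeholders, not the paper's exact formulation.

import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal fraction

def diffusion_loss(eps_model, z0):
    """One training step: add noise to clean asset latents z0 (B, D) and predict it."""
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alphas_bar.to(z0.device)[t].unsqueeze(-1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps   # forward (noising) process
    return F.mse_loss(eps_model(z_t, t), eps)

@torch.no_grad()
def sample(eps_model, shape, device="cpu"):
    """Ancestral sampling: start from Gaussian noise and denoise to an asset latent."""
    z = torch.randn(shape, device=device)
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        beta, a_bar = betas[i].to(device), alphas_bar[i].to(device)
        eps_hat = eps_model(z, t)
        z = (z - beta / (1.0 - a_bar).sqrt() * eps_hat) / (1.0 - beta).sqrt()
        if i > 0:
            z = z + beta.sqrt() * torch.randn_like(z)       # posterior noise
    return z   # decode with the shared asset decoder to obtain a neural asset

Conditioning signals such as actor class or time of day can be supplied as extra inputs to the noise predictor, as discussed in the conditional generation results below.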


3D Reconstruction and Novel View Synthesis

We conduct experiments on the PandaSet dataset which includes 103 driving scenes. We select 7 diverse scenes to evaluate performance, with the remaining 96 scenes for training. We evaluate our method across three challenging settings: (1) Sparse View Synthesis: Using every 10th frame for training and the remaining frames for testing, with both training and testing frames captured from the front camera. (2) Novel Camera Synthesis: Training on frames from the front camera and evaluating on frames from the front-left camera. (3) 360° View Synthesis: Rotating actors (0°–360°) to simulate various behaviors, evaluated on front camera views.
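For clarity, a minimal sketch of the sparse-view split described in setting (1); the 80-frame log length and indexing are purely illustrative.

# Every 10th front-camera frame is used for training, the rest for evaluation.
frame_ids = list(range(80))                      # e.g., one driving log
train_ids = frame_ids[::10]                      # ~10% of the frames
test_ids = [i for i in frame_ids if i not in train_ids]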

Sparse View Synthesis

We first compare our method against state-of-the-art approaches for sparse view synthesis. Unlike prior methods that use at least 50% of the frames for training, our setup is more challenging, relying on only 10% of the frames. We find that baseline methods struggle to learn robust geometry, leading to severe artifacts at novel viewpoints (e.g., missing, blurry, or distorted appearance), while GenAssets renders sensor data more accurately thanks to its compact latent space learned across many scenes.

Novel Camera Synthesis

We also compare our method against baselines for novel camera synthesis, where we train on frames from the front camera and evaluate on frames from the front-left camera, which captures significantly different viewpoints. Compared to the baselines, GenAssets can synthesize high-quality images under this novel camera view thanks to learned actor priors, while existing methods produce significant artifacts due to incomplete geometry and appearance.

360° View Synthesis

Sensor simulation requires not only simulating novel viewpoints but also synthesizing entirely new scenarios. To that end, we investigate a 360° view synthesis setting by rotating actors in the scene from 0° to 360°, simulating various scenarios. Through multi-scene training and latent space learning, our method effectively hallucinates unseen parts of the assets, whereas all baselines struggle to do so.
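As a rough illustration of this setting, the snippet below applies a yaw rotation to an actor's pose before re-rendering. The 4x4 actor-to-world pose and z-up yaw convention are assumptions; the paper's exact pose parameterization may differ.

import numpy as np

def rotate_actor_pose(actor_to_world: np.ndarray, yaw_deg: float) -> np.ndarray:
    """Rotate an actor about its local up axis before re-rendering (illustrative)."""
    c, s = np.cos(np.radians(yaw_deg)), np.sin(np.radians(yaw_deg))
    R = np.array([[c, -s, 0.0, 0.0],
                  [s,  c, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    # Right-multiplying rotates in the actor's local frame and keeps its position.
    return actor_to_world @ R

# Sweep the rotation to render the asset from 0 to 360 degrees, e.g.:
# poses = [rotate_actor_pose(pose, d) for d in range(0, 360, 30)]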


3D Generation

We now compare against SoTA generative models for unconditional 3D generation, and then present our results on conditional generation and single-image-to-3D generation.

Unconditional Generation

We compare our method against the SoTA GAN-based methods EG3D and DiscoScene, as well as the diffusion-based method SSDNeRF. GenAssets generates more diverse, complete, and higher-quality 3D assets than these SoTA methods.

Conditional Generation

The flexibility of our framework enables various conditional generation tasks. Specifically, we freeze the learned latent codes and train a conditional diffusion model using classifier-free guidance. We explore conditioning on fine-grained actor classes and time of day (day/night). Our method generates complete assets for a variety of classes, with intra-class shape and appearance variation, allowing us to control the generation process for diverse asset creation.
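As a sketch of how classifier-free guidance combines the conditional and unconditional predictions at sampling time: the conditional noise predictor eps_model(z_t, t, cond), the use of None as the null condition, and the guidance scale are hypothetical placeholders.

import torch

def cfg_eps(eps_model, z_t, t, cond, guidance_scale: float = 3.0):
    """Classifier-free guidance at one denoising step (illustrative sketch)."""
    # cond could be, e.g., a class or time-of-day embedding; during training the
    # condition is randomly dropped so the same model supports both branches.
    eps_cond = eps_model(z_t, t, cond)
    eps_uncond = eps_model(z_t, t, None)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)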

Single Image to 3D Generation

Our approach also enables generating 3D assets from single-view images using a rendering-guided denoising process, where the denoising gradient is directed to minimize a rendering loss against the observed image. Compared to the SoTA large 3D models MeshFormer, CRM, and InstantMesh, our approach produces higher-quality 360° completions and is more multi-view consistent. Because they rely on object-centric synthetic datasets for training, existing large 3D models tend to produce cartoonish results, especially for unobserved views.
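The snippet below sketches one rendering-guided denoising step: the clean latent implied by the current noisy latent is decoded, rendered, and compared against the observed image, and the resulting gradient steers the predicted noise. decode_and_render and the guidance weighting lam are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def rendering_guided_eps(z_t, i, eps_model, decode_and_render, observed,
                         alphas_bar, lam: float = 0.1):
    """One rendering-guided denoising step (illustrative sketch)."""
    z_t = z_t.detach().requires_grad_(True)
    a_bar = alphas_bar[i]
    with torch.enable_grad():
        t = torch.full((z_t.shape[0],), i, dtype=torch.long, device=z_t.device)
        eps_hat = eps_model(z_t, t)
        # Predict the clean latent implied by the current noisy latent.
        z0_hat = (z_t - (1.0 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
        # Hypothetical differentiable decode + render at the observed camera.
        loss = F.mse_loss(decode_and_render(z0_hat), observed)
    grad = torch.autograd.grad(loss, z_t)[0]
    # Steer the predicted noise so denoising moves toward the observed image.
    return eps_hat.detach() + lam * (1.0 - a_bar).sqrt() * grad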


Scalable Sensor Simulation

Simulating Asset Variations

We show that we are able to generate actor variations for scalable sensor simulation. For each example in a row, we show the original reconstructed scene, as well as the same scene layout with different class-conditioned generated assets rendered in. We can generate and simulate different actor classes and different times of day. Our simulated scenes are realistic and create new variations for autonomy development.

Simulating Extreme Scenario Variations

With GenAssets, we can simulate entirely new scenarios. For example, to explore a situation where a car in the nearby lane suddenly performs a sharp U-turn into your lane, we must render actors in significantly different poses than those seen in the original scene. GenAssets creates complete and high-quality 3D assets and enables more extreme scenario variations than previously possible.

Conclusion

In this work, we tackled the challenge of generating high-quality and complete assets from in-the-wild data captured by a moving sensor platform. Towards this goal, we developed a “reconstruct-then-generate” approach where we first learn to reconstruct foreground actors over multiple scenes with compositional scene neural rendering and encode them to a latent space. We then train a diffusion model to operate within this latent space to enable generation. We show our method generates high-quality, complete assets for actors such as vehicles and motorcycles, outperforming both per-scene reconstruction methods and generative models. We also show that our approach can be conditioned for controllable asset generation such as on sparse sensor data, actor class, and time of day, enabling diverse and scalable content creation for simulation.

BibTeX

@inproceedings{yang2025genassets,
  title     = {GenAssets: Generating in-the-wild 3D Assets in Latent Space},
  author    = {Ze Yang and Jingkang Wang and Haowei Zhang and Sivabalan Manivasagam and Yun Chen and Raquel Urtasun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
}