GenRe: Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction
ICRA 2026
Henry Che, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun
Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting their applicability for closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe efficiently produces a robust, high-fidelity representation that generalizes reliably to challenging unseen viewpoints (e.g., lane change). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.
TL;DR
We introduce GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within minutes, producing robust, high-fidelity reconstructions that render reliably at novel viewpoints.

Motivation
Realistic simulation is essential to test safety-critical self-driving systems in a safe and scalable manner. Neural rendering approaches such as 3D Gaussian Splatting (3DGS) achieve realistic reconstruction of urban driving scenes, but often overfit to the training trajectories, leading to significant artifacts and quality degradation when extrapolating beyond the original trajectory (e.g., meter-scale shifts). Recent works propose to train 2D neural fixers by fine-tuning diffusion models to correct artifacts at novel views, and then distill the improvements back into the 3D representation. However, these pipelines require hours of per-scene optimization and have difficulty scaling. The distilled representations remain fragile and usually generalize only to small synthesized viewpoint shifts.
We propose GenRe, a generalizable enhancer that takes any pre-trained 3D Gaussian representation and fixes the deficiencies within a few minutes. At the heart of GenRe are two modules: a one-step diffusion neural fixer (FNet) that predicts view-conditioned residuals at novel views, guided by geometry and appearance cues; and a generalizable 3D enhancer (ENet) that updates Gaussian parameters to enforce multi-view and geometric consistency while preserving fidelity along both recorded and novel trajectories.

Method
GenRe proceeds in three steps:
Step 1. Obtain Initial 3D Gaussian Representation: Any 3DGS-based reconstruction method (per-scene or generalizable) is used to obtain an initial representation.
Step 2. Fix Novel Viewpoints (FNet): We render at novel viewpoints (e.g., 3 m lateral shifts) and adopt a diffusion-based neural fixer to correct rendering artifacts. FNet takes the rendered view, conditions on a reference image and a rendered LiDAR map, and produces a fixed image using a single-step diffusion model (SD-Turbo).
Step 3. Enhance the Representation (ENet): A generalizable 3D enhancer predicts per-Gaussian residuals to update the 3D representation, enforcing multi-view consistency. ENet iteratively refines the representation using rendering-guided gradients, distilling the 2D corrections back into 3D.
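
The novel viewpoints in Step 2 are described as lateral shifts of the recorded cameras. A minimal sketch of how such a shifted pose could be computed is given below. It is an illustration only: the function name `shift_pose_laterally` is hypothetical, and the code assumes 4x4 camera-to-world matrices with an OpenCV-style camera frame (x right, y down, z forward), which this page does not specify.

```python
import numpy as np

def shift_pose_laterally(cam_to_world: np.ndarray, offset_m: float) -> np.ndarray:
    """Return a copy of a 4x4 camera-to-world pose moved sideways by offset_m.

    Assumption: OpenCV-style camera axes (x right, y down, z forward), so the
    first column of the rotation block is the camera's right axis expressed in
    world coordinates. The orientation is left unchanged.
    """
    shifted = cam_to_world.copy()
    right_axis = cam_to_world[:3, 0]
    shifted[:3, 3] += offset_m * right_axis
    return shifted

# Example: a camera at (10, 0, 1.5) with identity rotation, shifted 3 m to its right.
pose = np.eye(4)
pose[:3, 3] = [10.0, 0.0, 1.5]
novel_pose = shift_pose_laterally(pose, 3.0)
```

Only the translation changes, so the shifted camera keeps looking in the recorded direction; a negative offset moves it to the other side.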

2D Neural Fixer (FNet)
FNet takes a 3DGS-rendered view, conditions on the reference image and the rendered LiDAR map, and produces the fixed image. We fine-tune FNet from the pre-trained single-step diffusion model SD-Turbo.
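
This page does not say what the rendered LiDAR map contains. One plausible reading is a sparse depth image obtained by projecting LiDAR points into the target camera, and the sketch below illustrates that reading only. The function name `render_lidar_depth`, the pinhole intrinsics `K`, and the choice of depth as the stored value are all assumptions, not details of FNet.

```python
import numpy as np

def render_lidar_depth(points_cam: np.ndarray, K: np.ndarray,
                       height: int, width: int) -> np.ndarray:
    """Project LiDAR points of shape (N, 3), given in the camera frame, to a sparse depth map.

    Assumptions: pinhole camera with intrinsics K, z pointing forward.
    Pixels without a LiDAR return are 0; if several points fall on the
    same pixel, the nearest one is kept.
    """
    z = points_cam[:, 2]
    in_front = z > 0                      # drop points behind the camera
    pts, z = points_cam[in_front], z[in_front]
    uv = (K @ pts.T).T                    # homogeneous pixel coordinates
    u = np.floor(uv[:, 0] / z).astype(int)
    v = np.floor(uv[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside].astype(np.float32)

    depth = np.full((height, width), np.inf, dtype=np.float32)
    np.minimum.at(depth, (v, u), z)       # unbuffered per-pixel minimum
    depth[np.isinf(depth)] = 0.0
    return depth

# Example with a 64x48 image and four points (two share a pixel, one is behind,
# one projects outside the image).
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 10.0], [0.0, 0.0, 5.0],
                   [0.0, 0.0, -1.0], [0.5, 0.0, 10.0], [100.0, 0.0, 1.0]])
lidar_map = render_lidar_depth(points, K, height=48, width=64)
```

`np.minimum.at` is used instead of plain indexed assignment because NumPy does not guarantee which value wins when the same pixel is written more than once.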

Generalizable 3D Enhancer (ENet)
ENet iteratively refines a 3DGS scene using rendering-guided gradients. At each iteration, ENet takes the current 3D Gaussians and per-Gaussian gradients (from rendering loss) and predicts residuals to update the scene. Source and novel views are compared with ground-truth and fixed targets to compute losses whose backpropagation gives the gradients for the next iteration.
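
The control flow of this loop can be sketched with a toy problem. Everything below is a stand-in chosen for illustration: a linear map replaces differentiable Gaussian rendering, a flat parameter vector replaces the per-Gaussian attributes, and `toy_enhancer` is a hand-written scaled negative gradient rather than the learned ENet. All function names are hypothetical.

```python
import numpy as np

def render(params, view_matrix):
    """Toy linear 'renderer' standing in for differentiable Gaussian splatting."""
    return view_matrix @ params

def loss_and_grad(params, views, targets):
    """Mean squared rendering error averaged over views, and its gradient."""
    loss, grad = 0.0, np.zeros_like(params)
    for A, target in zip(views, targets):
        err = render(params, A) - target
        loss += np.mean(err ** 2)
        grad += 2.0 * A.T @ err / err.size
    return loss / len(views), grad / len(views)

def toy_enhancer(params, grad, step=0.05):
    """Placeholder for the learned enhancer: maps (params, gradients) to residuals.

    Here it is just a scaled negative gradient; the real module is a trained network.
    """
    return -step * grad

def refine(params, views, targets, enhancer, num_iters=4):
    """Repeat: render, compute gradients, predict residuals, update parameters."""
    history = []
    for _ in range(num_iters):
        loss, grad = loss_and_grad(params, views, targets)
        history.append(loss)
        params = params + enhancer(params, grad)
    history.append(loss_and_grad(params, views, targets)[0])
    return params, history

rng = np.random.default_rng(0)
true_params = rng.normal(size=8)
# Two views: think of the first as a source view (ground-truth target)
# and the second as a novel view (fixed target from the 2D fixer).
views = [rng.normal(size=(16, 8)) for _ in range(2)]
targets = [A @ true_params for A in views]
init = true_params + 0.5 * rng.normal(size=8)
refined, history = refine(init, views, targets, toy_enhancer)
```

The point of the sketch is the interface: the enhancer sees both the current parameters and the gradients of the rendering loss, and returns an additive update, so a learned module can replace the fixed step used here.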

By combining a generalizable reconstruction module (GNet), the 2D fixer (FNet), and the 3D enhancer (ENet), we obtain GenRe+, a robust and scalable pipeline for urban scene reconstruction. GNet reconstructs the base scene representation from sensory data; FNet corrects artifacts at novel-view renderings; and ENet distills these corrections back into the 3D representation.
Results
We evaluate GenRe on PandaSet, a self-driving dataset with diverse urban scenes. GenRe outperforms state-of-the-art methods in both quality and efficiency under challenging novel viewpoints while maintaining competitive performance along the recorded trajectories.
Comparison to Reconstruction Methods
We visualize GenRe's ability to fix rendering artifacts at novel viewpoints. The top row shows the original 3DGS rendering at the recorded trajectory and at a 3 m lateral shift, where significant degradation is visible. The bottom row shows the corresponding GenRe results. GenRe produces high-quality renderings at novel lane shifts while preserving fidelity along the original trajectory, and can plausibly complete previously unobserved regions.

We further compare against recent state-of-the-art methods that combine reconstruction with 2D neural fixers, including Difix3D (CVPR'25) and StreetCrafter (CVPR'25). While these methods produce degraded lane markings and blurry vehicles and require significantly longer processing times, GenRe delivers sharper, more realistic results in 2.8 minutes.

Comparison to 2D Neural Fixers
We further compare GenRe's neural fixer (FNet) against state-of-the-art 2D neural fixers. Our fixer produces sharper and more consistent results with fewer hallucinations.

Downstream Applications
Realistic Re-simulation
To test whether more robust reconstruction benefits downstream simulation, we emulate open-loop re-simulation by branching from the recorded trajectory and rendering along perturbed ego paths. GenRe yields higher quality rendering for all behaviors (braking, acceleration, lane change, swerving) compared to baselines.
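
As an illustration of branching from the recorded trajectory, the sketch below builds a lane-change-like path by ramping a lateral offset in smoothly after a chosen frame. The function names, the smoothstep ramp, and the parameters `branch_idx` and `blend_len` are assumptions made for this example; the page does not describe how the perturbed paths were generated.

```python
import numpy as np

def smoothstep(t):
    """Cubic ramp from 0 to 1 with zero slope at both ends."""
    t = np.clip(t, 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)

def lane_change_path(xy, branch_idx, lateral_offset_m, blend_len):
    """Offset a recorded 2D ego path sideways, starting at branch_idx.

    xy: (N, 2) recorded positions of a moving ego vehicle (no repeated points).
    A positive offset moves the path to the left of the heading direction.
    blend_len: number of frames over which the offset ramps from 0 to full.
    """
    xy = np.asarray(xy, dtype=float)
    # Heading from finite differences; the left normal is the heading rotated by +90 degrees.
    tangent = np.gradient(xy, axis=0)
    tangent /= np.linalg.norm(tangent, axis=1, keepdims=True)
    left = np.stack([-tangent[:, 1], tangent[:, 0]], axis=1)
    ramp = smoothstep((np.arange(len(xy)) - branch_idx) / blend_len)
    return xy + (lateral_offset_m * ramp)[:, None] * left

# Example: a straight recorded path along +x with 1 m spacing.
recorded = np.stack([np.arange(20.0), np.zeros(20)], axis=1)
perturbed = lane_change_path(recorded, branch_idx=5, lateral_offset_m=3.0, blend_len=10)
```

Frames up to the branch point coincide with the recording, and frames after the ramp sit a full offset to the side, so the same function covers both the replayed and the extrapolated portion of a log.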

GenRe supports diverse variants for reactive log replay, such as dynamic actor removals, actor insertions, and actor manipulation, with high-quality, realistic rendering.

Downstream Perception
GenRe shows a minimal domain gap for detection and segmentation, demonstrating that the enhanced representations are suitable for downstream perception tasks.

Conclusion
We introduce GenRe, a diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pre-trained 3D Gaussian representation and fixes the deficiencies within a few minutes in a generalizable manner. At the heart of GenRe are two modules: a one-step diffusion neural fixer that fixes degraded rendered images and a generalizable enhancer that predicts per-Gaussian residuals to enhance the representation at novel views. Additionally, we show that by adapting the enhancer for scene reconstruction from scratch, we obtain a generalizable reconstruction model that can robustly reconstruct the scene within 60 s. Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.
@inproceedings{che2026genre,
title={Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction},
author={Henry Che and Jingkang Wang and Yun Chen and Ze Yang and Sivabalan Manivasagam and Raquel Urtasun},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2026},
}