
G3R: Gradient Guided Generalizable Reconstruction
ECCV 2024
Yun Chen*, Jingkang Wang*, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun
Gradient Guided Generalizable Reconstruction (G3R): Our method learns a single reconstruction network that takes multi-view camera images and an initial point set to predict the 3D representation for large scenes (>10,000 m²) in two minutes or less, enabling realistic and real-time camera simulation.
Reconstruction of large real-world scenes from sensor data, such as urban traffic scenarios, is a long-standing problem in computer vision. Scene reconstruction enables applications such as virtual reality and camera simulation, where robots such as autonomous vehicles can learn and be evaluated safely at scale. To be effective, the 3D reconstructions must achieve high photorealism at novel views, be efficient to generate, enable scene manipulation, and support real-time image rendering.
Neural rendering approaches such as NeRF and 3DGS have achieved realistic reconstructions of large scenes. However, they require a costly per-scene optimization process that may take several hours to reach high quality. Moreover, they typically focus on the novel view synthesis (NVS) setting where the target view is close to the source views, and often exhibit artifacts when the viewpoint changes are large (e.g., meter-scale shifts), as they can overfit to the input images without learning the true underlying 3D representation. To enable faster reconstruction and better performance at novel views, recent works aim to train a single generalizable network across multiple scenes that predicts the representation for unseen scenes in a zero-shot manner (generalizable NVS or LRMs). These methods utilize an encoder to predict the scene representation by aggregating image features extracted from multiple source views according to camera and geometry priors, and then decode the representation. However, they are primarily applied to objects or small scenes, since large scenes are difficult to predict accurately from a single-step network prediction. Furthermore, the computational resources and memory needed to utilize many input scene images (>100) are prohibitive.
Therefore, we aim to build a method that enables fast, generalizable, and robust reconstruction of large scenes. We propose Gradient Guided Generalizable Reconstruction (G3R) to create modifiable digital twins (a set of 3D Gaussian primitives) of large real-world scenes in two minutes or less, which can be directly used for high-fidelity simulation at interactive frame rates. The key idea is to combine data-driven priors from fast prediction methods (a) with the iterative gradient feedback signal from per-scene optimization methods (b) by learning to optimize for large-scene reconstruction.
We now review G3R's methodology. G3R takes a sequence of LiDAR and camera data and builds a modifiable digital-twin representation of an unknown large scene within a few minutes.
G3R models generalizable reconstruction as an iterative process, where the 3D representation is iteratively refined with a reconstruction network. We first lift the source 2D images to 3D space by backpropagating through the rendering procedure to get the gradients w.r.t. the representation (blue arrow). Then the reconstruction network takes the 3D representation, the gradient, and the iteration step as input, and predicts an updated 3D representation. We render at source and novel views and compute the loss. The backward gradient flow for training is highlighted with dashed blue arrows.
We first introduce our scene representation designed for generalizable large-scene reconstruction. 3DGS represents the scene with a set of 3D Gaussians and achieves state-of-the-art performance. However, explicit 3D Gaussians are difficult to predict directly with a single network due to limited modeling capacity. Instead of using explicit 3D Gaussians (position, scale, color, opacity) like 3DGS, we propose to augment each 3D Gaussian with a latent feature vector (3D neural Gaussians) to provide additional representation capacity and ease the prediction. We then convert the 3D neural Gaussians to a set of explicit colored 3D Gaussians for rendering using an MLP.
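To make this concrete, below is a minimal PyTorch sketch of how a shared MLP could decode each 3D neural Gaussian's latent feature into explicit Gaussian attributes. The latent dimension, layer sizes, and attribute split are illustrative assumptions, not the exact architecture used in G3R.

# Minimal sketch (assumptions noted above): a small shared MLP decodes each
# per-Gaussian latent into the explicit attributes used for rasterization.
import torch
import torch.nn as nn

class NeuralGaussianDecoder(nn.Module):
    def __init__(self, latent_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        # Decode latent -> (position offset 3, scale 3, rotation 4, color 3, opacity 1)
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3 + 3 + 4 + 3 + 1),
        )

    def forward(self, positions: torch.Tensor, latents: torch.Tensor):
        out = self.mlp(latents)
        d_pos, scale, rot, color, opacity = out.split([3, 3, 4, 3, 1], dim=-1)
        return {
            "means": positions + d_pos,                                 # refined Gaussian centers
            "scales": torch.exp(scale),                                 # positive scales
            "rotations": nn.functional.normalize(rot, dim=-1),          # unit quaternions
            "colors": torch.sigmoid(color),                             # RGB in [0, 1]
            "opacities": torch.sigmoid(opacity),                        # alpha in [0, 1]
        }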
To handle dynamic unbounded scenes, we decompose the scene and its set of 3D neural Gaussians into a static background, a set of dynamic actors and a distant region (e.g., far-away buildings and sky). We initialize the 3D neural Gaussians with LiDAR or multi-view stereo points.
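One way to picture this decomposition is as a simple container holding the three groups of neural Gaussians; the class names and fields below are hypothetical and only illustrate the structure.

# Hypothetical container mirroring the background / actors / distant split.
from dataclasses import dataclass, field
from typing import Dict
import torch

@dataclass
class NeuralGaussianSet:
    positions: torch.Tensor   # (N, 3) Gaussian centers, initialized from LiDAR / MVS points
    latents: torch.Tensor     # (N, latent_dim) per-Gaussian latent features

@dataclass
class DecomposedScene:
    background: NeuralGaussianSet                       # static scene geometry
    distant: NeuralGaussianSet                          # far-away buildings and sky
    actors: Dict[str, NeuralGaussianSet] = field(default_factory=dict)
    # Actors are kept in their own object frame so they can be re-posed
    # into the world frame at each timestep for controllable simulation.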
One major problem in scene reconstruction is how to lift 2D images to 3D. Existing approaches usually aggregate image features according to geometry priors, which cannot take many source images due to high memory usage. Instead, we propose to lift 2D images to 3D space by "rendering and backpropagating" to obtain gradients w.r.t. the 3D representation.
Specifically, given the 3D representation, we first render the scene at the source input views. Then, we compare the rendered images with the inputs, compute the reconstruction loss, and backpropagate the difference to the 3D representation. The 3D gradients can efficiently aggregate as many images as needed, take the rendering procedure into account, naturally handle occlusions, and are fast to compute.
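The sketch below illustrates this "render and backpropagate" lifting with PyTorch autograd. Here rasterize_gaussians stands in for any differentiable Gaussian rasterizer passed by the caller; it is an assumption for illustration, not a specific library call or the paper's exact implementation.

# Sketch: lift 2D image evidence to 3D via rendering gradients.
import torch
import torch.nn.functional as F

def lift_images_to_3d(gaussian_params, cameras, target_images, rasterize_gaussians):
    """Return per-Gaussian gradients aggregating evidence from all source views."""
    params = gaussian_params.detach().requires_grad_(True)
    loss = 0.0
    for camera, target in zip(cameras, target_images):
        rendered = rasterize_gaussians(params, camera)   # differentiable rendering
        loss = loss + F.l1_loss(rendered, target)        # photometric reconstruction loss
    loss.backward()
    # The gradient w.r.t. the 3D representation is the "lifted" 2D signal:
    # it accounts for the rendering procedure and occlusions automatically.
    return params.grad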
Given the initial 3D neural Gaussians and gradients, we iteratively refine the scene representation. At each step, we take the current 3D representation, use it as a proxy to compute the gradient, and pass both to the reconstruction network to produce an updated representation.
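Putting the pieces together, the refinement loop might look like the following sketch, which reuses the lift_images_to_3d helper from above; recon_net, the integer step input, and the number of iterations are illustrative assumptions.

# Sketch of the iterative refinement loop: the learned network turns raw
# rendering gradients into an effective update of the representation,
# playing the role of many hand-tuned optimizer iterations.
def reconstruct(initial_params, cameras, images, recon_net, rasterize_gaussians, num_steps=24):
    params = initial_params
    for step in range(num_steps):
        grads = lift_images_to_3d(params, cameras, images, rasterize_gaussians)
        params = recon_net(params.detach(), grads, step)   # predict updated 3D neural Gaussians
    return params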
The network is trained using both source images and novel target images to increase the robustness at test time for scene reconstruction.
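As a rough illustration of such an objective, the snippet below supervises renderings at both source and held-out novel views; the L1 photometric loss and the novel-view weight are assumptions, and the actual training may use additional perceptual or regularization terms.

# Sketch of a training objective over source and novel views.
import torch.nn.functional as F

def training_loss(params, src_cams, src_imgs, novel_cams, novel_imgs,
                  rasterize_gaussians, w_novel=1.0):
    loss = 0.0
    for cam, img in zip(src_cams, src_imgs):
        loss = loss + F.l1_loss(rasterize_gaussians(params, cam), img)
    for cam, img in zip(novel_cams, novel_imgs):
        # Supervising held-out views encourages robustness to viewpoint changes.
        loss = loss + w_novel * F.l1_loss(rasterize_gaussians(params, cam), img)
    return loss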
We now show results highlighting the capabilities of G3R. First, G3R can reconstruct unseen urban driving scenes in 2 minutes or less and achieve high photorealism. (left: real log, right: G3R reconstruction).
Our method works on a variety of scenes with diverse backgrounds, traffic participants, and lighting conditions. The predicted representation can be rendered in real time at over 100 FPS.
G3R also supports multi-camera simulation. Here we show videos from all six cameras used in PandaSet.
We can do this at scale on a large variety of logs, in city or suburban areas, and in different lighting conditions, including night.
We can also use G3R to simulate images under different sensor configurations. Our reconstructed representation enables 360-degree rendering. Here we demonstrate the capability of generating panorama images.
G3R predicts an explicit and editable representation which allows for controllable camera simulation such as sensor and actor shifts. We can change the SDV’s position at different points in time, manipulate actor locations, and render multi-camera and panorama images.
We further consider a more challenging dataset, BlendedMVS, where the view and camera orientation changes are large. G3R can reconstruct these scenes in 3.5 minutes or less and produces high-quality renderings in real time.
We further demonstrate that G3R results in a robust 3D Gaussian prediction compared to 3DGS on a more challenging extrapolation setting where we select 20 consecutive frames as source views and simulate the future 3 frames. 3DGS (left) has more noticeable artifacts in those extrapolated views compared to G3R (right).
Here are more examples. G3R's representation is more robust to extrapolated views.
Finally, we showcase G3R's ability to generalize to driving scenes in the Waymo Open Dataset. G3R, trained only on PandaSet, generalizes well across datasets, unveiling the potential for scalable real-world sensor simulation.
In this paper, we introduce G3R, a novel approach for efficient, generalizable, large-scale 3D scene reconstruction. By leveraging gradient feedback signals from differentiable rendering, G3R achieves a speedup of at least 10x over state-of-the-art per-scene optimization methods, with comparable or superior photorealism. Importantly, our method predicts a standalone 3D representation that is robust to large view changes and enables real-time rendering, making it well suited for VR and simulation. Experiments on urban-driving and drone datasets showcase the efficacy of G3R for in-the-wild 3D scene reconstruction. Our learning-to-optimize paradigm with gradient signals can be applied to other 3D representations, such as triplanes with NeRF rendering, or to other inverse problems, such as generalizable surface reconstruction.
@inproceedings{chen2024g3r,
  title={G3R: Gradient Guided Generalizable Reconstruction},
  author={Yun Chen and Jingkang Wang and Ze Yang and Sivabalan Manivasagam and Raquel Urtasun},
  booktitle={European Conference on Computer Vision},
  year={2024},
}