
G3R: Gradient Guided Generalizable Reconstruction
ECCV 2024
Yun Chen*, Jingkang Wang*, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun
Gradient Guided Generalizable Reconstruction (G3R): Our method learns a single reconstruction network that takes multi-view camera images and an initial point set to predict the 3D representation for large scenes (>10,000 m²) in two minutes or less, enabling realistic and real-time camera simulation.
Reconstruction of large real-world scenes from sensor data, such as urban traffic scenarios, is a long-standing problem in computer vision. Scene reconstruction enables applications such as virtual reality and camera simulation, where robots such as autonomous vehicles can learn and be evaluated safely at scale. To be effective, the 3D reconstructions must achieve high photorealism at novel views, be efficient to generate, enable scene manipulation, and support real-time image rendering.
Neural rendering approaches such as NeRF and 3DGS have achieved realistic reconstructions of large scenes. However, they require a costly per-scene optimization process that may take several hours to reach high quality. Moreover, they typically focus on the novel view synthesis (NVS) setting where the target view is close to the source views, and often exhibit artifacts when the viewpoint changes are large (e.g., meter-scale shifts), as they can overfit to the input images without learning the true underlying 3D representation. To enable faster reconstruction and better performance at novel views, recent works aim to train a single generalizable network across multiple scenes that predicts the representation for unseen scenes in a zero-shot manner (generalizable NVS or LRMs). These methods utilize an encoder to predict the scene representation by aggregating image features extracted from multiple source views according to camera and geometry priors, and then decode the representation. However, they are primarily applied to objects or small scenes, since large scenes are difficult to predict accurately from a single-step network prediction. Furthermore, the computational resources and memory needed to utilize many input scene images (>100) are prohibitive.
Therefore, we aim to build a method that enables fast, generalizable, and robust reconstruction of large scenes. We propose Gradient Guided Generalizable Reconstruction (G3R) to create modifiable digital twins (a set of 3D Gaussian primitives) of large real-world scenes in two minutes or less, which can be directly used for high-fidelity simulation at interactive frame rates. The key idea is to combine data-driven priors from fast prediction methods (a) with the iterative gradient feedback signal from per-scene optimization methods (b) by learning to optimize for large-scene reconstruction.
We now review G3R's methodology. G3R takes a sequence of LiDAR and camera data and builds a modifiable digital-twin representation of an unknown large scene within a few minutes.
G3R models generalizable reconstruction as an iterative process, where the 3D representation is iteratively refined with a reconstruction network. We first lift the source 2D images to 3D space by backpropagating through the rendering procedure to get the gradients w.r.t. the representation (blue arrow). Then the reconstruction network takes the 3D representation, the gradient, and the iteration step as input, and predicts an updated 3D representation. We render at source and novel views and compute the loss. The backward gradient flow for training is highlighted with dashed blue arrows.
We first introduce our scene representation designed for generalizable large-scene reconstruction. 3DGS represents the scene with a set of 3D Gaussians and achieves state-of-the-art performance. However, explicit 3D Gaussians are difficult to predict directly with a single network due to limited modeling capacity. Instead of using explicit 3D Gaussians (position, scale, color, opacity) like 3DGS, we propose to augment each 3D Gaussian with a latent feature vector (3D neural Gaussians) to provide additional representation capacity and ease the prediction. We then convert the 3D neural Gaussians to a set of explicit colored 3D Gaussians for rendering using an MLP.
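To make this concrete, below is a minimal PyTorch sketch of how a shared MLP could decode each 3D neural Gaussian's latent feature into explicit Gaussian attributes. The latent dimension, layer sizes, and attribute split are illustrative assumptions, not the exact architecture used in G3R.

# Minimal sketch (assumptions noted above): a small shared MLP decodes each
# per-Gaussian latent into the explicit attributes used for rasterization.
import torch
import torch.nn as nn

class NeuralGaussianDecoder(nn.Module):
    def __init__(self, latent_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        # Decode latent -> (position offset 3, scale 3, rotation 4, color 3, opacity 1)
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3 + 3 + 4 + 3 + 1),
        )

    def forward(self, positions: torch.Tensor, latents: torch.Tensor):
        out = self.mlp(latents)
        d_pos, scale, rot, color, opacity = out.split([3, 3, 4, 3, 1], dim=-1)
        return {
            "means": positions + d_pos,                                 # refined Gaussian centers
            "scales": torch.exp(scale),                                 # positive scales
            "rotations": nn.functional.normalize(rot, dim=-1),          # unit quaternions
            "colors": torch.sigmoid(color),                             # RGB in [0, 1]
            "opacities": torch.sigmoid(opacity),                        # alpha in [0, 1]
        }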
To handle dynamic unbounded scenes, we decompose the scene and its set of 3D neural Gaussians into a static background, a set of dynamic actors and a distant region (e.g., far-away buildings and sky). We initialize the 3D neural Gaussians with LiDAR or multi-view stereo points.
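One way to picture this decomposition is as a simple container holding the three groups of neural Gaussians; the class names and fields below are hypothetical and only illustrate the structure.

# Hypothetical container mirroring the background / actors / distant split.
from dataclasses import dataclass, field
from typing import Dict
import torch

@dataclass
class NeuralGaussianSet:
    positions: torch.Tensor   # (N, 3) Gaussian centers, initialized from LiDAR / MVS points
    latents: torch.Tensor     # (N, latent_dim) per-Gaussian latent features

@dataclass
class DecomposedScene:
    background: NeuralGaussianSet                       # static scene geometry
    distant: NeuralGaussianSet                          # far-away buildings and sky
    actors: Dict[str, NeuralGaussianSet] = field(default_factory=dict)
    # Actors are kept in their own object frame so they can be re-posed
    # into the world frame at each timestep for controllable simulation.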
One major problem in scene reconstruction is how to lift 2D images to 3D. Existing approaches usually aggregate image features according to geometry priors, which cannot take many source images due to high memory usage. Instead, we propose to lift 2D images to 3D space by "rendering and backpropagating" to obtain gradients w.r.t. the 3D representation.
Specifically, given the 3D representation, we first render the scene at the source input views. Then, we compare the rendered images with the inputs, compute the reconstruction loss, and backpropagate the difference to the 3D representation. The 3D gradients can efficiently aggregate as many images as needed, take the rendering procedure into account, naturally handle occlusions, and are fast to compute.
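The sketch below illustrates this "render and backpropagate" lifting with PyTorch autograd. Here rasterize_gaussians stands in for any differentiable Gaussian rasterizer passed by the caller; it is an assumption for illustration, not a specific library call or the paper's exact implementation.

# Sketch: lift 2D image evidence to 3D via rendering gradients.
import torch
import torch.nn.functional as F

def lift_images_to_3d(gaussian_params, cameras, target_images, rasterize_gaussians):
    """Return per-Gaussian gradients aggregating evidence from all source views."""
    params = gaussian_params.detach().requires_grad_(True)
    loss = 0.0
    for camera, target in zip(cameras, target_images):
        rendered = rasterize_gaussians(params, camera)   # differentiable rendering
        loss = loss + F.l1_loss(rendered, target)        # photometric reconstruction loss
    loss.backward()
    # The gradient w.r.t. the 3D representation is the "lifted" 2D signal:
    # it accounts for the rendering procedure and occlusions automatically.
    return params.grad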
Given the initial 3D neural Gaussians and gradients, we iteratively refine the scene representation. At each step, we take the current 3D representation, use it as a proxy to compute the gradient, and pass both to the reconstruction network to produce an updated representation.
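Putting the pieces together, the refinement loop might look like the following sketch, which reuses the lift_images_to_3d helper from above; recon_net, the integer step input, and the number of iterations are illustrative assumptions.

# Sketch of the iterative refinement loop: the learned network turns raw
# rendering gradients into an effective update of the representation,
# playing the role of many hand-tuned optimizer iterations.
def reconstruct(initial_params, cameras, images, recon_net, rasterize_gaussians, num_steps=24):
    params = initial_params
    for step in range(num_steps):
        grads = lift_images_to_3d(params, cameras, images, rasterize_gaussians)
        params = recon_net(params.detach(), grads, step)   # predict updated 3D neural Gaussians
    return params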
The network is trained using both source images and novel target images to increase the robustness at test time for scene reconstruction.
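As a rough illustration of such an objective, the snippet below supervises renderings at both source and held-out novel views; the L1 photometric loss and the novel-view weight are assumptions, and the actual training may use additional perceptual or regularization terms.

# Sketch of a training objective over source and novel views.
import torch.nn.functional as F

def training_loss(params, src_cams, src_imgs, novel_cams, novel_imgs,
                  rasterize_gaussians, w_novel=1.0):
    loss = 0.0
    for cam, img in zip(src_cams, src_imgs):
        loss = loss + F.l1_loss(rasterize_gaussians(params, cam), img)
    for cam, img in zip(novel_cams, novel_imgs):
        # Supervising held-out views encourages robustness to viewpoint changes.
        loss = loss + w_novel * F.l1_loss(rasterize_gaussians(params, cam), img)
    return loss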
We now show results highlighting the capabilities of G3R. First, G3R can reconstruct unseen urban driving scenes in 2 minutes or less and achieve high photorealism. (left: real log, right: G3R reconstruction).
Our method works on a variety of scenes with diverse backgrounds, traffic participants, and lighting conditions. The predicted representation can be rendered in real time at over 100 FPS.
G3R also supports multi-camera simulation. Here we show videos from all six cameras used in PandaSet.
We can do this at scale on a large variety of logs, in city or suburban areas, and in different lighting conditions, including night.
We can also use G3R to simulate images under different sensor configurations. Our reconstructed representation enables 360-degree rendering. Here we demonstrate the capability of generating panorama images.
G3R predicts an explicit and editable representation which allows for controllable camera simulation such as sensor and actor shifts. We can change the SDV’s position at different points in time, manipulate actor locations, and render multi-camera and panorama images.
We further consider a more challenging dataset, BlendedMVS, where the view and camera orientation changes are large. G3R can reconstruct these scenes in 3.5 minutes or less and produces high-quality renderings in real time.
We further demonstrate that G3R results in a robust 3D Gaussian prediction compared to 3DGS on a more challenging extrapolation setting where we select 20 consecutive frames as source views and simulate the future 3 frames. 3DGS (left) has more noticeable artifacts in those extrapolated views compared to G3R (right).
Here are more examples. G3R's representation is more robust to extrapolated views.
Finally, we showcase G3R's ability to generalize to driving scenes in the Waymo Open Dataset. G3R, trained only on PandaSet, generalizes well across datasets, unveiling the potential for scalable real-world sensor simulation.
In this paper, we introduce G3R, a novel approach for efficient, generalizable, large-scale 3D scene reconstruction. By leveraging gradient feedback signals from differentiable rendering, G3R achieves a speedup of at least 10x over state-of-the-art per-scene optimization methods, with comparable or superior photorealism. Importantly, our method predicts a standalone 3D representation that is robust to large view changes and enables real-time rendering, making it well suited for VR and simulation. Experiments on urban-driving and drone datasets showcase the efficacy of G3R for in-the-wild 3D scene reconstruction. Our learning-to-optimize paradigm with gradient signals can be applied to other 3D representations, such as triplanes with NeRF rendering, or to other inverse problems, such as generalizable surface reconstruction.
@inproceedings{chen2024g3r,
  title={G3R: Gradient Guided Generalizable Reconstruction},
  author={Yun Chen and Jingkang Wang and Ze Yang and Sivabalan Manivasagam and Raquel Urtasun},
  booktitle={European Conference on Computer Vision},
  year={2024},
}