
Neural Lighting Simulation for Urban Scenes



Ava Pun* †, Gary Sun* †, Jingkang Wang*, Yun Chen, Ze Yang, Sivabalan Manivasagam, Wei-Chiu Ma, Raquel Urtasun
* Equal contribution, † Work done while an intern at Waabi
Conference: NeurIPS 2023
Video
PDF
Poster
Supplementary

Abstract

Different outdoor illumination conditions drastically alter the appearance of urban scenes, and they can harm the performance of image-based robot perception systems if not seen during training. Camera simulation provides a cost-effective solution to create a large dataset of images captured under different lighting conditions. Towards this goal, we propose LightSim, a neural lighting camera simulation system that enables diverse, realistic, and controllable data generation. LightSim automatically builds lighting-aware digital twins at scale from collected raw sensor data and decomposes the scene into dynamic actors and static background with accurate geometry, appearance, and estimated scene lighting. These digital twins enable actor insertion, modification, removal, and rendering from a new viewpoint, all in a lighting-aware manner. LightSim then combines physically-based and learnable deferred rendering to perform realistic relighting of modified scenes, such as altering the sun location and modifying the shadows or changing the sun brightness, producing spatially- and temporally-consistent camera videos. Our experiments show that LightSim generates more realistic relighting results than prior work. Importantly, training perception models on data generated by LightSim can significantly improve their performance.

Overview

Neural Lighting Simulation. LightSim builds digital twins from large-scale data with lighting variations and generates high-fidelity simulation videos. Top: LightSim produces realistic scene relighting and shadow editing videos. Bottom: We generate a safety-critical scenario with two vehicles cutting in and perform lighting-aware camera simulation.

Video

Motivation

Cameras are a rich sensor modality for robots, such as self-driving vehicles, to perceive outdoor scenes. Unfortunately, existing camera-based perception systems do not perform well when observing camera data under outdoor illumination conditions different from what they were trained on. Camera simulation can help generate a rich dataset of variations to improve the perception system’s robustness.

One common approach to camera simulation is through game engines, which use artist-designed assets and manually-specified lighting conditions to render scene camera data in a physically-based manner. Unfortunately, the simulated camera data lack diversity and realism, as the number of assets is limited and the physically-based rendering results do not exactly match the real-world scene. This leads to a domain gap for perception model training and poor generalization to real data. Another camera simulation strategy is through data-driven approaches, which leverage neural rendering to reconstruct digital twins of the real world that replicate the observed sensor data. While this allows for more scalable creation of scenes and improves realism, existing data-driven camera simulation methods bake the scene illumination into the representation. This prevents modifications of the digital twin to create new simulated camera videos, such as changing the lighting conditions or removing and inserting new actors in a lighting-aware manner.

Therefore, we aim to create a diverse, controllable, and realistic camera simulator. Towards this goal, we focus on building relightable digital twins, which we leverage to create a simulator that can generate camera data of scenes at scale under diverse lighting conditions. We propose LightSim, which takes a sequence of multi-sensor data to reconstruct a relightable digital twin with decomposed assets and lighting representations. We can then modify these digital twins to alter the lighting conditions, change existing actors’ trajectories, and add new objects such as construction barriers or vehicles. This enables us to render new simulated camera videos that are lighting-consistent, with accurate shadows.

Method

We now review LightSim’s methodology. LightSim takes a sequence of LiDAR and camera data to build a relightable digital twin representation.

Step 1: Building Relightable Digital Twins of the Real World

The first step is to recreate the observed sensor data using compositional neural fields that represent the dynamic actors and static scene. We use a view-independent version of UniSim to reconstruct the background and dynamic actors. These reconstructed representations are converted into textured meshes with base materials.
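As a rough illustration of this conversion step, the sketch below assumes a hypothetical `field(points)` callable that returns a signed distance and an albedo for query points, and extracts a colored mesh with marching cubes; the actual UniSim-based reconstruction and texturing pipeline is more involved than this.

```python
# Minimal sketch: extracting a textured mesh from a reconstructed neural field.
# `field` is a hypothetical callable returning (signed distance, albedo) per 3D point;
# it stands in for the view-independent UniSim representation described above.
import numpy as np
from skimage.measure import marching_cubes

def neural_field_to_mesh(field, bounds, resolution=256):
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    axes = [np.linspace(l, h, resolution) for l, h in zip(lo, hi)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)      # (R, R, R, 3)

    sdf, _ = field(grid.reshape(-1, 3))                              # query signed distances
    sdf = sdf.reshape(resolution, resolution, resolution)

    spacing = (hi - lo) / (resolution - 1)
    verts, faces, normals, _ = marching_cubes(sdf, level=0.0, spacing=tuple(spacing))
    verts += lo                                                      # back to world coordinates

    _, albedo = field(verts)                                         # per-vertex base color
    return verts, faces, normals, albedo
```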

In addition to geometry and base texture, LightSim estimates outdoor illumination as a high-dynamic-range (HDR) sky dome, as the sun and sky are the main light sources in outdoor daytime scenes. With the sensor data and extracted geometries, we first estimate an incomplete panorama image, then complete it to get a full 360° view of the sky. We then use this panorama image along with GPS information to generate an HDR sky dome, which accurately estimates sun intensity, sun direction, and sky appearance. This results in relightable digital twins with editable geometry and lighting.
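To make the sky-dome parameterization concrete, here is a toy numpy sketch that builds an equirectangular HDR map from a sun azimuth, elevation, and intensity. The color gradient and the Gaussian sun lobe are placeholders: LightSim estimates the dome with a learned model from the inpainted panorama and GPS-derived sun position rather than this analytic form.

```python
# Illustrative equirectangular HDR sky dome parameterized by sun direction and
# intensity (angles in radians, z-up convention). Placeholder sky model only.
import numpy as np

def make_sky_dome(sun_azimuth, sun_elevation, sun_intensity,
                  height=256, width=512, sun_sharpness=400.0):
    v, u = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij")
    theta = np.pi * v / height            # polar angle from zenith, [0, pi]
    phi = 2.0 * np.pi * u / width         # azimuth, [0, 2*pi]

    # Per-pixel unit directions on the sphere.
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=-1)
    sun_dir = np.array([np.cos(sun_elevation) * np.cos(sun_azimuth),
                        np.cos(sun_elevation) * np.sin(sun_azimuth),
                        np.sin(sun_elevation)])

    # Simple blue-ish sky gradient plus a sharp high-intensity sun lobe.
    up = dirs[..., 2].clip(0.0, 1.0)
    sky = np.stack([0.3 + 0.2 * up, 0.5 + 0.3 * up, 0.8 + 0.2 * up], axis=-1)
    cos_angle = (dirs @ sun_dir).clip(-1.0, 1.0)
    sun = sun_intensity * np.exp(sun_sharpness * (cos_angle - 1.0))
    return sky + sun[..., None]           # (H, W, 3) HDR radiance map
```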

Step 2: Neural Lighting Simulation of Dynamic Urban Scenes

We can modify the digital twins, such as by adding or removing actors, altering actor trajectories, or changing the lighting, to generate an augmented reality representation. Given the augmented reality representation, LightSim performs physically-based rendering to generate lighting-relevant data about the modified scene, such as depth and shadows. Using this lighting-relevant data along with the estimated source and target lighting conditions of the scene, we perform neural deferred rendering to render camera videos of the simulated scene variations with relighting. The workflow is shown below.
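As an example of one such lighting-relevant buffer, the sketch below computes a per-pixel sun shadow mask by casting rays from visible surface points toward the sun and testing occlusion against the scene mesh. This is only an illustration under simplifying assumptions; trimesh stands in here for the physically-based rendering engine used in the actual system.

```python
# Toy shadow-mask buffer: a pixel is in shadow if a ray from its surface point
# toward the sun hits any scene geometry before leaving the scene.
import numpy as np
import trimesh

def sun_shadow_mask(scene_mesh: trimesh.Trimesh,
                    surface_points: np.ndarray,   # (H, W, 3) per-pixel world positions
                    sun_direction: np.ndarray,    # unit vector pointing toward the sun
                    eps: float = 1e-3) -> np.ndarray:
    h, w, _ = surface_points.shape
    origins = surface_points.reshape(-1, 3) + eps * sun_direction   # offset to avoid self-hits
    directions = np.tile(sun_direction, (origins.shape[0], 1))
    occluded = scene_mesh.ray.intersects_any(ray_origins=origins,
                                             ray_directions=directions)
    return occluded.reshape(h, w)          # True where the pixel is shadowed
```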

While the physically-based rendered images capture scene relighting effects well, the rendering results lack realism (e.g., they may contain blurriness, unrealistic surface reflections and boundary artifacts) due to imperfect geometry and noise in material/lighting decomposition. Therefore, we propose photorealism-enhanced neural deferred rendering. We use an image synthesis network that takes the source image and pre-computed buffers of lighting-relevant data generated by the rendering engine to produce the final relit image. We also provide the network with the environment maps for enhanced lighting context and formulate a novel paired-data training scheme by leveraging the digital twins to generate synthetic paired images.
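The schematic PyTorch sketch below shows how such a network might consume the source image, the pre-computed rendering buffers, and the source/target environment maps to predict the relit image. The channel counts and the small convolutional stack are placeholder assumptions, not the architecture used in the paper.

```python
# Schematic neural deferred rendering step: fuse the source image, G-buffers, and
# encodings of the source/target HDR sky domes, then synthesize the relit image.
import torch
import torch.nn as nn

class DeferredRelightingNet(nn.Module):
    def __init__(self, buffer_channels: int, env_feat_dim: int = 64):
        super().__init__()
        self.env_encoder = nn.Sequential(              # encodes a (3, He, We) sky dome
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, env_feat_dim))
        in_ch = 3 + buffer_channels + 2 * env_feat_dim
        self.synthesis = nn.Sequential(                # stand-in for a U-Net-style generator
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, src_img, buffers, src_env, tgt_env):
        b, _, h, w = src_img.shape
        env = torch.cat([self.env_encoder(src_env), self.env_encoder(tgt_env)], dim=1)
        env = env[:, :, None, None].expand(b, -1, h, w)   # broadcast lighting context per pixel
        x = torch.cat([src_img, buffers, env], dim=1)
        return self.synthesis(x)                       # relit image under the target lighting
```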

 

To ensure that our rendering network maintains controllable lighting and remains realistic, we train it with a combination of synthetic and real-world data. We take advantage of the fact that our digital twin reconstructions are derived from real-world data, and that our physically-based renderer can generate paired renderings of the same scene under different source and target lighting conditions. This yields two main types of training pairs (sim-to-sim and sim-to-real) for teaching the network the relighting task with enhanced realism. Our training objective consists of a photometric loss, a perceptual loss, and an edge-based content-preserving loss:
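A plausible form of this objective is sketched below; the exact terms and weights used in the paper are not reproduced here, so the λ weights, the perceptual feature extractor Φ, and the edge operator E should be read as assumptions.

```latex
% Assumed form of the training objective: \hat{I} is the predicted relit image,
% I_{target} the paired target image, \Phi_l features of a pretrained perceptual
% network, and E an edge extractor applied for content preservation.
\mathcal{L}
  = \lambda_{\mathrm{photo}} \,\bigl\lVert \hat{I} - I_{\mathrm{target}} \bigr\rVert_1
  + \lambda_{\mathrm{perc}} \sum_{l} \bigl\lVert \Phi_l(\hat{I}) - \Phi_l(I_{\mathrm{target}}) \bigr\rVert_1
  + \lambda_{\mathrm{edge}} \,\bigl\lVert E(\hat{I}) - E(I_{\mathrm{source}}) \bigr\rVert_1
```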

Scene Relighting

We now show results highlighting the capabilities of LightSim. First, LightSim can render the same scene under new lighting conditions in a temporally consistent manner. We overlay the HDR sky dome at the top of the video for reference. As shown in the video, the new sun position and sky appearance produce modified shadows and an updated scene appearance. LightSim performs scene relighting in a physically plausible manner.

LightSim can do this at scale, generating new temporally-consistent and 3D-aware lighting variations of the same scene from a library of estimated and real HDR sky domes.

Shadow Editing

LightSim’s lighting representation is controllable and allows for manipulation of the sun’s direction, resulting in directionally-relevant lighting changes and updated shadows. We rotate the HDR sky dome and pass it to our neural deferred rendering module to produce the following video.
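Concretely, rotating the sun about the vertical axis corresponds to a horizontal pixel shift of the equirectangular sky dome, as in the small numpy sketch below; the rotated dome is then passed to the deferred rendering module as the target lighting.

```python
# Rotate an equirectangular HDR sky dome about the vertical axis by shifting it
# horizontally: a shift of the full width corresponds to a 360-degree rotation.
import numpy as np

def rotate_sky_dome(hdr_dome: np.ndarray, azimuth_offset_rad: float) -> np.ndarray:
    """hdr_dome: (H, W, 3) equirectangular HDR map covering 360 degrees of azimuth."""
    width = hdr_dome.shape[1]
    shift = int(round(azimuth_offset_rad / (2.0 * np.pi) * width))
    return np.roll(hdr_dome, shift, axis=1)   # wrap around in azimuth
```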

We can perform shadow editing at scale on a variety of scenes and lighting conditions. The simulated videos remain temporally consistent, since we synthesize lighting variations during training (through random peak-direction and intensity scaling) to make the rendering network robust.

Lighting-Aware Actor Insertion

In addition to modifying the lighting, LightSim can perform realistic, lighting-aware insertion of uncommon objects such as construction barriers. These inserted objects feature updated shadows, accurate occlusion, and spatial consistency across the full camera configuration.

Scene 1

Scene 2

Ablation Study

We examine the importance of several key components in training our neural deferred rendering module, reporting the FID score as a quantitative measure of the perceptual quality of the rendered images. We first ablate the content-preserving loss and find that a properly tuned loss weight helps the model reduce synthetic mesh-like artifacts (compared to a weight of 0) while still simulating new lighting effects (compared to a weight of 800).

Moreover, sim-real and identity data pairs provide useful regularization for the neural deferred rendering module by reducing visual artifacts caused by imperfect geometry. Removing those data pairs leads to less realistic simulation results and worse FID scores.

Finally, the rendering buffers and shadow maps play an important role in realistically simulating intricate lighting effects such as highlights and shadows. We observe unrealistic color and missing cast shadows if the pre-computed buffers and shadow maps are removed.

Lighting Estimation Evaluation via Actor Insertion

For realistic camera simulation, LightSim estimates scene lighting more accurately than prior work. Here we demonstrate inserting a green vehicle actor into a scene using HDR sky domes estimated via different approaches. LightSim’s estimated lighting more accurately captures the sun direction and intensity for more realistic shadows (as shown in the top zoom-in). Additionally, our reconstruction of the full scene allows for accurate modelling of inter-object lighting effects, such as the shadow cast by the dynamic actor onto the green vehicle.

Controllable Camera Simulation

Combining all of these capabilities yields controllable, diverse, and realistic camera simulation with LightSim. Here we show simulated scene variations with an actor cutting into the self-driving vehicle’s (SDV’s) lane along with inserted traffic barriers, resulting in a completely new scenario with generated video data under multiple lighting conditions.

Here is another example, where we insert barriers and replace all the scene actors with a completely new set of actors reconstructed from another scene. The actors are seamlessly inserted into the scenario with the new target lighting.

Generalization to nuScenes

Since our neural deferred rendering network is trained on multiple logs, LightSim can generalize to new scenes. We now showcase LightSim’s ability to generalize to driving scenes in nuScenes. We build lighting-aware digital twins for each scene, then apply a neural deferred rendering model pre-trained on PandaSet. LightSim transfers well and performs scene relighting robustly.

Conclusion

In this paper, we aimed to build a lighting-aware camera simulation system to improve robot perception. Towards this goal, we presented LightSim, which builds lighting-aware digital twins from real-world data; modifies them to create new scenes with different actor layouts, SDV viewpoints, and lighting conditions; and performs scene relighting to enable diverse, realistic, and controllable camera simulation that produces spatially- and temporally-consistent videos. We demonstrated LightSim’s capabilities to generate new scenarios with camera video and leveraged LightSim to significantly improve object detection performance. We plan to further enhance our simulator by incorporating material model decomposition, local light source estimation, and weather simulation.

BibTeX

@inproceedings{pun2023neural,
  title={Neural Lighting Simulation for Urban Scenes},
  author={Ava Pun and Gary Sun and Jingkang Wang and Yun Chen and Ze Yang and Sivabalan Manivasagam and Wei-Chiu Ma and Raquel Urtasun},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=mcx8IGneYw}
}