
UnO: Unsupervised Occupancy Fields for Perception and Forecasting

June 13, 2024 (updated June 16, 2024)


Ben Agro*, Quinlan Sykora*, Sergio Casas*, Thomas Gilles, Raquel Urtasun

*=equal contribution

Conference: CVPR 2024 (Oral)
Video
PDF

Abstract

Perceiving the world and forecasting its future state is a critical task for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world — traditionally with object detections and trajectory predictions, or temporal bird’s-eye-view (BEV) occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we learn to perceive and forecast a continuous 4D (spatio-temporal) occupancy field with self-supervision from LiDAR data. This unsupervised world model can be easily and effectively transferred to downstream tasks. We tackle point cloud forecasting by adding a lightweight learned renderer and achieve state-of-the-art performance in Argoverse 2, nuScenes, and KITTI. To further showcase its transferability, we fine-tune our model for BEV semantic occupancy forecasting and show that it outperforms the fully supervised state-of-the-art, especially when labelled data is scarce. Finally, when compared to prior state-of-the-art on spatio-temporal geometric occupancy prediction, our 4D world model achieves a much higher recall of objects from classes relevant to self-driving.

Video

Foundation Models for the Real World

Motivation

In recent years, three trends have driven major progress in artificial intelligence:

  1. Scalable architectures and algorithms (e.g., transformers)
  2. Large and diverse datasets to feed those algorithms (e.g., data scraped from the internet)
  3. Accelerated, industrial-scale compute to train these models

Foundation models combine these trends, enabling powerful models that can be used for a variety of downstream applications.

However, most prior foundation models focus on language and are trained with next-token prediction, a simple unsupervised task that enables large-scale training on internet data. Our goal is instead to build a foundation model for the physical world, which would unlock capabilities in many industries, including domestic care, construction, and our area of focus: transportation.

A foundation model for the physical world should be able to

  1. Capture the 3D geometry of the world
  2. Understand dynamics and be able to forecast future states of the world
  3. Generalize to rare and safety critical situations
  4. Be able to run in real-time to facilitate real-world applications
  5. Be transferable to a wide variety of downstream tasks like object detection and trajectory forecasting

Prior foundation models for the physical world can be categorized by the modality they use for supervision. Generally, camera-based methods tokenize camera images and forecast the future with autoregressive next-token prediction. However, camera data suffers from depth ambiguities, and reasoning in image space limits these models' ability to understand 3D geometry. Furthermore, these works use large-scale transformer architectures that cannot perform real-time inference, and they have not demonstrated that their foundation models transfer to a variety of downstream tasks.

LiDAR data provides explicit 3D geometric information, and as such it presents a promising avenue for building foundation models of the physical world. However, prior works like 4D-Occ struggle with forecasting and generalization, while more performant methods like our Copilot4D are too slow for real-time inference; neither demonstrates adaptability to downstream tasks.

As LiDAR is the primary sensing modality for Level 4 self-driving vehicles, in this work we focus on building a LiDAR foundation model that meets the five criteria above.

Challenges

Modeling the physical world through LiDAR data brings challenges not encountered with language. First, while language is low-dimensional and naturally represented with discrete tokens, sensor data is both high-dimensional and continuous in space and time:

As such, we cannot apply the recipes from LLMs directly to our task.

Second, while text generation is directly useful for a variety of applications, it is not immediately clear how a foundation model that predicts raw sensor data can be used to enhance a self-driving system:

Idea

Instead, we propose to model the world with a spatio-temporal occupancy field, which specifies the probability that a particular location in space and future time is occupied.
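Written out (in our own notation, which matches the query points used later in the Architecture section), the world model is a field

\[ o(x, y, z, t) \in [0, 1], \]

the probability that the location \((x, y, z)\) is occupied at future time \(t\).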

This has a few distinct advantages:

  • Paralleling classification over language tokens, occupancy allows us to learn a classification task instead of regressing point clouds directly. 
  • Occupancy abstracts away the specifics of the LiDAR sensors and material properties that are hard to learn.
  • Occupancy is employed by many algorithms in robotics and autonomous driving, making it directly useful to existing stacks. 

The result of training this architecture (described below) with our unsupervised task is UnO: a 4D occupancy foundation model. We also show how to transfer UnO to downstream tasks, including point cloud forecasting and semantic bird's-eye-view occupancy forecasting.

Method

Unsupervised Task

In this section we discuss how to learn 4D occupancy fields from LiDAR data. Our unsupervised task leverages the occupancy information provided by future LiDAR data. 

We assume we know the instantaneous position of the LiDAR sensor, labelled \(\color{blue}s_i\) below.

Tracing a ray from the LiDAR position to any LiDAR point \(\color{magenta}p_{ij}\) tells us which parts of the scene must be unoccupied, \(\color{red}\mathcal{R}_{ij}^{-}\), since otherwise the laser would have been reflected earlier.

We also assume that there must be a small region of occupied space \(\color{green}\mathcal{R}_{ij}^{+}\) immediately after the LiDAR point which reflects the laser. 

Repeating this for many LiDAR rays allows us to describe the occupancy of the visible parts of the scene, and we can supervise forecasting because future sensor data is available in our datasets. We train the occupancy model on binary classification of these regions in space and future time, sampling query points equally from positive and negative regions during training. This task leverages the ability of neural networks to interpolate and generalize to unseen areas.

The overall training procedure is pictured above: the model takes past LiDAR sweeps as input to forecast a continuous spatio-temporal occupancy field into the future, which is supervised using the information from future LiDAR sweeps.
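To make the supervision concrete, here is a minimal sketch in PyTorch (names such as `make_ray_queries`, the sampling scheme, and the band width `eps` are our own simplifications, not the actual implementation) of how a single future LiDAR return can be turned into balanced occupancy queries and a classification loss:

```python
import torch

def make_ray_queries(sensor_pos, lidar_pt, t, n_neg=4, n_pos=4, eps=0.1):
    """Turn one future LiDAR return into occupancy training queries.

    sensor_pos, lidar_pt: (3,) tensors; t: time of the return (a float).
    Negative queries lie between the sensor and the return (known free space);
    positive queries lie in a small band of length `eps` just past the return
    (the surface that reflected the laser). Equal counts keep sampling balanced.
    """
    ray = lidar_pt - sensor_pos
    depth = ray.norm()
    direction = ray / depth

    neg_d = torch.rand(n_neg) * depth              # free space before the return
    pos_d = depth + torch.rand(n_pos) * eps        # occupied band just after it
    dists = torch.cat([neg_d, pos_d])

    xyz = sensor_pos + dists[:, None] * direction              # (n_neg + n_pos, 3)
    q = torch.cat([xyz, torch.full((len(xyz), 1), t)], dim=1)  # append time -> (N, 4)
    labels = torch.cat([torch.zeros(n_neg), torch.ones(n_pos)])
    return q, labels

# Training step (schematic): predict occupancy at the queries and classify them.
#   prob = model(past_lidar, q[None]).squeeze(0)   # (N,) occupancy in [0, 1]
#   loss = torch.nn.functional.binary_cross_entropy(prob, labels)
```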

Architecture

To facilitate this training procedure, we need an architecture that 

  1. Can output an occupancy probability at any continuous point in space and time, since each LiDAR ray is produced at a different continuous time.
  2. Can efficiently represent large 4D scenes.
  3. Has a large receptive field to capture actor dynamics across large spatial areas, such as fast-moving vehicles on a highway.

To meet these requirements, we choose an implicit model which ingests LiDAR data and can be queried at any point in space and future time \(\mathbf{q} = (x, y, z, t)\) for the probability of occupancy:

This architecture is based on our prior work ImplicitO, and we refer the interested reader there for further intuition and details. At a high level, we voxelize the past LiDAR and pass it through a LiDAR encoder to produce a bird's-eye-view feature map \(\mathbf{Z}\), which encodes geometric, dynamic, and semantic features of the input data. This feature map, along with a batch of query points, is fed into the implicit decoder. The decoder uses a deformable attention mechanism to increase its receptive field, which aids in learning dynamics, and it runs on all query points in parallel for efficiency.
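Schematically, the query interface looks like the sketch below (module names and tensor shapes are our own assumptions; the real encoder and deformable-attention decoder follow ImplicitO):

```python
import torch
import torch.nn as nn

class ImplicitOccupancyModel(nn.Module):
    """Past LiDAR -> BEV feature map Z; any batch of (x, y, z, t) queries
    is then decoded to occupancy probabilities in parallel."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # voxelized past LiDAR sweeps -> BEV features
        self.decoder = decoder  # attends to features around each query point

    def forward(self, past_lidar_voxels, queries):
        # past_lidar_voxels: (B, C, H, W) BEV voxel grid of the past sweeps
        # queries:           (B, N, 4) continuous (x, y, z, t) query points
        z = self.encoder(past_lidar_voxels)   # (B, F, H', W') feature map Z
        logits = self.decoder(z, queries)     # (B, N), one logit per query
        return torch.sigmoid(logits)          # occupancy probabilities
```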

Experiments

Geometric Occupancy

Below we visualize UnO’s occupancy at the present time, unrolled across an entire sequence from the urban driving dataset Argoverse 2. We show two views: in the perspective view, occupancy is coloured by height in the ego frame, while in the first-person view it is coloured by depth. The camera images are provided for reference and are not used as input.

We observe that UnO can capture all objects in the scene, including vulnerable road users like the cyclist. 

In this highway driving scenario collected from one of our trucks, UnO is able to perceive a couch on the highway, which it has never seen during training, an impressive example of generalization.

This also demonstrates aspects of UnO’s flexibility. It can adapt to

  • different vehicle platforms with different sensor setups (a passenger car in the Argoverse 2 dataset, an autonomous truck in the highway dataset)
  • different driving domains: urban vs. highway

Compared to the prior state-of-the-art occupancy foundation model, 4D-Occ, UnO has far better recall of actors of interest, including rare and vulnerable classes like wheelchairs, strollers, and school buses:

Below we visualize 4D occupancy forecasts of UnO and 4D-Occ. Unlike previous visualizations, here we show what UnO forecasts into the future. The ground truth future point cloud is provided for reference. 

We see that UnO captures dynamic actors like the turning vehicle, while 4D-Occ struggles to generalize from the input point cloud to 4D occupancy and predicts all actors to be static.

In this example, UnO accurately predicts the lane change of a vehicle.

We also ablated alternative occupancy pre-training objectives by keeping the architecture and training fixed. 

  • Depth rendering: given a LiDAR ray, the model predicts occupancy values along that ray, and those values are trained with a NeRF-like depth loss against the ground truth LiDAR point depth.
  • Free-space rendering: given a LiDAR ray, the model predicts occupancy values along that ray; we take the cumulative maximum of those values along the ray and supervise it with binary cross-entropy against the ground truth visibility provided by the LiDAR point observation (see the sketch after this list).
  • Unbalanced UnO: the same objective as UnO, but without balancing the positive and negative query points used during training.
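For the free-space rendering baseline, the cumulative-maximum supervision can be sketched roughly as follows (our own simplification; the `ray_occ` interface and sample ordering are assumptions):

```python
import torch
import torch.nn.functional as F

def free_space_rendering_loss(ray_occ, hit_index):
    """ray_occ: (S,) predicted occupancy at S samples ordered from the sensor outward.
    hit_index: index of the sample closest to the observed LiDAR return.

    Samples before the return should remain visible (running max ~ 0), and
    samples at or beyond the return should be blocked (running max ~ 1)."""
    blocked = torch.cummax(ray_occ, dim=0).values   # cumulative max along the ray
    target = torch.zeros_like(ray_occ)
    target[hit_index:] = 1.0
    return F.binary_cross_entropy(blocked, target)
```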

UnO outperforms these alternative occupancy-based pre-training procedures, achieving much higher geometric occupancy recall across all actor classes. 

This is supported by quantitative and qualitative comparisons, which show that 

  • Depth rendering hallucinates some objects, struggles with object extent, and does not capture the motion of vehicles
  • Free-space rendering has poor motion predictions 
  • Unbalanced UnO is under-confident, even on static background areas, and suffers from disappearing occupancy for moving actors.

UnO overcomes these shortcomings, exhibiting accurate 4D occupancy forecasts.

LiDAR Point Cloud Forecasting

The natural approach to LiDAR forecasting is to predict the LiDAR point cloud directly.

However, this comes with various challenges.

  1. The model needs to either predict or encode the future sensor location in order to perform raytracing. 
  2. The model needs to memorize the intrinsics of the specific LiDAR sensor.
  3. The model needs to understand the reflectance and other material properties of objects in the scene.

We would prefer to abstract away these sensor details and allow the model to focus on learning the geometry and dynamics of the world. 

Instead, to perform LiDAR forecasting, we leverage UnO’s understanding of geometry and dynamics and attach a small learned renderer that transforms occupancy fields into LiDAR point clouds.

The learned renderer takes a query ray as input, and queries UnO for occupancy values along that ray. These occupancy values along with positional encodings are fed into a small MLP to regress the LiDAR point depth along that ray, which is supervised with an \(\ell_1\) loss against the ground truth depth. By querying this learned renderer based on some known LiDAR sensor intrinsics, we can transfer UnO to the task of point cloud forecasting. 
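A rough sketch of such a renderer head is below (module names, the number of samples, and the use of normalized sample depths as a stand-in for the positional encodings are our own assumptions, not the exact design):

```python
import torch
import torch.nn as nn

class LearnedRenderer(nn.Module):
    """Turn UnO occupancy along a query ray into a LiDAR depth prediction."""

    def __init__(self, uno, n_samples=64, max_depth=80.0):
        super().__init__()
        self.uno = uno                      # pre-trained model: (past LiDAR, queries) -> occupancy
        self.n_samples = n_samples
        self.max_depth = max_depth
        # Small MLP over per-sample occupancy plus normalized sample depths
        # (a simple stand-in for the positional encodings described above).
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_samples, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, past_lidar, origin, direction, t):
        # Sample query points along the ray (origin + d * direction) at time t.
        d = torch.linspace(0.0, self.max_depth, self.n_samples)          # (S,)
        xyz = origin[None, :] + d[:, None] * direction[None, :]          # (S, 3)
        q = torch.cat([xyz, torch.full((self.n_samples, 1), t)], dim=1)  # (S, 4)
        occ = self.uno(past_lidar, q[None]).squeeze(0)                   # (S,) occupancy
        feats = torch.cat([occ, d / self.max_depth], dim=0)              # (2S,)
        return self.mlp(feats)                                           # predicted depth

# Supervision: L1 loss against the observed LiDAR point depth, e.g.
#   loss = torch.nn.functional.l1_loss(pred_depth, gt_depth)
```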

Quantitatively, UnO significantly outperforms contemporary point cloud forecasting methods across a diverse range of datasets.

UnO also emerged as the best model in the CVPR 2024 Argoverse 2 LiDAR forecasting challenge.

Compared to the state of the art, UnO more accurately predicts point clouds for moving objects, like the turning vehicle in this example:

Semantic Bird’s Eye View Occupancy Forecasting

To transfer UnO to the task of semantic BEV occupancy forecasting, we initialize the model with UnO’s pre-trained weights. Because the occupancy targets lie in the x-y plane, we replace z in the query points with a learned value and fine-tune the model to forecast semantic occupancy with a binary cross-entropy loss.
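A minimal sketch of this fine-tuning head (names and shapes are our assumptions; only the learned z coordinate and the binary cross-entropy objective come from the text):

```python
import torch
import torch.nn as nn

class BEVSemanticHead(nn.Module):
    """Fine-tune a pre-trained occupancy model for BEV semantic occupancy.
    Queries live in the x-y plane and time; z is replaced by a learned scalar."""

    def __init__(self, pretrained_uno):
        super().__init__()
        self.uno = pretrained_uno                   # initialised from UnO weights
        self.learned_z = nn.Parameter(torch.zeros(1))

    def forward(self, past_lidar, xy_t):
        # xy_t: (B, N, 3) queries of (x, y, t); insert the learned z coordinate.
        B, N, _ = xy_t.shape
        z = self.learned_z.expand(B, N, 1)
        q = torch.cat([xy_t[..., :2], z, xy_t[..., 2:]], dim=-1)  # (B, N, 4) = (x, y, z, t)
        return self.uno(past_lidar, q)                            # semantic occupancy prob

# Fine-tuned against labelled BEV occupancy, e.g.
#   loss = torch.nn.functional.binary_cross_entropy(head(past_lidar, xy_t), labels)
```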

In the graph below, we plot semantic occupancy forecasting accuracy as a function of the number of available training examples.

We find that UnO’s unsupervised pre-training allows it to outperform the contemporary methods ImplicitO and MP3, which are trained only on labelled occupancy data, even when UnO is fine-tuned with an order of magnitude less labelled data. At all levels of supervision, UnO brings a significant boost in performance. This is because UnO’s pre-training supervises all areas of the scene, including map topology and interactions with non-vehicle actors, which helps it better forecast vehicle dynamics.

Additionally, for each alternative pre-training objective we discussed earlier, we fine-tuned the model using the same procedure as for UnO. We find that UnO’s pre-training provides the best downstream semantic occupancy forecasting performance.

Compared to contemporary approaches to BEV semantic occupancy forecasting, UnO is particularly better at forecasting occupancy for relatively rare occurrences, like the large articulated vehicle.

We note that ImplicitO uses the same architecture as UnO, but without the unsupervised pre-training. UnO’s pre-training allows it to understand the road topology and how vehicles move with respect to that.

While ImplicitO, our previous state of the art BEV occupancy model, is uncertain about the intent of the turning actor, UnO correctly forecasts the turn.

Conclusion

In conclusion, we have presented UnO, a 4D occupancy foundation model which uses past LiDAR to forecast future states of the world. UnO

  • is adaptable to various downstream tasks, like LiDAR forecasting and semantic occupancy forecasting,
  • can generalize to new scenarios and vehicle platforms,
  • and has an efficient architecture capable of real-time inference on the edge,

meeting our requirements for a foundation model for the physical world.

BibTeX

@inproceedings{agro2024uno,
    title     = {UnO: Unsupervised Occupancy Fields for Perception and Forecasting},
    author    = {Agro, Ben and Sykora, Quinlan and Casas, Sergio and Gilles, Thomas and Urtasun, Raquel},
    booktitle = {CVPR},
    year      = {2024},
}