Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion


Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, Raquel Urtasun

Abstract

Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose Copilot4D, a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer into the discrete diffusion framework with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, our model reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotic agents.

Overview

Copilot4D is the first foundation model purpose-built for the physical world that can reason in 3D space and time, which is necessary for safe navigation of the dynamic 3D world we live in. Akin to how LLMs learn by predicting the next word given the previous words, our 4D world model acquires knowledge by learning to forecast how the world will evolve over the next few seconds given our past observations of the world. Importantly, no human supervision is required in the learning process: the robot can learn simply by interacting with and observing the world.

In our physical-world setting, the observations captured by the sensors mounted on the robot play the same role as words in LLMs, and the equivalent of the sentence-completion task is to predict future observations. This poses several challenges, as sensor readings are very high dimensional compared to words. For example, a single LiDAR sensor typically observes millions of points per second. Furthermore, future observations depend on the actions of the robot, since the robot’s future behavior influences both how the world will evolve and the perspective from which the robot will see that world. For example, if the self-driving vehicle is slowing down, the car behind us will likely brake to avoid a collision. To capture this, Copilot4D can be prompted with potential future actions, which is what makes it a world model. The following diagram summarizes what Copilot4D does.


Method

We propose a scalable recipe for learning world models that is broadly applicable to many domains and many types of data. Our solution can be summarized as: tokenize everything, then apply discrete diffusion to the tokenized data.

Tokenize Everything. In language modeling, all text is first tokenized, and the language model predicts discrete indices like a classifier. In robotics, any data can be tokenized as well by training specialized VQ-VAEs (vector-quantized variational autoencoders). We therefore tokenize the 3D world by training a VQ-VAE on observed point cloud data, building on top of our previous work, UltraLiDAR. The figure below illustrates this process (i.e., how we go from observations to bird’s-eye-view (BEV) tokens, and back to observations for reconstruction).
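
To make the tokenization step concrete, below is a minimal PyTorch sketch of how a BEV feature map from the tokenizer encoder could be vector-quantized into discrete token indices. The tensor shapes, codebook size, and function name are illustrative assumptions, not the actual Copilot4D/UltraLiDAR configuration.

import torch

def quantize_bev(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """features: (H, W, D) BEV feature map; codebook: (K, D) learned embedding table.
    Returns an (H, W) grid of integer token indices (the discrete 'words' of the scene)."""
    H, W, D = features.shape
    flat = features.reshape(-1, D)            # (H*W, D)
    dists = torch.cdist(flat, codebook)       # (H*W, K) pairwise Euclidean distances
    tokens = dists.argmin(dim=-1)             # nearest codebook entry per BEV cell
    return tokens.reshape(H, W)

# Illustrative usage: a 128x128 BEV grid with 64-dim features and a 1024-entry codebook.
feats = torch.randn(128, 128, 64)
codebook = torch.randn(1024, 64)
bev_tokens = quantize_bev(feats, codebook)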

Discrete Diffusion. Discrete diffusion is a highly flexible class of generative models that can decode an arbitrary number of tokens at each step. It can also iteratively refine the already decoded tokens. Both of those properties address the shortcomings of a purely autoregressive approach: GPT can only decode one token at a time. In autonomous driving, a single observation has tens of thousands of tokens, so parallel decoding of tokens becomes a must. The diagram below illustrates how discrete diffusion can be used to autoregressively predict future frames: take past BEV tokens and actions, perform discrete diffusion to obtain the next grid of BEV tokens, add this to the history of tokens, and repeat.
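
As a rough illustration of this parallel decoding, here is a simplified MaskGIT-style loop recast as discrete diffusion: start from a fully masked token grid, predict all tokens in parallel at every step, tentatively accept every prediction, and re-mask the least confident ones for refinement in later steps. The model interface, cosine masking schedule, and step count are assumptions for illustration, not the paper’s exact formulation.

import math
import torch

def parallel_decode(model, context, num_tokens, mask_id, steps=10):
    """Decode a grid of `num_tokens` discrete tokens in `steps` denoising steps."""
    tokens = torch.full((num_tokens,), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, context)              # (num_tokens, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        tokens = pred.clone()                        # tentatively accept every prediction
        # Cosine schedule: fraction of tokens to re-mask after this step.
        num_masked = int(num_tokens * math.cos(math.pi / 2 * (step + 1) / steps))
        if num_masked > 0:
            remask = conf.argsort()[:num_masked]     # least confident predictions
            tokens[remask] = mask_id                 # these get refined in later steps
    return tokens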

Equipped with a tokenizer and a world model, we can forecast future sensor data as follows (a code sketch of this loop appears after the list):

  1. Tokenize past sensor observations with the tokenizer encoder, obtaining BEV tokens for each past observation.
  2. Use these BEV tokens to predict future BEV tokens with our world model auto-regressively, which leverages discrete diffusion.
  3. Map the predicted futures from BEV tokens back to point clouds with the tokenizer decoder and differentiable neural depth rendering.
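
Below is a high-level sketch of that forecasting loop. The tokenizer, world model, and renderer interfaces (encode, denoise, decode, and the renderer call) are hypothetical names that stand in for the corresponding Copilot4D components.

def forecast(tokenizer, world_model, renderer, past_point_clouds, future_actions):
    # 1. Tokenize past observations into BEV token grids.
    history = [tokenizer.encode(pc) for pc in past_point_clouds]
    predicted_point_clouds = []
    for action in future_actions:
        # 2. Predict the next BEV token grid via discrete diffusion,
        #    conditioned on the token history and the planned action.
        next_tokens = world_model.denoise(history, action)
        history.append(next_tokens)
        # 3. Decode tokens back to geometry and render the corresponding
        #    point cloud with differentiable neural depth rendering.
        predicted_point_clouds.append(renderer(tokenizer.decode(next_tokens)))
    return predicted_point_clouds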

The last part of Copilot4D’s recipe is how it is trained. We train the world model on a mixture of objectives to enhance its performance: 50% of the time, we train it to predict the future conditioned on the past; 40% of the time, we train it to denoise the past and future jointly, which is a harder pretraining task; and 10% of the time, we train it to denoise each frame individually, a task that is crucial to enable classifier-free diffusion guidance at inference time.
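
As a simple illustration of this mixture, one way to draw the training task for each batch is sketched below; only the 50/40/10 split comes from the text, and the task names are placeholders.

import random

def sample_training_task():
    r = random.random()
    if r < 0.5:
        return "predict_future_given_past"        # standard world-model objective
    elif r < 0.9:
        return "denoise_past_and_future_jointly"  # harder joint-denoising pretraining task
    else:
        return "denoise_each_frame_individually"  # enables classifier-free guidance later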

Results

Copilot4D achieves state-of-the-art results in point cloud forecasting. In this task, models are provided with a series of past LiDAR point clouds and are evaluated on their ability to forecast the future LiDAR point clouds that the embodied agent will observe over a particular time horizon (e.g., 3 seconds into the future). In this evaluation, Copilot4D outperforms existing approaches by a large margin. In the charts below, we show results for KITTI and nuScenes, two popular public autonomous driving datasets where methods compete to get the best performance. Chamfer distance evaluates the similarity between the true and predicted point clouds, defined as the average distance between pairs of nearest-neighbor points (lower is better). Please refer to the paper for more metrics as well as results on Argoverse 2, another popular self-driving dataset.
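
For reference, a brute-force sketch of one common symmetric variant of the Chamfer distance is shown below; some formulations use squared distances or sum rather than average the two directions, so see the paper for the exact definition used in the evaluation.

import torch

def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred: (N, 3) predicted points; gt: (M, 3) ground-truth points."""
    d = torch.cdist(pred, gt)                     # (N, M) pairwise distances
    return 0.5 * (d.min(dim=1).values.mean()      # each predicted point -> nearest gt point
                  + d.min(dim=0).values.mean())   # each gt point -> nearest predicted point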

These quantitative differences can also be clearly observed in qualitative results. When compared to the previous state-of-the-art method, 4D-Occ, we can observe that the depth estimates from our model are much more accurate and less noisy, being able to preserve details such as the contours of traffic participants even 3 seconds into the future (see the orange highlights below). In contrast, the baselines struggle to forecast even 1 second into the future.

So far, we have shown that when trained on a particular dataset, Copilot4D can better predict the future. This result holds not only when evaluating our method on the validation set of that dataset, but also when performing zero-shot transfer. In other words, our method generalizes better than previous methods to different sensor platforms from datasets it has never seen. In the table below, we show this for transfer from Argoverse 2 Sensor to KITTI, where Copilot4D achieves more than a 4x improvement over the previous state-of-the-art in zero-shot point cloud forecasting.

Since Copilot4D is made of a tokenizer and a world model, let’s visualize the outputs for each component, starting with the tokenizer. In this video, we show the reconstructed point clouds from our tokenizer over a sequence of frames. We show the ground-truth side-by-side for comparison. We can see that the tokenizer can capture details in the background such as trees, as well as dynamic traffic participants like vehicles.

Next, we visualize the world model’s discrete diffusion process as well as its autoregressive prediction. To predict the frame at time t, the model encodes past point clouds up to t-1 into BEV tokens. Conditioned on those BEV tokens as well as the self-driving vehicle (SDV) action at t, it performs k steps of discrete diffusion to obtain the BEV tokens representing the next frame, t. We then use those as history/context and repeat the process to obtain the BEV tokens representing t+1. The GIF below shows this process unrolling up to 3s into the future.

Let’s see one more example, where the SDV is traveling through an intersection.

Finally, we show that Copilot4D understands the impact the self-driving vehicle’s future actions have on the behavior of surrounding traffic participants. The next GIF shows the different futures Copilot4D predicts when prompted with a constant-velocity trajectory (left) vs. a heavy-braking trajectory (right), from 0.6 seconds into the future up to 3s. We can observe that the world model understands that vehicles behind are likely to react to the ego’s actions. When prompted with the heavy-braking trajectory, Copilot4D predicts that the vehicle behind the ego will also brake, although it will get closer to the rear bumper, reflecting that this is not a maneuver the self-driving vehicle should consider.
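
The per-frame denoising objective mentioned in the Method section is what enables classifier-free diffusion guidance, the standard way to strengthen this kind of action conditioning at inference time. A minimal sketch of the idea is below; the interface and guidance scale are illustrative assumptions, not the paper’s exact formulation.

def guided_logits(world_model, tokens, history, action, guidance_scale=2.0):
    cond = world_model(tokens, history, action)    # action-conditioned token logits
    uncond = world_model(tokens, history, None)    # unconditional (action-free) logits
    # Push predictions toward futures consistent with the prompted action.
    return uncond + guidance_scale * (cond - uncond)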

Conclusion

In summary, Copilot4D is empowered by Generative AI and represents a significant advancement in self-driving technology. It integrates Large Language Model techniques and discrete diffusion to efficiently process complex 4D spatio-temporal data. The model’s ability to learn from unlabeled data and perform counterfactual reasoning enhances decision-making, improving safety and adaptability. It marks a crucial step towards fully autonomous vehicles and demonstrates the transformative impact of Generative AI in practical applications.

BibTeX

@inproceedings{zhang2023learning,
  title={Learning unsupervised world models for autonomous driving via discrete diffusion},
  author={Zhang, Lunjun and Xiong, Yuwen and Yang, Ze and Casas, Sergio and Hu, Rui and Urtasun, Raquel},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}