We live in a dynamic 3D world that evolves over time, and as we interact with this world, our brains are constantly making hundreds of decisions in fractions of a second. From deciding whether to cross a road to choosing when to merge into another lane while driving, our brains have the remarkable ability to understand our 3D space over time (the fourth dimension) and determine the best action for us. While this might seem like second nature to many of us, it actually involves incredibly complex reasoning skills and is not so simple for artificial brains.
Take for example how we understand and engage with the world around us. We rely on senses like sight and hearing to perceive the world, while intelligent machines rely on sensors to do so. Over the last few years, LiDAR has become the predominant sensor for intelligent machines to perceive the physical world, because it provides precise 3D information, which is crucial for navigation and interaction. LiDAR measures the distance of a surface from the sensor by emitting pulses of laser light and measuring how long each pulse takes to reflect back. At a basic level, a LiDAR point is captured for every ray of light that hits an object and returns to the sensor. This enables the machine to “see” the precise point in 3D where the object’s surface is.
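To make this concrete, here is a minimal sketch of how a single LiDAR return, measured as a round-trip time plus beam angles, becomes a 3D point in the sensor frame. The time-of-flight relationship is standard physics, but the function name, axis convention, and simple spherical beam model are assumptions made for illustration, not any particular sensor’s firmware.

```python
import numpy as np

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def lidar_return_to_xyz(round_trip_time_s: float, azimuth_rad: float, elevation_rad: float) -> np.ndarray:
    """Illustrative conversion of one LiDAR return into a 3D point in the sensor frame.

    Assumes a simple time-of-flight model (range = c * t / 2) and a spherical beam
    model with x forward, y left, z up; real sensors add per-beam calibration,
    intensity, and often multiple returns per pulse.
    """
    range_m = 0.5 * SPEED_OF_LIGHT * round_trip_time_s
    x = range_m * np.cos(elevation_rad) * np.cos(azimuth_rad)
    y = range_m * np.cos(elevation_rad) * np.sin(azimuth_rad)
    z = range_m * np.sin(elevation_rad)
    return np.array([x, y, z])

# Example: a pulse that returns after ~133 ns corresponds to a surface ~20 m away.
point = lidar_return_to_xyz(2 * 20.0 / SPEED_OF_LIGHT, np.deg2rad(30.0), 0.0)
print(point)  # ~[17.3, 10.0, 0.0]
```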
The challenge, however, remains: how can we enable these intelligent machines to reliably and efficiently extract information from these sensor readings to learn about and engage with the world in real time?
We believe the answer lies in Generative AI. Recent breakthroughs in this area have fully transformed the digital world. Large Language Models (LLMs) have demonstrated the endless possibilities that open up when AI is scaled to learn from massive amounts of data from the Internet. Today, these models are often referred to as foundation models due to their versatility and ability to be fine-tuned for a diverse set of applications, from mathematics and coding to text summarization and chatbots.
At Waabi, we’re enabling a similar revolution in the real world by building new foundation models purposely designed for the physical world. To that end, we’re excited to unveil Copilot4D, the first foundation model that explicitly reasons in 3D space and in the fourth dimension, time, learning remarkable capabilities for interacting and acting in a dynamic world, whether in simulation, like Waabi World, or in the physical world we live in. It paves the way toward more intelligent machines, from autonomous vehicles to robotics and more.
Similar to how LLMs learn by predicting the next word in a sentence, Copilot4D learns by predicting how a machine will observe the world in the future. However, while LLMs learn from discrete tokens that represent words, LiDAR data is continuous in nature. To bridge this gap between language and the physical world, Copilot4D features a three-stage architecture (a simplified sketch of the pipeline follows the list below).
- First, a LiDAR tokenizer abstracts continuous sensor data into a set of discrete tokens, similar to words in language.
- Then, our foundation model forecasts how the world will evolve as a set of tokens, leveraging the recent breakthroughs in LLMs. Importantly, it takes into account how the future actions of the embodied AI agent will affect the world.
- Finally, a LiDAR renderer brings these tokens back to LiDAR point clouds, something robots can observe just like humans see through their eyes, enabling us to learn from raw sensor recordings without requiring human supervision.
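As a rough illustration of how these three stages compose, the sketch below stubs them out with hypothetical function names and array shapes; it is not Waabi’s API, just a way to see the end-to-end flow from past observations (plus planned actions) to forecasted future point clouds.

```python
import numpy as np

def tokenize(point_cloud: np.ndarray) -> np.ndarray:
    """Stage 1 (LiDAR tokenizer): abstract an (N, 3) point cloud into a grid of discrete token ids."""
    ...

def forecast_tokens(past_tokens: list[np.ndarray], planned_actions: np.ndarray) -> list[np.ndarray]:
    """Stage 2 (foundation model): predict future token grids, conditioned on the agent's planned actions."""
    ...

def render(token_grid: np.ndarray, sensor_rays: np.ndarray) -> np.ndarray:
    """Stage 3 (LiDAR renderer): map a token grid back to a LiDAR point cloud along the sensor's rays."""
    ...

def forecast_future_point_clouds(past_point_clouds, planned_actions, sensor_rays):
    """End-to-end flow: tokenize the observed past, forecast future tokens, render them back to points."""
    past_tokens = [tokenize(pc) for pc in past_point_clouds]
    future_tokens = forecast_tokens(past_tokens, planned_actions)
    return [render(token_grid, sensor_rays) for token_grid in future_tokens]
```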
Now that we have explained how Copilot4D works at a high level, let’s dive deeper into its three components, starting with the tokenizer. Our tokenizer, UltraLiDAR, abstracts continuous sensor data into a grid of discrete tokens in Bird’s-Eye-View, or in other words, as if the scene were seen by a bird looking straight down. Each token in the grid essentially describes a local 3D neighborhood of the scene, and is the foundation that the embodied agent utilizes to understand its environment in detail.
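As a rough sketch of the idea (not UltraLiDAR itself, which pairs a learned encoder with vector quantization), the code below voxelizes a point cloud into a Bird’s-Eye-View grid and assigns each cell the index of its nearest codebook entry. The grid extents, cell size, vertical-occupancy features, and parameter names are all illustrative assumptions.

```python
import numpy as np

def bev_tokenize(points: np.ndarray, codebook: np.ndarray,
                 grid_range: float = 50.0, cell_size: float = 0.5,
                 z_bins: int = 8, z_min: float = -3.0, z_max: float = 1.0) -> np.ndarray:
    """Toy Bird's-Eye-View tokenizer: voxelize, then quantize each cell to a discrete token.

    points:   (N, 3) LiDAR points in the sensor frame.
    codebook: (K, z_bins) code vectors; here a cell's feature is simply its vertical
              occupancy pattern, and its token is the index of the nearest code.
    Returns an (H, W) grid of integer token ids, one per BEV cell.
    """
    n_cells = int(2 * grid_range / cell_size)
    occupancy = np.zeros((n_cells, n_cells, z_bins), dtype=np.float32)

    # Keep points inside the grid, then bin them into BEV cells and height slices.
    mask = ((np.abs(points[:, 0]) < grid_range) & (np.abs(points[:, 1]) < grid_range)
            & (points[:, 2] >= z_min) & (points[:, 2] < z_max))
    pts = points[mask]
    ix = ((pts[:, 0] + grid_range) / cell_size).astype(int)
    iy = ((pts[:, 1] + grid_range) / cell_size).astype(int)
    iz = ((pts[:, 2] - z_min) / (z_max - z_min) * z_bins).astype(int)
    occupancy[ix, iy, iz] = 1.0

    # Quantize: map each cell's occupancy vector to its nearest codebook entry.
    flat = occupancy.reshape(-1, z_bins)                                  # (H*W, z_bins)
    d2 = ((flat ** 2).sum(1, keepdims=True)
          - 2.0 * flat @ codebook.T
          + (codebook ** 2).sum(1))                                       # (H*W, K) squared distances
    return d2.argmin(axis=1).reshape(n_cells, n_cells)                    # (H, W) token ids

# Toy usage: 10k random points and a random codebook of 256 codes.
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-50, 50, 10_000),
                       rng.uniform(-50, 50, 10_000),
                       rng.uniform(-3, 1, 10_000)])
codes = rng.random((256, 8), dtype=np.float32)
print(bev_tokenize(pts, codes).shape)  # (200, 200)
```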
Equipped with a set of discrete tokens representing the physical world, the foundation model can then predict the next set of tokens to forecast how the scene will evolve into the future, for example what the different vehicles and pedestrians will do. This is similar to how a typical LLM predicts the next word in a sentence, but instead of words, Copilot4D predicts the next version of the world around it. It is important to note that a LiDAR point cloud is much more complex and high-dimensional than a word, so predicting one token at a time, as is done in LLMs, is computationally prohibitive. To overcome this challenge, we leverage discrete diffusion to predict multiple tokens in parallel, making our model much more efficient.
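The snippet below is a minimal sketch of this kind of iterative parallel decoding, in the spirit of masked (absorbing-state) discrete diffusion: start from a fully masked token grid, predict every masked token at once, commit only the most confident predictions, and repeat. The `predict_logits` callable, the cosine unmasking schedule, and the step count are illustrative stand-ins rather than Copilot4D’s actual implementation, which also conditions on past tokens and the agent’s future actions.

```python
import numpy as np

MASK = -1  # id of the special "masked" token (the absorbing state)

def parallel_decode(predict_logits, num_tokens: int, steps: int = 8) -> np.ndarray:
    """Iteratively unmask a token sequence, committing many tokens per step.

    predict_logits: callable mapping the current (partially masked) tokens to
                    (num_tokens, vocab_size) logits; a stand-in for the world model.
    """
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(steps):
        still_masked = tokens == MASK
        if not still_masked.any():
            break
        logits = predict_logits(tokens)                        # (num_tokens, vocab_size)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)                                # most likely token per position
        conf = np.where(still_masked, probs.max(-1), -np.inf)  # only rank masked positions

        # Cosine schedule: how many positions should remain masked after this step
        # (reaches zero on the final step, so everything gets decoded).
        keep_masked = int(num_tokens * np.cos(np.pi / 2 * (step + 1) / steps))
        n_commit = max(int(still_masked.sum()) - keep_masked, 1)
        commit = np.argsort(-conf)[:n_commit]                  # most confident masked positions
        tokens[commit] = pred[commit]
    return tokens

# Toy usage: random logits stand in for the trained, action-conditioned model.
rng = np.random.default_rng(0)
fake_model = lambda toks: rng.normal(size=(toks.shape[0], 16))  # vocab of 16 toy tokens
print(parallel_decode(fake_model, num_tokens=32))
```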
To bring our beliefs about the future back to a representation that machines and humans can understand, we employ a LiDAR renderer that essentially has the inverse role of the tokenizer: mapping the discrete tokens back to continuous LiDAR point clouds. To do so, we exploit state-of-the-art techniques in physics-inspired differentiable neural depth rendering to predict an accurate depth for each LiDAR ray.
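At its core, this kind of renderer asks, for each ray, “where does the ray most likely stop?” A minimal sketch of that expected-depth computation is below, using standard volume-rendering-style weights; the occupancy values are placeholders for what a neural field decoded from the tokens would predict, and the exact formulation used in Copilot4D may differ.

```python
import numpy as np

def expected_ray_depth(occupancy: np.ndarray, depths: np.ndarray) -> float:
    """Differentiable depth for one LiDAR ray from per-sample occupancy.

    occupancy: (S,) occupancy/opacity in [0, 1] at samples along the ray (network output).
    depths:    (S,) distance of each sample from the sensor, in increasing order.
    Each sample contributes with the probability that the ray terminates there:
    it is occupied and nothing closer blocked it (volume-rendering weights).
    """
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - occupancy[:-1]]))
    weights = occupancy * transmittance          # P(ray stops at each sample)
    weights = weights / (weights.sum() + 1e-8)   # normalize into a valid expectation
    return float((weights * depths).sum())

# Toy ray: free space until ~20 m, then an occupied surface.
samples = np.linspace(0.5, 40.0, 80)
occ = (samples > 20.0).astype(float) * 0.9
print(expected_ray_depth(occ, samples))  # ~20.5 m
```

Because every operation here is differentiable, gradients can flow from the rendered depths back through the model, which is what allows the whole system to learn directly from raw LiDAR recordings.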
To demonstrate the efficacy of Copilot4D, we compare its performance to that of state-of-the-art models on multiple public leaderboards for the task of point cloud forecasting. Models are provided with a series of past LiDAR point clouds and are evaluated on their ability to forecast the future LiDAR point clouds that the embodied agent will observe over a particular time horizon (e.g., 3 seconds into the future). In this evaluation, Copilot4D outperformed existing approaches by a large margin.
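For readers unfamiliar with the task, forecasts of this kind are typically scored by comparing the predicted and actually observed future point clouds with metrics such as Chamfer distance; the snippet below is an illustrative implementation of that metric, not the exact protocol of any particular leaderboard.

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, target: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds pred (N, 3) and target (M, 3).

    For each predicted point, find the squared distance to its nearest target point,
    and vice versa; average both directions. Lower is better.
    """
    d2 = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise squared distances
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```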
Copilot4D has many exciting capabilities, enabling a plethora of applications. It can generate scenes from scratch, it can complete partial scenes, it can forecast the future based on the past, and it can do so for different counterfactual trajectories of the embodied agent. Importantly, it can learn about the world from different embodied agents (cars, trucks, robots, etc.) that can be outfitted with different types, numbers, and locations of LiDAR sensors. This provides Copilot4D with the ability to generalize to applications and situations it has not been trained on.
Copilot4D marks a breakthrough in how intelligent machines can leverage raw sensor data to understand not only the world they are operating in, but also how it is going to evolve in the future. It empowers intelligent machines, like self-driving vehicles, to make safer decisions that are not reactive, but proactive. For example, when the self-driving vehicle is preparing for a lane change to follow a specific route, it can prompt Copilot4D with a lane change action to understand how other vehicles in the adjacent lane will react, making sure it is safe before starting the maneuver. Copilot4D is also efficient: the compute it needs runs on the intelligent agent itself, and it can learn by observing and interacting with the world, without requiring human supervision. We believe it is a critical linchpin for enabling smarter, safer, and more effective autonomous machines in the real world, from self-driving vehicles to warehouse robots, drones, and more.
To learn more, read our research paper on Copilot4D.