
DeTra: A Unified Model for Object Detection and Trajectory Forecasting
ECCV 2024
Sergio Casas*, Ben Agro*, Jiageng Mao*†, Thomas Gilles, Alexander Cui†, Thomas Li, Raquel Urtasun
Object-based self-driving vehicle stacks produce object detections and multi-hypothesis forecasts to anticipate the potential future behaviors of those objects. A motion planner then uses those forecasts to decide on a plan for the self-driving vehicle.
The traditional approach to this task is “Detection-Tracking-Prediction”, wherein sensory data are processed by an object detector to produce bounding boxes, the boxes are associated over time to produce historical tracks for each object, and the tracks and map are used to forecast future trajectories.
This design has several drawbacks: the track representation bottlenecks information between the rich sensory data and the prediction task; this narrow interface causes errors to compound across tasks; and the modules are not optimized jointly, so the detector does not know which mistakes hurt prediction the most (e.g., an incorrect detection heading).
Recently, “Joint Detection and Prediction” approaches have become a popular alternative that overcomes these issues. These methods process the LiDAR and map data to produce scene-level features, which are used for object detection. The scene features are cropped at each detected object to produce per-object embeddings, which, along with the map data, are used to decode future trajectories.
These joint methods mitigate the information bottleneck; however, they still rely on cascaded inference, where detection errors can propagate to prediction.
Our method circumvents this established cascading approach by reformulating perception and trajectory forecasting as a single trajectory refinement task. The input to our model is a set of initial object queries and poses that represent trajectories. These queries form a volume with an object dimension, a time dimension, and a mode dimension. The first index t = 0 of the time dimension represents detections, and t > 0 represents the forecasts. The mode dimension represents different possible futures for each object. A refinement transformer iteratively updates the query volume and poses by cross-attending to inputs such as LiDAR point clouds and HD maps. The refined detections and forecasts can be read off of the poses at each block, and the final predictions are simply the poses from the last block.
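To make the formulation concrete, here is a minimal sketch of the query volume and pose tensors (our own illustration rather than the actual DeTra code); the sizes `N`, `M`, `T`, and `D` are hypothetical placeholders for the number of objects, modes, time steps, and feature channels.

```python
import torch

# Hypothetical sizes: N objects, M modes, T time steps (t = 0 is the present), D feature channels.
N, M, T, D = 100, 6, 7, 256

# Query volume: one feature vector per (object, mode, time) cell.
queries = torch.zeros(N, M, T, D)

# Poses associated with each cell, e.g. (x, y, heading).
poses = torch.zeros(N, M, T, 3)

# After refinement, outputs are read directly off the poses:
detections = poses[:, :, 0]    # t = 0  -> current object poses
forecasts = poses[:, :, 1:]    # t > 0  -> multi-modal future trajectories
```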
Our approach uses the architecture shown below.
We use modality-specific encoders to tokenize the heterogeneous input data.
The query volume is initialized from learnable mode parameters and time parameters, expanded into a 3D feature volume with object, time, and mode dimensions.
We generate initial poses for each object token using a lightweight detection network, initializing all objects and modes to be stationary at their detected pose.
The query volume is refined with cross-attention to LiDAR and map tokens, and factored self-attention along the time, mode, and object dimensions.
At the end of a transformer refinement block, the refined query volume is used to update the poses.
The refined queries and poses are fed back into subsequent transformer blocks; this process is repeated B times (a minimal sketch of the loop is shown below).
The final poses represent the detections and forecasts for all objects in the scene.
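The sketch below illustrates one way such a refinement block could be wired up. It is a simplified approximation under our own assumptions, not the DeTra implementation: module names like `RefinementBlock`, `FactoredSelfAttention`, and `pose_head` are hypothetical, and plain multi-head attention stands in for the geometric LiDAR and map attention described next.

```python
import torch
import torch.nn as nn

class FactoredSelfAttention(nn.Module):
    """Self-attention along one axis (object, mode, or time) of the query volume."""
    def __init__(self, dim: int, axis: int, heads: int = 8):
        super().__init__()
        self.axis = axis  # 0: object, 1: mode, 2: time
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q):                              # q: (N, M, T, D)
        x = q.movedim(self.axis, 2)                    # attended axis becomes the sequence axis
        shape = x.shape                                # e.g. (M, T, N, D) for object attention
        x = x.reshape(shape[0] * shape[1], shape[2], shape[3])
        x, _ = self.attn(x, x, x)
        x = x.reshape(shape).movedim(2, self.axis)     # back to (N, M, T, D)
        return q + x

class RefinementBlock(nn.Module):
    """One refinement step: cross-attend to sensor/map tokens, factored self-attention, pose update."""
    def __init__(self, dim: int):
        super().__init__()
        # Plain cross-attention stands in here for the geometric LiDAR/map attention described below.
        self.lidar_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.map_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.obj_sa = FactoredSelfAttention(dim, axis=0)
        self.mode_sa = FactoredSelfAttention(dim, axis=1)
        self.time_sa = FactoredSelfAttention(dim, axis=2)
        self.pose_head = nn.Linear(dim, 3)             # predicts a residual (x, y, heading)

    def forward(self, queries, poses, lidar_tokens, map_tokens):
        # queries: (N, M, T, D), poses: (N, M, T, 3), lidar/map tokens: (1, L, D)
        N, M, T, D = queries.shape
        q = queries.reshape(1, N * M * T, D)
        q = q + self.lidar_attn(q, lidar_tokens, lidar_tokens)[0]
        q = q + self.map_attn(q, map_tokens, map_tokens)[0]
        q = q.reshape(N, M, T, D)
        q = self.time_sa(self.mode_sa(self.obj_sa(q)))
        poses = poses + self.pose_head(q)              # refine the poses from the updated queries
        return q, poses
```

A full model would stack B such blocks, feeding the refined queries and poses from each block into the next, and read the final detections and forecasts off the poses of the last block.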
Object self-attention attends across all object queries within the same mode and time, to model interactions and predict consistent behaviors.
Mode self-attention attends across queries within the same object and time, to allow the modes to diversify and realistically cover possible futures.
Time self-attention attends across queries within the trajectory (object and mode) to allow for consistent poses over time.
In LiDAR attention, each object query attends to the LiDAR tokens. To make this efficient, we use the pose associated with the object query in deformable attention, predicting several offsets originating from that pose and limiting the attention to the interpolated tokens at those offset points.
In map attention, we similarly limit cross-attention to the k-nearest map tokens, using the object poses over multiple time steps and modes; a simplified sketch of both geometric attention mechanisms follows.
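The sketch below gives a rough sense of how the pose can act as a geometric prior in both forms of cross-attention. It is our own simplified, single-head approximation rather than DeTra's implementation: `PoseDeformableLidarAttention`, `knn_map_mask`, the BEV feature-map layout, and the region-of-interest size are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseDeformableLidarAttention(nn.Module):
    """Each query samples a few learned offset points around its pose in a BEV LiDAR feature map."""
    def __init__(self, dim: int, num_points: int = 4, roi: float = 10.0):
        super().__init__()
        self.num_points = num_points
        self.roi = roi                                  # assumed max offset reach in meters
        self.offset_head = nn.Linear(dim, num_points * 2)
        self.weight_head = nn.Linear(dim, num_points)
        self.value_proj = nn.Linear(dim, dim)

    def forward(self, queries, poses_xy, bev_feats, bev_extent):
        # queries: (Q, D); poses_xy: (Q, 2) in meters; bev_feats: (D, H, W);
        # bev_extent: half-size of the BEV map in meters (assumed centered at the ego vehicle).
        Q, _ = queries.shape
        offsets = torch.tanh(self.offset_head(queries)).view(Q, self.num_points, 2) * self.roi
        points = poses_xy[:, None, :] + offsets                        # (Q, P, 2) sampling locations
        grid = (points / bev_extent).clamp(-1.0, 1.0)                  # normalize for grid_sample
        sampled = F.grid_sample(bev_feats[None], grid[None], align_corners=False)  # (1, D, Q, P)
        sampled = sampled[0].permute(1, 2, 0)                          # (Q, P, D) interpolated tokens
        weights = self.weight_head(queries).softmax(dim=-1)            # (Q, P) attention weights
        return queries + (weights[..., None] * self.value_proj(sampled)).sum(dim=1)

def knn_map_mask(poses_xy, map_xy, k: int = 16):
    """Boolean attention mask keeping only the k nearest map tokens per query (True = blocked)."""
    dists = torch.cdist(poses_xy, map_xy)                              # (Q, L) pairwise distances
    keep = dists.topk(k, dim=-1, largest=False).indices                # (Q, k) nearest map tokens
    return torch.ones_like(dists, dtype=torch.bool).scatter_(1, keep, False)
```

The mask from `knn_map_mask` could then be passed as the `attn_mask` of a standard map cross-attention so that each query only attends to the map tokens near its poses.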
To train DeTra, we apply a DETR-like detection loss and a Laplacian winner-takes-all forecasting loss to the pose outputs of the refinement transformer, including the intermediate outputs.
Additionally, we found that it was necessary to provide a good pose initialization by supervising the initial object poses with a heatmap detection loss.
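As a rough illustration, the snippet below shows a simplified Laplacian winner-takes-all objective of the kind described above; it is our own sketch, omitting the DETR-style matching, detection loss, and heatmap supervision, and the tensor layout is an assumption.

```python
import torch

def laplace_wta_loss(pred_xy, pred_b, gt_xy):
    """Winner-takes-all Laplace negative log-likelihood over modes.

    pred_xy: (N, M, T, 2) predicted waypoints, pred_b: (N, M, T, 2) Laplace scales (assumed
    positive, e.g. via softplus), gt_xy: (N, T, 2) ground-truth futures for matched objects.
    """
    # Per-mode Laplace NLL summed over time and x/y: log(2b) + |x - mu| / b
    nll = (torch.log(2.0 * pred_b) + (gt_xy[:, None] - pred_xy).abs() / pred_b).sum(dim=(-1, -2))
    best = nll.argmin(dim=1)                       # winner mode per object
    return nll[torch.arange(len(best)), best].mean()
```

As noted above, this type of loss would be applied to the pose outputs of every refinement block, not only the final one.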
Comparison against the state-of-the-art
When evaluated against state-of-the-art joint detection and forecasting methods, DeTra is considerably more accurate. We notice that while other methods struggle with map understanding, DeTra does not, as shown in the visualizations below on Argoverse 2.
Beyond qualitative results, our experiments show that our model outperforms prior works on both the Argoverse 2 Sensor dataset and the Waymo Open Dataset. Please see our paper for the full table of results.
Refinement ablation
In the image below, we visualize the effect of refinement for a model trained with B = 3 blocks. At i = 0, we see the pose initializations with stationary forecasts; the output of the first refinement block demonstrates an approximate understanding of speed; the output of the second block better understands the map; and the third block refines the final output. We find that refinement improves detection and forecasting accuracy, particularly in map understanding.
Outputs of DeTra across the refinement blocks on a scene from Argoverse 2. We visualize all modes for all objects in a single image, with color representing the future time and opacity representing the product of detection and mode confidence.
This is reflected in the quantitative results, where refinement improves performance, with diminishing returns from the last block.
Below, we show the same visualization across an entire sequence:
In summary, we introduced DeTra, an object detection and trajectory forecasting model that (1) re-formulates detection and forecasting as a single trajectory refinement problem, (2) presents a flexible architecture that can handle heterogeneous inputs, and (3) attains stronger performance on Argoverse 2 and the Waymo Open Dataset (WOD) than state-of-the-art joint methods. We also provide extensive ablations which show that refinement is key, that leveraging geometric priors when attending to sensor information is important for learning, and that all of our proposed components and design choices matter for final performance. We hope that the ideas presented in our work will enable safer and simpler self-driving systems that are less susceptible to compounding errors across separate tasks and that are easily extensible to different sensor modalities.
@misc{casas2024detra,
title = {DeTra: A Unified Model for Object Detection and Trajectory Forecasting},
author = {Casas, Sergio and Agro, Ben and Mao, Jiageng and Gilles, Thomas and Cui, Alexander and Li, Thomas and Urtasun, Raquel},
year = {2024},
}