
DeTra: A Unified Model for Object Detection and Trajectory Forecasting

June 9, 2024 (updated July 2, 2024)


Sergio Casas*, Ben Agro*, Jiageng Mao*†, Thomas Gilles, Alexander Cui†, Thomas Li, Raquel Urtasun

* = equal contribution, † = work done while at Waabi

Conference: ECCV 2024
Category: Autonomy
Video
PDF
Supplementary

Abstract

The tasks of object detection and trajectory forecasting play a crucial role in understanding the scene for autonomous driving. These tasks are typically executed in a cascading manner, making them prone to compounding errors. Furthermore, there is usually a very thin interface between the two tasks, creating a lossy information bottleneck. To address these challenges, our approach formulates the union of the two tasks as a trajectory refinement problem, where the first pose is the detection (current time), and the subsequent poses are the waypoints of the multiple forecasts (future time). To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects directly from LiDAR point clouds and high-definition maps. We call this model DeTra, short for object Detection and Trajectory forecasting. In our experiments, we observe that DeTra outperforms the state-of-the-art on Argoverse 2 Sensor and Waymo Open Dataset by a large margin, across a broad range of metrics. Last but not least, we perform extensive ablation studies that show the value of refinement for this task, that every proposed component contributes positively to its performance, and that key design choices were well founded.

Video

Motivation

Object-based self-driving vehicle stacks produce object detections and multi-hypothesis forecasts to anticipate the potential future behaviors of those objects. A motion planner uses those forecasts to decide on a plan for the self-driving vehicle.

The traditional approach to this task is “Detection-Tracking-Prediction”, wherein sensory data are processed by an object detector to produce bounding boxes, the boxes are associated over time to produce historical tracks for each object, and the tracks and map are used to forecast future trajectories.

This paradigm has a few issues:

  1. The track representation bottlenecks information between the rich sensory data and the prediction task,
  2. this narrow interface causes compounding errors from one task to the next, and
  3. the modules are not optimized jointly, so the detector doesn’t know which mistakes hurt prediction, such as an incorrect detection heading.

Recently, “Joint Detection and Prediction” approaches have become a popular alternative that overcomes these issues. These methods process the LiDAR and map data to produce scene-level features, which are used for object detection. The scene features are cropped at each detected object to produce per-object embeddings, which, along with the map data, are used to decode future trajectories.

These joint methods mitigate the information bottleneck; however, they still rely on cascaded inference, where detection errors can propagate to prediction.

Method

Our method circumvents this established cascading approach by reformulating perception and trajectory forecasting into a single trajectory refinement task. The inputs to our model are initial object queries and poses that represent trajectories. These queries form a volume with an object dimension, a time dimension, and a mode dimension. The first index t = 0 of the time dimension represents detections, and t > 0 represents the forecasts. The mode dimension represents different possible futures for each object. A refinement transformer iteratively updates the query volume and poses by cross-attending to inputs such as LiDAR point clouds and HD maps. The refined detections and forecasts can be read off the poses at each block, and the final predictions are simply the poses from the last block.
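To make the query-volume formulation concrete, here is a minimal PyTorch-style sketch of the refinement loop. The tensor shapes, module names, and the single `scene_tokens` input (standing in for the separate LiDAR and map streams, and collapsing the factorized self-attention into one generic attention for brevity) are our illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RefinementBlock(nn.Module):
    """One refinement block: attend to scene tokens, then update the poses."""

    def __init__(self, dim: int):
        super().__init__()
        # Stand-in for DeTra's LiDAR/map cross-attention + factorized self-attention.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.pose_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, queries, poses, scene_tokens):
        # queries: (N objects, M modes, T times, D channels); poses: (N, M, T, 3)
        N, M, T, D = queries.shape
        q = queries.reshape(1, N * M * T, D)            # flatten the query volume
        attended, _ = self.cross_attn(q, scene_tokens, scene_tokens)
        queries = (q + attended).reshape(N, M, T, D)    # residual feature update
        poses = poses + self.pose_head(queries)         # refine (x, y, heading)
        return queries, poses

class DeTraRefinement(nn.Module):
    """Run B refinement blocks; intermediate poses are also supervised."""

    def __init__(self, num_blocks: int, dim: int):
        super().__init__()
        self.blocks = nn.ModuleList(RefinementBlock(dim) for _ in range(num_blocks))

    def forward(self, queries, poses, scene_tokens):
        intermediates = []
        for block in self.blocks:
            queries, poses = block(queries, poses, scene_tokens)
            intermediates.append(poses)
        return poses, intermediates  # t = 0 is the detection, t > 0 the forecasts
```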

Our approach uses the architecture shown below.

A more detailed architecture diagram is presented below. Specifically:

  1. We use modality-specific encoders to tokenize the heterogeneous input data.
  2. The query volume is initialized from learnable mode and time parameters, expanded into a 3D feature volume with object, time, and mode dimensions.
  3. We generate initial poses for each object token using a lightweight detection network, initializing all objects and modes to be stationary at their detected pose (a sketch of this initialization follows the list).
  4. The query volume is refined with cross-attention to LiDAR and map tokens, and factorized self-attention along the time, mode, and object dimensions.
  5. At the end of a transformer refinement block, the refined query volume is used to update the poses.
  6. The refined queries and poses are fed into subsequent transformer blocks; this is repeated B times.
  7. The final poses represent the detections and forecasts for all objects in the scene.
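To illustrate steps 2 and 3, the query volume can be built by broadcasting learnable mode and time embeddings against per-object features, with every mode and timestep initialized as stationary at the detected pose. The sketch below uses hypothetical shapes and a stand-in linear detection head, not the actual network.

```python
import torch
import torch.nn as nn

class QueryVolumeInit(nn.Module):
    """Sketch of query-volume and pose initialization (steps 2-3 above)."""

    def __init__(self, num_modes: int, num_times: int, dim: int):
        super().__init__()
        self.mode_embed = nn.Parameter(torch.randn(num_modes, dim))  # learnable modes
        self.time_embed = nn.Parameter(torch.randn(num_times, dim))  # learnable times
        self.det_head = nn.Linear(dim, 3)  # stand-in lightweight detection head

    def forward(self, object_feats):
        """object_feats: (N, D) features for N detected objects."""
        N, D = object_feats.shape
        M, T = self.mode_embed.shape[0], self.time_embed.shape[0]
        # Broadcast object x mode x time into a (N, M, T, D) feature volume.
        queries = (object_feats[:, None, None, :]
                   + self.mode_embed[None, :, None, :]
                   + self.time_embed[None, None, :, :])
        # Every mode and timestep starts stationary at the detected pose.
        det_pose = self.det_head(object_feats)                 # (N, 3): x, y, heading
        poses = det_pose[:, None, None, :].expand(N, M, T, 3).clone()
        return queries, poses
```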

We use three factorized self-attention operations for efficiency (a sketch follows the list below).

  • Object self-attention attends across all object queries within the same mode and time, to model interactions and predict consistent behaviors.
  • Mode self-attention attends across queries within the same object and time, to allow the modes to diversify and realistically cover the possible futures.
  • Time self-attention attends across queries within the same trajectory (object and mode), to allow for consistent poses over time.
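As a rough illustration, factorized self-attention can be implemented by folding the two dimensions that are not attended over into the batch dimension, so attention only mixes queries along one axis at a time. The function below is a minimal sketch under that assumption; the names and shapes are ours, not the paper's.

```python
import torch
import torch.nn as nn

def factorized_self_attention(queries: torch.Tensor,
                              attn: nn.MultiheadAttention,
                              axis: str) -> torch.Tensor:
    """Self-attention along one axis of the (N, M, T, D) query volume.

    The other two axes are folded into the batch dimension, so e.g. 'object'
    attention only mixes queries that share the same mode and timestep.
    """
    N, M, T, D = queries.shape
    if axis == "object":    # attend across objects, per (mode, time)
        x = queries.permute(1, 2, 0, 3).reshape(M * T, N, D)
    elif axis == "mode":    # attend across modes, per (object, time)
        x = queries.permute(0, 2, 1, 3).reshape(N * T, M, D)
    else:                   # "time": attend across time, per (object, mode)
        x = queries.reshape(N * M, T, D)
    out, _ = attn(x, x, x)  # attn built with batch_first=True
    x = x + out             # residual connection
    # Undo the folding back to (N, M, T, D)
    if axis == "object":
        return x.reshape(M, T, N, D).permute(2, 0, 1, 3)
    if axis == "mode":
        return x.reshape(N, T, M, D).permute(0, 2, 1, 3)
    return x.reshape(N, M, T, D)
```

Here `attn` is assumed to be an `nn.MultiheadAttention(D, 8, batch_first=True)`; within each refinement block, the three calls run in sequence over the time, mode, and object axes.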

Similarly, we use specialized cross-attention mechanisms for our LiDAR and map inputs, which we found aid both learning and efficiency (a sketch follows the list):

  • In LiDAR attention, each object query attends to the LiDAR tokens. To make this efficient, we use the pose associated with the object query in deformable attention, predicting several offsets originating from that pose and limiting the attention to tokens interpolated at those offset points.
  • In map attention, we similarly limit cross-attention to the k-nearest map tokens, using the object poses over multiple time steps and modes.
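The deformable LiDAR attention can be pictured as follows: each query predicts a few 2D offsets around its current pose, samples a bird's-eye-view (BEV) LiDAR feature map at those points by bilinear interpolation, and aggregates the samples with learned weights. The sketch below assumes a dense, origin-centered square BEV grid and single-head aggregation; all names and shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableLidarAttention(nn.Module):
    """Sketch: each query samples BEV LiDAR features near its current pose."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.offsets = nn.Linear(dim, num_points * 2)  # 2D offsets from the pose
        self.weights = nn.Linear(dim, num_points)      # attention weight per sample
        self.num_points = num_points

    def forward(self, queries, poses_xy, bev_feats, bev_extent):
        """
        queries:    (Q, D) flattened query volume
        poses_xy:   (Q, 2) query (x, y) positions in metric BEV coordinates
        bev_feats:  (1, D, H, W) LiDAR feature grid
        bev_extent: half-size of the BEV grid in meters (square, centered at origin)
        """
        Q, D = queries.shape
        offsets = self.offsets(queries).view(Q, self.num_points, 2)
        sample_xy = poses_xy[:, None, :] + offsets           # points around the pose
        # Normalize to [-1, 1] for grid_sample (x -> grid width, y -> grid height).
        grid = (sample_xy / bev_extent).view(1, Q, self.num_points, 2)
        sampled = F.grid_sample(bev_feats, grid, align_corners=False)  # (1, D, Q, P)
        sampled = sampled[0].permute(1, 2, 0)                # (Q, P, D)
        w = self.weights(queries).softmax(dim=-1)            # (Q, P)
        return queries + (w[..., None] * sampled).sum(dim=1)  # weighted aggregation
```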

To train DeTra, we apply a DETR-like detection loss and a Laplacian winner-takes-all forecasting loss to the pose outputs of the refinement transformer, including the intermediate outputs.

Additionally, we found it necessary to provide a good pose initialization by supervising the initial object poses with a heatmap detection loss.
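As a rough illustration of the forecasting term only (the DETR-like detection loss and heatmap supervision are omitted), a Laplacian winner-takes-all loss backpropagates the negative log-likelihood only through the mode closest to the ground truth. The sketch below assumes per-waypoint Laplace parameters and a mean-displacement winner criterion; it is not the exact training code.

```python
import torch

def laplace_wta_loss(mu: torch.Tensor, b: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Winner-takes-all Laplace NLL over modes.

    mu, b: (N, M, T, 2) predicted Laplace location and scale per waypoint
           (b > 0, e.g. from a softplus)
    gt:    (N, T, 2) ground-truth future waypoints
    """
    # Laplace negative log-likelihood per object, mode, timestep, coordinate.
    nll = torch.log(2 * b) + (gt[:, None] - mu).abs() / b   # (N, M, T, 2)
    per_mode = nll.sum(dim=(2, 3))                          # (N, M)
    # "Winner takes all": pick the best mode per object by displacement error.
    dist = (gt[:, None] - mu).norm(dim=-1).mean(dim=-1)     # (N, M)
    winner = dist.argmin(dim=1)                             # (N,)
    return per_mode.gather(1, winner[:, None]).mean()
```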

Results

Comparison against the state-of-the-art

When evaluated against state-of-the-art joint detection and forecasting methods, DeTra is considerably more accurate. We notice that while other methods struggle with map understanding, DeTra does not, as shown in the visualizations below on Argoverse 2.

Beyond qualitative results, our experiments show that our model outperforms prior works on both the Argoverse 2 Sensor dataset and the Waymo Open Dataset. Please see our paper for the full table of results.

Refinement ablation

In the image below, we visualize the effect of refinement for a model trained with B = 3 blocks. At i = 0, we see the pose initializations with stationary forecasts; the output of the first refinement block demonstrates an approximate understanding of speed; the output of the second block better understands the map; and the third block refines the final output. We find that refinement improves detection and forecasting accuracy, particularly in map understanding.

Outputs of DeTra across the refinement blocks on a scene from Argoverse 2. We visualize all modes for all objects in a single image, with color representing the future time and opacity representing the product of detection and mode confidence.

This is reflected in the quantitative results, where refinement improves performance, with diminishing returns by the last block.

Below, we show the same visualization across an entire sequence:

Conclusion

In summary, we introduced DeTra, an object detection and trajectory forecasting model that (1) re-formulates detection and forecasting as a single trajectory refinement problem, (2) presents a flexible architecture that can handle heterogeneous inputs, and (3) attains stronger performance on Argoverse 2 and WOD than state-of-the-art joint methods. We also provide extensive ablations showing that refinement is key, that leveraging geometric priors when attending to sensor information is important for learning, and that all of our proposed components and design choices matter for final performance. We hope that the ideas presented in our work will enable safer and simpler self-driving systems that are less susceptible to compounding errors across separate tasks and that are easily extensible to different sensor modalities.

BibTeX

@inproceedings{casas2024detra,
    title     = {DeTra: A Unified Model for Object Detection and Trajectory Forecasting},
    author    = {Casas, Sergio and Agro, Ben and Mao, Jiageng and Gilles, Thomas and Cui, Alexander and Li, Thomas and Urtasun, Raquel},
    booktitle = {European Conference on Computer Vision (ECCV)},
    year      = {2024},
}