Perception, Motion Forecasting

LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds

October 31, 2023 (updated November 3, 2023)


Anqi Joyce Yang, Sergio Casas, Nikita Dvornik, Sean Segal, Yuwen Xiong, Jordan Sir Kwang Hu, Carter Fang, Raquel Urtasun


A major bottleneck to scaling up the training of self-driving perception systems is the human annotation required for supervision. A promising alternative is to leverage “auto-labelling” offboard perception models that are trained to automatically generate annotations from raw LiDAR point clouds at a fraction of the cost. Auto-labels are most commonly generated via a two-stage approach – first, objects are detected and tracked over time, and then each object trajectory is passed to a learned refinement model to improve accuracy. Since existing refinement models are overly complex and lack advanced temporal reasoning capabilities, in this work we propose LabelFormer, a simple, efficient, and effective trajectory-level refinement approach. Our approach first encodes each frame’s observations separately, then exploits self-attention to reason about the trajectory with full temporal context, and finally decodes the refined object size and per-frame poses. Evaluation on both urban and highway datasets demonstrates that LabelFormer outperforms existing works by a large margin. Finally, we show that training on a dataset augmented with auto-labels generated by our method leads to improved downstream detection performance compared to existing methods.


Given raw LiDAR point cloud sequences, LabelFormer generates high-quality bounding box trajectories for objects in the scene by performing trajectory-level refinement efficiently. The auto-labels can be used to augment an annotated dataset to train and improve the performance of downstream object detectors.



Modern self-driving systems typically require a large labelled dataset to train. Because human labelling is manual and expensive, there is growing interest in developing auto-labelling systems that produce high-quality labels at a much lower cost. In this work, we study the setting where a set of ground-truth labels is available to train an auto-labeller from LiDAR data. This setting is also referred to as offboard perception, which, unlike onboard perception, has access to future observations and is not subject to real-time constraints.


An existing two-stage offboard perception paradigm first employs an object detector and a multi-object tracker to derive coarse object trajectory initializations, and then refines each object trajectory separately. LabelFormer aims to generate higher-quality auto-labels by tackling the second-stage object trajectory refinement.

To refine an object bounding box trajectory, existing works employ a window-based approach where neural networks are applied independently at each time step over a limited temporal context. This approach is restrictive in terms of temporal context, and computationally inefficient as features from the same frame are extracted multiple times due to overlapping windows.
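The redundancy of the window-based scheme is easy to quantify: with one window centered at every time step, an interior frame is re-encoded once per overlapping window. A minimal sketch (the window size and trajectory length here are illustrative, not taken from any particular prior work):

```python
def frame_encoding_counts(num_frames, window_size):
    """Count how many times each frame is encoded under a sliding-window
    scheme: one window centered at every time step, clipped at the ends."""
    counts = [0] * num_frames
    half = window_size // 2
    for center in range(num_frames):
        start = max(0, center - half)
        end = min(num_frames, center + half + 1)
        for t in range(start, end):
            counts[t] += 1
    return counts

# A 10-frame trajectory with window size 5: every interior frame is
# encoded 5 times under the window-based scheme, versus exactly once
# under trajectory-level refinement.
counts = frame_encoding_counts(10, 5)
print(counts)       # [3, 4, 5, 5, 5, 5, 5, 5, 4, 3]
print(sum(counts))  # 44 per-frame encodings, vs. 10 for a single pass
```

The total work grows with the window size, while a single trajectory-level pass encodes each frame exactly once regardless of context length.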

By contrast, LabelFormer performs trajectory-level refinement by taking the full trajectory as input, and outputting all refined bounding boxes at the same time.

Specifically, we design a transformer-based architecture, which first encodes the initial bounding box parameters and the LiDAR points at each time step independently, and then utilizes self-attention blocks to perform temporal reasoning, and finally decodes the refined bounding box parameters at each time step.
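The three stages can be sketched as follows. This is a minimal NumPy toy with a single attention head, random stand-in weights, naive mean-pooling of points, and invented dimensions; the actual model's per-frame encoders, attention stack, and box parameterization differ.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 12, 32  # trajectory length and feature dimension (illustrative)

def encode_frame(box_params, points, W_box, W_pts):
    """Stage 1 (per frame, independent): embed the initial bounding box
    parameters and a pooled LiDAR point feature into one token."""
    box_feat = box_params @ W_box           # (D,)
    pts_feat = points.mean(axis=0) @ W_pts  # naive pooling stand-in
    return box_feat + pts_feat

def self_attention(X, W_q, W_k, W_v):
    """Stage 2: single-head self-attention across all T frame tokens,
    so every frame can attend to the full temporal context."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

# Random stand-ins for learned weights and per-frame inputs.
W_box, W_pts = rng.normal(size=(7, D)), rng.normal(size=(3, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))
W_dec = rng.normal(size=(D, 7))  # 7 box params: x, y, z, l, w, h, yaw

tokens = np.stack([
    encode_frame(rng.normal(size=7), rng.normal(size=(50, 3)), W_box, W_pts)
    for _ in range(T)
])                                             # (T, D): one token per frame
fused = self_attention(tokens, W_q, W_k, W_v)  # (T, D): temporally fused
refined = fused @ W_dec                        # Stage 3: per-frame refined boxes
print(refined.shape)                           # (12, 7)
```

Because all refined boxes come out of one forward pass, the whole trajectory is refined jointly rather than window by window.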

By design, LabelFormer is simple, as it consists of a single model that jointly refines size and poses for both static and dynamic objects; effective, as it leverages full temporal context to produce more accurate refined bounding boxes; and efficient, as it only needs to be applied once per trajectory, with no redundant computation.

Auto-labelling Quality

We experiment with two real-world datasets: Argoverse 2 Sensor for the urban setting, and an in-house Highway dataset for highway driving. To directly evaluate the object trajectories, we compute the mean IoU between detected bounding boxes and ground-truth bounding boxes for each object trajectory, and then average across all objects. We plot the mean IoU gain of each auto-labelling refinement method over the initialization and show that LabelFormer significantly outperforms state-of-the-art methods (on average 92% larger mean IoU gains) across both datasets and different first-stage initializations.
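The metric above can be sketched in a few lines. For brevity this toy uses axis-aligned 2D boxes; the paper's evaluation uses oriented 3D boxes, but the two-level averaging (over frames within an object, then over objects) is the same:

```python
def iou_2d(a, b):
    """IoU between two axis-aligned boxes given as (x1, y1, x2, y2).
    A 2D stand-in for the oriented 3D IoU used in the actual evaluation."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def trajectory_metric(trajectories):
    """Per object: mean IoU between refined and ground-truth boxes over
    its frames; then average across all objects."""
    per_object = [
        sum(iou_2d(pred, gt) for pred, gt in frames) / len(frames)
        for frames in trajectories
    ]
    return sum(per_object) / len(per_object)

# Two toy trajectories of (predicted, ground-truth) box pairs.
trajs = [
    [((0, 0, 2, 2), (0, 0, 2, 2)), ((0, 0, 2, 2), (1, 0, 3, 2))],
    [((0, 0, 4, 4), (0, 0, 4, 4))],
]
print(trajectory_metric(trajs))  # (mean(1, 1/3) + 1) / 2 = 5/6
```

The reported "mean IoU gain" is then simply this metric computed after refinement minus the same metric computed on the first-stage initialization.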

We next show qualitative results before and after refinement with LabelFormer and state-of-the-art methods. We color the ground-truth bounding boxes in magenta and detected bounding boxes in orange.

Run-time Efficiency

We measure model inference time and average over object trajectories. In particular, on the Highway dataset, which contains mostly dynamic objects, LabelFormer, which performs trajectory-level refinement, is 2.7 times faster than 3DAL, which applies window-based refinement for dynamic objects. On the Argoverse dataset, with 52% static objects, LabelFormer is still slightly faster than 3DAL, even though the latter's static branch is not window-based.

Training Downstream Detector With Auto-labels

To show that the auto-labels are actually useful, we apply different offboard perception methods to auto-label 500 additional Highway sequences. We augment the Highway training dataset, consisting of 150 human-labelled sequences, with these 500 auto-labelled sequences to train a downstream onboard object detector. In the figure below, we show the Average Precision (AP) gain of the downstream detector when trained with the augmented auto-labelled dataset vs. trained only with the smaller human-labelled dataset. The figure shows that training with auto-labels yields positive gains in general. In particular, when trained with the additional LabelFormer auto-labels, the downstream object detector achieves the largest AP gain, especially at higher IoU thresholds.


To understand the effect of each component, we conduct ablation studies on the Highway dataset with VoxelNeXt initialization. We plot mean IoU gains over initialization.

On the left we ablate encoder input and augmentation. The results show that bounding box encoding, point cloud encoding and the trajectory perturbation augmentation all contribute to the final success of LabelFormer.

On the right we study the effect of temporal context length. Specifically, we train and evaluate LabelFormer in a window-based fashion to restrict the available temporal context. Using a window size of 10 yields a 10.2% relative mean IoU gain over a window size of 5, while a window size of 20 and the full temporal context yield 14.0% and 22.7% relative gains over a window size of 5, respectively.

We next study the architecture design of the cross-frame attention module. The right side of the plot shows that the mean IoU gain increases steadily with more self-attention blocks. To further understand the effect of self-attention, we replace it with an MLP-based module consisting of a linear layer, ReLU and LayerNorm. The results below show that the self-attention module with 6 blocks outperforms the MLP architecture by a large margin, demonstrating the benefits of attention with a 40.9% higher relative gain in mean IoU.
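For intuition, the MLP substitute in the ablation can be sketched as below. The dimensions are invented, and the key point is structural: each frame token is transformed independently, so no information flows across time, which is the capability self-attention provides.

```python
import numpy as np

def mlp_block(X, W, b):
    """The ablation's MLP substitute: linear -> ReLU -> LayerNorm, applied
    to each frame token independently. Unlike self-attention, no row (frame)
    ever sees another row, so there is no temporal reasoning."""
    H = np.maximum(X @ W + b, 0.0)          # linear + ReLU, per frame
    mu = H.mean(axis=1, keepdims=True)
    sigma = H.std(axis=1, keepdims=True) + 1e-5
    return (H - mu) / sigma                 # LayerNorm over the feature axis

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 32))               # 12 frame tokens (illustrative)
out = mlp_block(X, rng.normal(size=(32, 32)), np.zeros(32))
print(out.shape)                            # (12, 32): same shape as input,
                                            # but each row computed in isolation
```

Stacking such blocks deepens per-frame processing without ever mixing frames, which is consistent with it trailing self-attention on dynamic trajectories.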


In this work, we study the trajectory refinement problem in a two-stage LiDAR-based offboard perception paradigm. Our proposed method, LabelFormer, is a single transformer-based model that leverages full temporal context of the trajectory, fusing information from the initial path as well as the LiDAR observations. Compared to prior works, our method is simple, achieves much higher accuracy and runs faster in both urban and highway domains. With the ability to auto-label a larger dataset effectively and efficiently, LabelFormer helps boost downstream perception performance, and unleashes the possibility for better autonomy systems.


@inproceedings{yang2023labelformer,
        title     = {LabelFormer: Object Trajectory Refinement for Offboard Perception from Li{DAR} Point Clouds},
        author    = {Anqi Joyce Yang and Sergio Casas and Nikita Dvornik and Sean Segal and Yuwen Xiong and Jordan Sir Kwang Hu and Carter Fang and Raquel Urtasun},
        booktitle = {7th Annual Conference on Robot Learning},
        year      = {2023},
        url       = {}
}