FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection

CoRL 2025

Anqi Joyce Yang*, James Tu*, Nikita Dvornik, Enxu Li, Raquel Urtasun


In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction workers) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first multi-modal 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and then refines them with attention, in particular to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multi-modal fusion designs leads to large gains for long-tailed 3D detection.

Video


Motivation

To operate safely on the road, self-driving vehicles need to detect objects from both common classes (e.g., car, truck) and rare classes (e.g., construction worker, debris). However, because long-tailed classes are underrepresented in driving data, state-of-the-art object detectors struggle to recognize them. Vision foundation models, on the other hand, are trained on a large corpus of internet images and bring promising prior knowledge for long-tailed classes. In this paper, we tackle long-tailed 3D object detection by leveraging pre-trained vision foundation models, namely OWLv2 for 2D object detection and Metric3Dv2 for monocular depth estimation. These vision foundation models, however, only process 2D images, whereas a state-of-the-art multi-modal 3D detector also takes LiDAR data as input. This calls for a new multi-modal fusion model that handles both LiDAR and camera sensor data, as well as new representations such as 2D object detections, dense monocular depths, and image features from vision foundation models. Towards this goal, we propose FOMO-3D, the first multi-modal 3D object detector that leverages vision foundation models in the closed-set 3D detection setting.


Method

Architecture Overview

FOMO-3D takes raw LiDAR and camera data as input, and utilizes three types of vision foundation model outputs: 2D image detections and image features from OWL, and pixel-level dense depths from Metric3D (M3D). Built upon a DETR-like two-stage detection paradigm, FOMO-3D first generates detection proposals from two complementary branches. The LiDAR branch processes input point clouds to generate accurate 3D detections. Complementary to LiDAR, the novel camera branch generates proposals for rare or small objects that are better distinguished in the image: it lifts OWL detections into 3D using dense M3D depths and a novel frustum-based fusion module. The LiDAR-based and camera-based proposals are concatenated and de-duplicated with non-maximum suppression (NMS). We then refine the multi-modal proposals through query-based detection to incorporate additional information from LiDAR, OWL features, and object relationships. Finally, queries are decoded into object classes and BEV bounding boxes.
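The sketch below summarizes this two-stage flow in Python. The function signature and module interfaces (lidar_branch, camera_branch, refinement, nms) are illustrative assumptions for exposition, not the released implementation.

# A minimal sketch of the FOMO-3D forward pass described above.
# Module interfaces (lidar_branch, camera_branch, refinement, nms) are assumptions.
import torch

def fomo3d_forward(lidar_points, images, owl_boxes_2d, owl_tokens, m3d_depths,
                   lidar_branch, camera_branch, refinement, nms):
    # 1) LiDAR branch: accurate 3D proposals from the point cloud.
    lidar_proposals = lidar_branch(lidar_points)                    # (N_l, D)
    # 2) Camera branch: lift OWL 2D detections to 3D with M3D depths.
    camera_proposals = camera_branch(owl_boxes_2d, owl_tokens,
                                     m3d_depths, lidar_points)      # (N_c, D)
    # 3) Concatenate both proposal sets and de-duplicate with NMS.
    proposals = nms(torch.cat([lidar_proposals, camera_proposals], dim=0))
    # 4) Query-based refinement with LiDAR, OWL features, and object relations,
    #    then decode object classes and BEV boxes.
    boxes, classes = refinement(proposals, lidar_points, images, owl_tokens)
    return boxes, classes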

LiDAR Proposal Branch

The LiDAR proposal branch follows the one-stage CenterPoint architecture. It takes the current and a few past LiDAR sweeps as input, and outputs 3D detection proposals.
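For concreteness, the snippet below sketches the standard CenterPoint-style multi-sweep input encoding, where past sweeps are transformed into the current ego frame and stacked with a per-point relative-time channel; the exact input featurization in FOMO-3D may differ.

# A minimal sketch of multi-sweep LiDAR aggregation, assuming each sweep is an
# (N, 3) xyz array already transformed into the current ego frame.
import numpy as np

def aggregate_sweeps(sweeps_xyz, sweep_times, current_time):
    """Stack past sweeps with a per-point relative timestamp feature."""
    stacked = []
    for xyz, t in zip(sweeps_xyz, sweep_times):
        dt = np.full((xyz.shape[0], 1), current_time - t, dtype=np.float32)
        stacked.append(np.concatenate([xyz.astype(np.float32), dt], axis=1))
    return np.concatenate(stacked, axis=0)   # (sum_N, 4): x, y, z, dt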

Camera Proposal Branch

Given the high-quality zero-shot 2D detections from the OWL model, we learn to lift them directly into 3D boxes. We design a novel frustum-based camera proposal branch that lifts 2D OWL detections into 3D using dense M3D depths and LiDAR features. Specifically, for each OWL 2D detection, we first lift it to 3D using the M3D depth and initialize an object query with the respective OWL feature token. Due to M3D depth errors, the lifted 3D position might not be accurate. To refine the initial 3D position, we leverage explicit geometry information from LiDAR data and implicit geometry information from the M3D image depths by cross-attending each detection token to LiDAR and image features in the object frustum space. For more details, please see the method section of the paper.
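The sketch below illustrates only the initial lifting step: back-projecting a 2D box center with a depth estimate through the pinhole camera model. The median-depth heuristic and variable names are assumptions for illustration; the learned frustum cross-attention refinement is not shown.

# A minimal sketch of lifting a 2D OWL detection to an initial 3D position using
# a Metric3D depth map and camera intrinsics.  The median-depth heuristic is an
# assumption, not necessarily what FOMO-3D uses.
import numpy as np

def lift_box_to_3d(box_xyxy, depth_map, K):
    """box_xyxy: (x1, y1, x2, y2) pixels; depth_map: (H, W) metric depths; K: 3x3 intrinsics."""
    x1, y1, x2, y2 = [int(round(v)) for v in box_xyxy]
    patch = depth_map[y1:y2, x1:x2]
    z = float(np.median(patch))                 # robust depth estimate for the box
    u, v = 0.5 * (x1 + x2), 0.5 * (y1 + y2)     # box center in pixels
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx                       # back-project with the pinhole model
    y = (v - cy) * z / fy
    return np.array([x, y, z], dtype=np.float32)  # 3D point in the camera frame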

Refinement Stage

After the multi-modal proposal stage, we have proposals from both the LiDAR and camera branches. We aggregate them by concatenation and deduplicate them with non-maximum suppression (NMS). The refinement stage then refines each proposal through object attention, LiDAR attention, and camera attention, and decodes the final 3D bounding box and object class from the refined feature.
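A minimal sketch of such a refinement layer is shown below, assuming standard multi-head attention blocks; the layer sizes, single-layer design, and feature interfaces are illustrative assumptions rather than the actual architecture.

# A minimal sketch of the refinement stage as a query-based decoder layer:
# object self-attention, cross-attention to LiDAR features, and cross-attention
# to OWL image features, followed by box/class heads.  Hyperparameters are assumptions.
import torch
import torch.nn as nn

class RefinementStage(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_classes=18):
        super().__init__()
        self.obj_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lidar_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cam_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.box_head = nn.Linear(d_model, 7)        # x, y, z, l, w, h, yaw
        self.cls_head = nn.Linear(d_model, n_classes)

    def forward(self, queries, lidar_feats, owl_feats):
        # queries: (B, N, d); lidar_feats: (B, M, d); owl_feats: (B, K, d)
        q, _ = self.obj_attn(queries, queries, queries)      # object relationships
        q, _ = self.lidar_attn(q, lidar_feats, lidar_feats)  # LiDAR geometry
        q, _ = self.cam_attn(q, owl_feats, owl_feats)        # OWL semantic features
        return self.box_head(q), self.cls_head(q)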


Quantitative Results

We conduct thorough experiments on the urban driving dataset nuScenes and on an in-house highway dataset, both with heavily imbalanced real-world object class distributions.

In particular, the nuScenes dataset features six cameras, has 18 semantic classes in the evaluation set, and focuses on a 50m detection range of interest around the self-driving vehicle (SDV) in the urban setting. In contrast, the Highway dataset focuses on the long-range highway setting with a single front-view camera, and features five classes.

On the nuScenes dataset, following previous work, we divide the 18 classes into three groups (Many, Medium, Few) based on how common they are. We compare FOMO-3D with state-of-the-art 3D object detectors and long-tailed 3D detection methods.

Our results show that FOMO-3D outperforms all existing methods on every aggregated object group. Not only does FOMO-3D boost the mAP of the Few group from the previous best of 20.0 to 27.6, it also performs better on small objects in the Many group, e.g., cones and adults.

Furthermore, per-class mAP results show that FOMO-3D surpasses previous long-tailed 3D detection methods MMF and MMLF for almost every object class.

On the Highway dataset, FOMO-3D again outperforms several state-of-the-art methods by a large margin, especially on rare classes.

To understand performance at different ranges, we introduce three distance buckets relative to the self-driving vehicle, [0, 50], [50, 200], and [200, 230] meters, representing near-range, mid-range, and far-range detection. The bar plot below illustrates the gain of different models over the LiDAR-only baseline at each detection range. The full FOMO-3D model outperforms the baselines especially for rare classes and at long range.
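As a concrete reference, the helper below shows one way to assign a detection to these buckets, assuming range is measured as the BEV distance from the box center to the SDV; this is an assumption for illustration, not necessarily the exact evaluation protocol.

# A small helper assigning a detection to a range bucket, assuming range is the
# BEV (x, y) distance from the box center to the self-driving vehicle.
import numpy as np

RANGE_BUCKETS = {"near": (0.0, 50.0), "mid": (50.0, 200.0), "far": (200.0, 230.0)}

def bucket_of(box_center_xy):
    r = float(np.linalg.norm(np.asarray(box_center_xy)[:2]))
    for name, (lo, hi) in RANGE_BUCKETS.items():
        if lo <= r < hi:
            return name
    return None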


Qualitative Results

We show a few qualitative examples from both the nuScenes dataset and the Highway dataset below.

In the nuScenes example below, OWL successfully detects the child but has a false positive cone. The LiDAR-only model mis-classifies the child as an adult. FOMO-3D manages to fuse multi-modal information and foundation model priors to generate an accurate 3D bounding box of the child, while rejecting the false positive cone.

Another nuScenes example below shows that the LiDAR-only model fails to detect a cone and a construction vehicle, while FOMO-3D is able to detect and classify them successfully.

This nuScenes example shows that the LiDAR-only model fails to detect an adult and a construction worker and also outputs a false positive bicycle, while the OWL model outputs spurious boxes for vehicles on the right. FOMO-3D is able to detect the adult and the construction worker successfully, while rejecting the false positive bicycle and spurious vehicle detections.

The Highway example below shows that FOMO-3D is able to correct many errors from the OWL and LiDAR-only models, but still mis-classifies a truck as a trailer.

Another Highway example below shows a very dense traffic scene on the highway. OWL produces impressive zero-shot 2D detections, but also a few false positives. The LiDAR-only model mis-classifies a vehicle as a cyclist and produces a false positive cone. FOMO-3D classifies most objects correctly without introducing many false positives. All three detectors miss two cones on the right.

For more details, please refer to the appendix of the paper.


Conclusion

In this paper, we propose FOMO-3D, the first multi-modal 3D detector that leverages vision foundation models for closed-set 3D object detection. Specifically, our two-stage model incorporates image-based detections and features from OWL and monocular metric depths from Metric3D, through a novel camera-based proposal branch and camera cross-attention in the refinement stage. On both urban and highway datasets, FOMO-3D outperforms SOTA 3D detectors and LT3D methods, especially on long-tailed classes and long-range objects, and ablation experiments validate the effectiveness of foundation model priors and our multi-modal fusion design. By applying powerful foundation models to a downstream long-tailed 3D object detection problem, FOMO-3D is a step towards safer autonomy systems capable of generalizing to rare or unseen events.


BibTeX

@InProceedings{yang2025fomo3d,
  title = {FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection},
  author = {Yang, Anqi Joyce and Tu, James and Dvornik, Nikita and Li, Enxu and Urtasun, Raquel},
  booktitle = {Proceedings of The 9th Conference on Robot Learning},
  pages = {5526--5556},
  year = {2025},
  editor = {Lim, Joseph and Song, Shuran and Park, Hae-Won},
  volume = {305},
  series = {Proceedings of Machine Learning Research},
  month = {27--30 Sep},
  publisher = {PMLR},
  url = {https://proceedings.mlr.press/v305/yang25e.html},
}