
Oyster: Towards Unsupervised Object Detection from LiDAR Point Clouds

June 2, 2023 (updated July 25, 2023)

Lunjun Zhang, Anqi Joyce Yang, Yuwen Xiong, Sergio Casas, Bin Yang, Mengye Ren, Raquel Urtasun

Abstract

In this paper, we study the problem of unsupervised object detection from 3D point clouds in self-driving scenes. We present a simple yet effective method that exploits (i) point clustering in near-range areas where the point clouds are dense, (ii) temporal consistency to filter out noisy unsupervised detections, (iii) the translation equivariance of CNNs to extend the auto-labels to long range, and (iv) self-supervision for improving on its own. Our approach, OYSTER (Object Discovery via Spatio-Temporal Refinement), does not impose constraints on data collection (such as repeated traversals of the same location), is able to detect objects in a zero-shot manner without supervised fine-tuning (even in sparse, distant regions), and continues to self-improve given more rounds of iterative self-training. To better measure model performance in self-driving scenarios, we propose a new planning-centric perception metric based on distance-to-collision. We demonstrate that our unsupervised object detector significantly outperforms unsupervised baselines on PandaSet and the Argoverse 2 Sensor dataset, showing promise that self-supervision combined with object priors can enable object discovery in the wild.
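
As a rough illustration of what a distance-to-collision quantity looks like (a generic sketch, not the paper's exact metric definition), the snippet below computes how far the ego vehicle would travel along a planned trajectory before first intersecting a detected BEV box. The waypoint format, the circle-vs-box collision test, and the ego_radius parameter are all illustrative assumptions.

import numpy as np

def distance_to_collision(traj_xy, boxes, ego_radius=1.0):
    """traj_xy: (T, 2) future ego waypoints; boxes: (M, 5) BEV boxes
    as (cx, cy, length, width, yaw). Returns the arc length travelled
    until the first collision, or np.inf if the trajectory is clear."""
    dist = 0.0
    for t in range(len(traj_xy)):
        if t > 0:
            dist += np.linalg.norm(traj_xy[t] - traj_xy[t - 1])
        for cx, cy, length, width, yaw in boxes:
            # Rotate the ego position into the box frame, then clamp to the
            # box extents to find its closest point (circle-vs-rectangle test).
            dx, dy = traj_xy[t] - np.array([cx, cy])
            c, s = np.cos(yaw), np.sin(yaw)
            lx, ly = c * dx + s * dy, -s * dx + c * dy
            px = np.clip(lx, -length / 2, length / 2)
            py = np.clip(ly, -width / 2, width / 2)
            if (lx - px) ** 2 + (ly - py) ** 2 <= ego_radius ** 2:
                return dist
    return np.inf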

Method

OYSTER stands for Object Discovery via Spatio-Temporal Refinement. It has two phases of training: an initial bootstrapping phase and a self-improvement phase.

The initial bootstrapping phase takes advantage of the fact that near-range point clouds tend to be dense and exhibit clear object clusters, so we can obtain reasonable near-range bounding-box seed pseudo-labels via point clustering. Thanks to the translation equivariance of convolutional networks, a CNN detector trained on these near-range labels can generalize to longer ranges in a zero-shot manner, with the help of data augmentations such as ray dropping, which randomly sparsifies the inputs to mimic the point density observed at long range.
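
To make the bootstrapping step concrete, here is a minimal sketch of its two ingredients: clustering near-range points into seed boxes, and ray dropping as an augmentation. The DBSCAN parameters, range cutoff, PCA-based box fitting, and drop probability are illustrative assumptions, not the paper's exact recipe.

import numpy as np
from sklearn.cluster import DBSCAN

def near_range_seed_boxes(points, max_range=40.0, eps=0.7, min_samples=10):
    """Cluster near-range LiDAR points (N, 3) into oriented BEV seed boxes
    (cx, cy, length, width, yaw)."""
    xy = points[:, :2]
    near = xy[np.linalg.norm(xy, axis=1) < max_range]
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(near)
    boxes = []
    for k in set(labels) - {-1}:               # label -1 marks noise points
        pts = near[labels == k]
        center = pts.mean(axis=0)
        # PCA of the cluster gives a BEV heading; projecting the points
        # onto the principal axes gives the box extents.
        _, _, vh = np.linalg.svd(pts - center, full_matrices=False)
        local = (pts - center) @ vh.T
        length = local[:, 0].max() - local[:, 0].min()
        width = local[:, 1].max() - local[:, 1].min()
        yaw = np.arctan2(vh[0, 1], vh[0, 0])
        boxes.append((center[0], center[1], length, width, yaw))
    return boxes

def random_ray_drop(points, drop_prob=0.3, num_bins=2048):
    """Randomly drop azimuth bins to sparsify a sweep, crudely mimicking
    the lower point density a detector sees at long range."""
    azim = np.arctan2(points[:, 1], points[:, 0])
    bins = ((azim + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    keep = np.random.rand(num_bins) > drop_prob
    return points[keep[bins]]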

The self-improvement phase uses the temporal consistency of object tracks as a self-supervision signal. We propose a Track, Refine, Retrain, Repeat framework (a minimal sketch of the refinement step follows the list):

  1. Given noisy detections across time, we employ an unsupervised offline tracker to find object tracks of various lengths;
  2. We discard short tracks and refine long tracks to prepare pseudo-labels for the next round of self-training;
  3. An object track should have the same physical object size across time, so our refinement process uses track-level information to update pseudo-labels in long tracks;
  4. We train a new detector on the updated pseudo-labels, dump its outputs as new pseudo-labels, track, refine, retrain, and repeat.
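
A minimal sketch of steps 2 and 3 follows. It assumes each track arrives as a (T, 5) array of per-frame boxes; the min_len threshold and the median-size rule are illustrative choices, and the actual refinement also has to decide how a resized box is repositioned, which this sketch ignores.

import numpy as np

def refine_tracks(tracks, min_len=10):
    """tracks: list of (T, 5) arrays with rows (cx, cy, length, width, yaw),
    one per object track from the offline tracker. Returns refined tracks
    to serve as pseudo-labels for the next self-training round."""
    refined = []
    for boxes in tracks:
        if len(boxes) < min_len:      # short tracks are likely noise
            continue
        boxes = boxes.copy()
        # A physical object has a single size: snap every frame's box to
        # the track-level median length and width.
        boxes[:, 2:4] = np.median(boxes[:, 2:4], axis=0)
        refined.append(boxes)
    return refined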

Comparison against point clustering

Below are the bird's-eye-view (BEV) outputs of our unsupervised object detector (OYSTER), trained without any labels. We also visualize the point clustering outputs used for the initial bootstrapping that kickstarts our self-training process. Note that both the model outputs and the ground-truth labels are class-agnostic bounding boxes.

Comparison against state-of-the-art

Below we compare our detector to MODEST, the previous state-of-the-art method for unsupervised detection:

Evolution of Self-Training Pseudo-Labels

We visualize the evolution of our self-training pseudo-labels in bird's-eye view, from the initial point clustering step to the subsequent self-training iterations.

As the visualization above shows, our iterative self-training starts from very noisy pseudo-labels but manages to remove false positives, recover missed objects, and improve the quality of the unsupervised bounding boxes.

Conclusion

We have proposed a novel method, OYSTER, for unsupervised object detection from LiDAR point clouds. Using weak object priors (near-range point clustering) as a bootstrapping step, our method can train an object detector with no human annotations, by first utilizing the translation equivariance of CNNs to generate long-range pseudo-labels, and then deriving self-supervision signals from the temporal consistency of object tracks. Our proposed self-training loop is highly effective for teaching an unsupervised detector to self-improve. We validate our results on two real-world datasets, PandaSet and Argoverse 2 Sensor, where our model outperforms prior unsupervised methods by a significant margin. Making self-supervised learning work on real-world robot perception is an exciting challenge for AI, and our work takes a step towards allowing robots to make sense of the visual world without human supervision.

BibTeX

@InProceedings{zhang_cvpr2023_oyster,
    author    = {Zhang, Lunjun and Yang, Anqi Joyce and Xiong, Yuwen and Casas, Sergio and Yang, Bin and Ren, Mengye and Urtasun, Raquel},
    title     = {Towards Unsupervised Object Detection From LiDAR Point Clouds},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {9317-9328}
}