Oyster: Towards Unsupervised Object Detection from LiDAR Point Clouds
Method
OYSTER stands for Object Discovery via Spatio-Temporal Refinement. Training proceeds in two phases: an initial bootstrapping phase and a self-improvement phase.
The initial bootstrapping phase exploits the fact that near-range point clouds tend to be dense and exhibit clear object clusters, so reasonable near-range seed pseudo-labels can be obtained via point clustering. Thanks to the translation equivariance of convolutional networks, we find that a CNN detector trained on these near-range labels generalizes zero-shot to longer ranges, helped by data augmentations such as ray dropping, which randomly sparsifies the input.
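To make the bootstrapping step concrete, here is a minimal Python sketch, not the exact pipeline from the paper: ground removal is assumed to have already happened, scikit-learn's DBSCAN stands in for the clustering algorithm, the fitted boxes are axis-aligned rather than oriented, and the per-point drop is a crude stand-in for ray dropping; all parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_near_range_boxes(points, max_range=30.0, eps=0.7, min_samples=10):
    """Cluster dense near-range points into axis-aligned BEV seed boxes.

    points: (N, 3) array of non-ground LiDAR points (ground removal is
    assumed to have been done already). Returns (cx, cy, length, width)
    tuples in the BEV plane.
    """
    # Keep only points within the dense near range around the ego vehicle.
    near = points[np.linalg.norm(points[:, :2], axis=1) < max_range]

    # Density-based clustering in the BEV (x, y) plane; label -1 marks noise.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(near[:, :2])

    boxes = []
    for cluster_id in set(labels) - {-1}:
        pts = near[labels == cluster_id, :2]
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        cx, cy = (lo + hi) / 2.0
        boxes.append((cx, cy, hi[0] - lo[0], hi[1] - lo[1]))
    return boxes

def random_point_drop(points, keep_prob=0.7, rng=None):
    """Randomly sparsify a point cloud so near-range training examples
    resemble the sparsity of long-range returns (a crude per-point
    approximation of ray dropping, which removes entire LiDAR rays)."""
    rng = rng or np.random.default_rng()
    return points[rng.random(len(points)) < keep_prob]
```

In a fuller implementation one would fit oriented boxes (e.g., via PCA on each cluster's BEV footprint) and drop whole rays rather than individual points, but the sketch captures the idea: cluster where the data is dense, then sparsify inputs so the detector learns to handle long-range density.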
The self-improvement phase uses the temporal consistency of object tracks as a self-supervision signal. We propose a Track, Refine, Retrain, Repeat framework (a minimal sketch of the tracking and refinement steps follows the list):
- Given noisy detections across time, we employ an unsupervised offline tracker to find object tracks of various lengths;
- We discard short tracks and refine long tracks to prepare pseudo-labels for the next round of self-training;
- An object track should have the same physical object size across time, so our refinement process uses track-level information to update pseudo-labels in long tracks;
- We train a new detector on the updated pseudo-labels, dump its outputs as new pseudo-labels, track, refine, retrain, and repeat.
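Below is a hypothetical Python sketch of the track-and-refine steps. The greedy nearest-centroid linker is a stand-in for the paper's unsupervised offline tracker, and the refinement only snaps box sizes to the track-level median; min_len and max_dist are illustrative thresholds, not values from the paper.

```python
import numpy as np

def greedy_track(detections_per_frame, max_dist=2.0):
    """Link per-frame BEV detections into tracks by nearest-centroid matching.

    detections_per_frame: list over frames; each frame is a list of dicts
    with keys 'center' (np.array of shape (2,)) and 'size' (length, width).
    Returns a list of tracks, each a list of (frame_idx, detection) pairs.
    """
    tracks = []
    active = []  # pairs of (track_index, last_seen_center)
    for t, dets in enumerate(detections_per_frame):
        unmatched = list(range(len(dets)))
        next_active = []
        for track_idx, last_center in active:
            if not unmatched:
                break  # no detections left; remaining active tracks end
            dists = [np.linalg.norm(dets[j]['center'] - last_center)
                     for j in unmatched]
            best = int(np.argmin(dists))
            if dists[best] < max_dist:
                j = unmatched.pop(best)
                tracks[track_idx].append((t, dets[j]))
                next_active.append((track_idx, dets[j]['center']))
        for j in unmatched:  # unmatched detections start new tracks
            tracks.append([(t, dets[j])])
            next_active.append((len(tracks) - 1, dets[j]['center']))
        active = next_active
    return tracks

def refine_tracks(tracks, min_len=5):
    """Discard short tracks; snap every box in a long track to the
    track-level median size, since a rigid object's physical size
    should be constant across time."""
    refined = []
    for track in tracks:
        if len(track) < min_len:
            continue
        sizes = np.array([det['size'] for _, det in track])
        median_size = np.median(sizes, axis=0)
        for _, det in track:
            det['size'] = tuple(median_size)
        refined.append(track)
    return refined
```

Feeding the refined tracks back in as pseudo-labels and retraining the detector closes one iteration of the Track, Refine, Retrain, Repeat loop.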
Comparison against point clustering
Below are the Bird's-Eye-View (BEV) outputs of our unsupervised object detector (OYSTER), trained without any labels. We also visualize the point clustering outputs used as initial bootstrapping to kickstart our self-training process. Note that both the model outputs and the ground-truth labels are class-agnostic bounding boxes.
Comparison against state-of-the-art
Below, we compare our detector against MODEST, the previous state-of-the-art unsupervised detection method:
Evolution of Self-Training Pseudo-Labels
We visualize the evolution of our self-training pseudo-labels in Bird's-Eye-View (BEV), from the initial point clustering step through the later self-training iterations.
As the visualization above shows, our iterative self-training starts from very noisy pseudo-labels but manages to remove false positives, recover missed detections, and improve the quality of the unsupervised bounding boxes.
Conclusion
We have proposed a novel method, OYSTER, for unsupervised object detection from LiDAR point clouds. Using weak object priors (near-range point clustering) as a bootstrapping step, our method can train an object detector with no human annotations, by first utilizing the translation equivariance of CNNs to generate long-range pseudo-labels, and then deriving self-supervision signals from the temporal consistency of object tracks. Our proposed self-training loop is highly effective for teaching an unsupervised detector to self-improve. We validate our results on two real-world datasets, Pandaset and Argoverse 2 Sensor, where our model outperforms prior unsupervised methods by a significant margin. Making self-supervised learning work on real-world robot perception is an exciting challenge for AI, and our work takes a step towards allowing robots to make sense of the visual world without human supervision.
BibTeX
@InProceedings{zhang_cvpr2023_oyster,
    author    = {Zhang, Lunjun and Yang, Anqi Joyce and Xiong, Yuwen and Casas, Sergio and Yang, Bin and Ren, Mengye and Urtasun, Raquel},
    title     = {Towards Unsupervised Object Detection From LiDAR Point Clouds},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {9317--9328}
}