ImplicitO: Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving
Abstract
Overview
Motivation
Traditional object-based autonomy systems detect objects in the scene and then predict the trajectories of those objects. A motion planner then uses those detections and predictions to decide on a plan for the self-driving vehicle. This paradigm poses a safety concern because the number of detections is limited for efficiency, and the thresholding required to produce detections limits uncertainty propagation.
Alternatively, object-free methods output occupancy and flow for the whole scene over future time in the form of 3-dimensional dense grids (bird’s-eye view + time). Occupancy is the probability that a traffic participant overlaps with a spatio-temporal grid cell. Flow represents the velocity at that grid cell if it were occupied. These methods are computationally inefficient due to their high-dimensional grids and inaccurate due to the limited receptive field inherent to fully convolutional networks. Furthermore, both approaches waste computation on objects or regions that may be irrelevant to the downstream motion planner.
Method
We present a unified perception and prediction approach, ImplicitO, which implicitly represents occupancy and flow with a single neural network. ImplicitO can be queried at any continuous point in space and future time for occupancy and flow. The continuous occupancy and flow have a similar meaning to their discrete counterparts, but they reflect the properties of a continuous spatio-temporal point. Occupancy is the probability that a traffic participant occupies that point in space at that future time. Flow is a 2D bird’s-eye view (BEV) vector representing the velocity at the query point if it were occupied.
ImplicitO avoids unnecessary computation as it can be directly queried by a motion planner only at points relevant to candidate plans.
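To make the query-driven interface concrete, here is a minimal sketch of how a planner might evaluate the model only at the spatio-temporal points its candidate plans pass through. Everything named here is an assumption for illustration: `collect_queries`, `score_plans`, and the stand-in `toy_occupancy` are hypothetical, and scoring a plan by its maximum occupancy is a simple placeholder cost, not the planner described in the paper.

```python
from typing import Callable, List, Tuple

# A query is a continuous point (x [m], y [m], t [s]) in BEV space and future time.
Query = Tuple[float, float, float]


def collect_queries(plans: List[List[Query]]) -> List[Query]:
    """Deduplicate the waypoints of all candidate plans, preserving order."""
    seen = {}
    for plan in plans:
        for q in plan:
            seen.setdefault(q, None)
    return list(seen)


def score_plans(
    plans: List[List[Query]],
    occupancy_fn: Callable[[List[Query]], List[float]],
) -> List[float]:
    """Score each plan by the highest occupancy probability along it (assumed cost)."""
    queries = collect_queries(plans)
    # One batched call, only at the points the plans actually visit.
    occupancy = dict(zip(queries, occupancy_fn(queries)))
    return [max(occupancy[q] for q in plan) for plan in plans]


def toy_occupancy(queries: List[Query]) -> List[float]:
    """Hypothetical stand-in for the model: a static obstacle around (10, 0)."""
    return [
        1.0 if abs(x - 10.0) < 2.0 and abs(y) < 1.0 else 0.0
        for x, y, _t in queries
    ]


# Two candidate plans sampled at 1 s intervals: drive straight, or shift 2 m laterally.
straight = [(5.0 * t, 0.0, float(t)) for t in range(4)]
swerve = [(5.0 * t, 2.0 if t >= 1 else 0.0, float(t)) for t in range(4)]

risks = score_plans([straight, swerve], toy_occupancy)
print(risks)  # the straight plan passes through the obstacle, the swerve does not
```

In this toy setup the two plans share their starting waypoint, so only seven points are evaluated rather than a full dense grid over the scene.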
Consider a self-driving vehicle observing a car travelling at 20 m/s. Five seconds into the future, this car will be 100 m from where it was observed if it maintains its speed. This motivates the architecture of ImplicitO, which uses local features at the query point to predict where to “look next” in the scene for information relevant to occupancy and flow prediction, for example, back to where the LiDAR evidence is. This improves the accuracy of ImplicitO by increasing its effective receptive field compared to fully convolutional networks.
The architecture used to implement this intuition is depicted below.
The convolutional encoder generates a feature map Z from map and LiDAR input. The decoder runs in parallel across queries. For each query, the decoder:
- interpolates Z at the query location,
- uses that interpolated feature vector to predict K attention offsets to other locations in the feature map,
- interpolates Z at the offset locations for more feature vectors,
- performs cross attention across all interpolated features to generate a final feature vector z, and
- predicts occupancy and flow at each query point, using z.
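The decoder steps above can be sketched in plain Python as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the weights are random rather than learned, single linear layers stand in for whatever networks the paper uses, the attention uses the raw interpolated features as keys and values, query coordinates are expressed in feature-map cells, and the names `bilinear`, `linear`, `random_params`, and `decode` are hypothetical.

```python
import math
import random


def bilinear(Z, x, y):
    """Bilinearly interpolate feature map Z (H x W x C nested lists) at (x, y)."""
    H, W = len(Z), len(Z[0])
    x = min(max(x, 0.0), W - 1.0)  # clamp to the map so offsets cannot leave it
    y = min(max(y, 0.0), H - 1.0)
    x0, y0 = min(int(x), W - 2), min(int(y), H - 2)
    dx, dy = x - x0, y - y0
    return [
        (1 - dx) * (1 - dy) * Z[y0][x0][c]
        + dx * (1 - dy) * Z[y0][x0 + 1][c]
        + (1 - dx) * dy * Z[y0 + 1][x0][c]
        + dx * dy * Z[y0 + 1][x0 + 1][c]
        for c in range(len(Z[0][0]))
    ]


def linear(weights, v):
    """Matrix-vector product; stands in for a learned layer."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def random_params(C, K, seed=0):
    """Random (untrained) weights for the offset predictor and the output head."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"offset": mat(2 * K, C + 1), "head": mat(3, C + 1)}


def decode(Z, query, params, K):
    x, y, t = query
    z_q = bilinear(Z, x, y)                                # 1. feature at the query
    raw = linear(params["offset"], z_q + [t])              # 2. K attention offsets
    offsets = [(raw[2 * k], raw[2 * k + 1]) for k in range(K)]
    feats = [z_q] + [bilinear(Z, x + ox, y + oy) for ox, oy in offsets]  # 3.
    # 4. Cross attention: z_q attends over all interpolated features.
    scale = math.sqrt(len(z_q))
    scores = [sum(a * b for a, b in zip(z_q, f)) / scale for f in feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]             # stable softmax
    attn = [e / sum(exps) for e in exps]
    z = [sum(a * f[c] for a, f in zip(attn, feats)) for c in range(len(z_q))]
    # 5. Occupancy probability (sigmoid) and 2D BEV flow from the final feature.
    out = linear(params["head"], z + [t])
    occupancy = 1.0 / (1.0 + math.exp(-out[0]))
    return occupancy, (out[1], out[2]), offsets


rng = random.Random(1)
H, W, C, K = 8, 8, 4, 3
Z = [[[rng.uniform(-1, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]
params = random_params(C, K)
occupancy, flow, offsets = decode(Z, (3.5, 2.25, 1.0), params, K)
```

Because each call to `decode` depends only on Z and its own query, the queries are independent of one another, which is what allows the decoder to run in parallel across them.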
For more details, the interested reader can refer to the paper.
Results
ImplicitO can accurately predict occupancy in complex urban environments, visualized below.
Occupancy-flow outputs of ImplicitO. The alpha value represents occupancy, the hue represents the flow direction, and the saturation represents the flow magnitude.
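The colour encoding in this caption can be sketched with Python's standard `colorsys` module. This is only an assumed rendering scheme that matches the caption's description: the function name `flow_to_rgba`, the normalizing constant `max_speed`, and the fixed brightness of 1.0 are illustrative choices, not details taken from the paper.

```python
import colorsys
import math


def flow_to_rgba(occupancy, flow_x, flow_y, max_speed=20.0):
    """Map one occupancy-flow prediction to an (r, g, b, alpha) colour.

    hue        <- flow direction (angle of the BEV flow vector)
    saturation <- flow magnitude, normalized by an assumed max_speed in m/s
    alpha      <- occupancy probability
    """
    angle = math.atan2(flow_y, flow_x) % (2 * math.pi)
    hue = angle / (2 * math.pi)
    saturation = min(math.hypot(flow_x, flow_y) / max_speed, 1.0)
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return (r, g, b, occupancy)


moving = flow_to_rgba(0.8, 10.0, 0.0)   # likely occupied, moving along +x at 10 m/s
empty = flow_to_rgba(0.0, 0.0, 0.0)     # unoccupied: fully transparent
print(moving)  # → (1.0, 0.5, 0.5, 0.8)
```

Under this scheme, a stationary object renders as an opaque but unsaturated (white) region, while unoccupied space is transparent regardless of its predicted flow.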
Compared against state-of-the-art baselines, ImplicitO produces more accurate occupancy predictions, avoiding common pitfalls such as occupancy hallucination, fading occupancy over the prediction horizon, occupancy predictions that are inconsistent with the map or other actors, and missed detections.
Quantitatively, ImplicitO outperforms all state-of-the-art baselines in occupancy prediction accuracy, flow prediction accuracy, and calibration, in both urban and highway scenarios.
How does ImplicitO produce such accurate predictions? ImplicitO’s global attention mechanism is crucial for increasing the effective receptive field for occupancy prediction. Below we visualize ImplicitO’s attention offsets (right) alongside occupancy-flow predictions (left). We see that the model has learned to look back to the LiDAR evidence to predict occupancy and flow.
For a realistic number of query points, ImplicitO has a lower inference time than other object-free models because it only predicts occupancy and flow where it is queried.
For a more detailed comparison against baselines, below we visualize the occupancy at each time step in the prediction horizon for state-of-the-art methods and ImplicitO. ImplicitO exhibits realistic multi-modal predictions while avoiding various failure modes of the baselines.
ImplicitO produces the accurate predictions necessary for safe autonomous driving; below we visualize the predictions of ImplicitO at the present time (Δt = 0) unrolled across sequences of urban driving.
BibTeX
@inproceedings{agro2023implicito,
title = {Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving},
author = {Agro, Ben and Sykora, Quin and Casas, Sergio and Urtasun, Raquel},
booktitle = {CVPR},
year = {2023},
}