ImplicitO: Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving

Ben Agro*, Quinlan Sykora*, Sergio Casas*, Raquel Urtasun

* denotes equal contribution

Conference: CVPR 2023 (Highlight)

Categories: Autonomy, Perception, Motion Forecasting, Traffic Modelling

Abstract

A self-driving vehicle (SDV) must be able to perceive its surroundings and predict the future behavior of other traffic participants. Existing works either perform object detection followed by trajectory forecasting of the detected objects, or predict dense occupancy and flow grids for the whole scene. The former poses a safety concern as the number of detections needs to be kept low for efficiency reasons, sacrificing object recall. The latter is computationally expensive due to the high-dimensionality of the output grid, and suffers from the limited receptive field inherent to fully convolutional networks. Furthermore, both approaches employ many computational resources predicting areas or objects that might never be queried by the motion planner. This motivates our unified approach to perception and future prediction that implicitly represents occupancy and flow over time with a single neural network. Our method avoids unnecessary computation, as it can be directly queried by the motion planner at continuous spatio-temporal locations. Moreover, we design an architecture that overcomes the limited receptive field of previous explicit occupancy prediction methods by adding an efficient yet effective global attention mechanism. Through extensive experiments in both urban and highway settings, we demonstrate that our implicit model outperforms the current state-of-the-art.

Overview

Explicit approaches predict whole scene occupancy and flow on a spatio-temporal grid. Our implicit approach only predicts occupancy and flow at queried continuous points, focusing on what matters for downstream planning.

Motivation

Traditional object-based autonomy systems detect objects in the scene and then predict the trajectories of those objects. A motion planner then uses those detections and predictions to decide on a plan for the self-driving vehicle. This paradigm poses a safety concern because the number of detections is limited for efficiency, and the thresholding required to produce detections limits uncertainty propagation.

Alternatively, object-free methods output occupancy and flow for the whole scene over future time in the form of 3-dimensional dense grids (bird’s-eye view + time). Occupancy is the probability that a traffic participant overlaps with a spatio-temporal grid cell. Flow represents the velocity at that grid cell if it were occupied. These methods are computationally inefficient due to their high-dimensional grids, and inaccurate due to the limited receptive field inherent to fully convolutional networks. Furthermore, both approaches waste computation on objects or regions that may be irrelevant to the downstream motion planner.

Method

We present a unified perception and prediction approach, ImplicitO, which implicitly represents occupancy and flow with a single neural network. ImplicitO can be queried at any continuous point in space and future time for occupancy and flow. The continuous occupancy and flow have a similar meaning to their discrete counterparts, but they reflect the properties of a continuous spatio-temporal point. Occupancy is the probability that a traffic participant occupies that point in space at that future time. Flow is a 2D bird’s-eye view (BEV) vector representing the velocity at the query point if it were occupied.

ImplicitO avoids unnecessary computation as it can be directly queried by a motion planner only at points relevant to candidate plans.
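As a rough illustration of querying only where a planner needs answers, the Python sketch below scores two candidate plans by the highest occupancy found along each. Everything in it is an assumption made for illustration: `toy_occupancy_flow` is a hand-written stand-in for a trained model (a single actor moving at constant velocity), and `plan_to_queries` and `max_occupancy_along_plan` are hypothetical helper names, not part of ImplicitO.

```python
import numpy as np


def toy_occupancy_flow(queries):
    """Hypothetical stand-in for a learned implicit occupancy-flow model.

    queries: (N, 3) array of continuous (x, y, t) points in metres and seconds.
    Returns occupancy in [0, 1] with shape (N,) and BEV flow with shape (N, 2).
    """
    velocity = np.array([20.0, 0.0])   # one toy actor moving along +x at 20 m/s
    start = np.array([0.0, 0.0])       # actor position at t = 0
    centres = start + queries[:, 2:3] * velocity        # actor position at each query time
    dist2 = np.sum((queries[:, :2] - centres) ** 2, axis=1)
    occupancy = np.exp(-dist2 / (2.0 * 2.0 ** 2))       # soft blob, 2 m standard deviation
    flow = np.broadcast_to(velocity, (len(queries), 2)).copy()
    return occupancy, flow


def plan_to_queries(waypoints, times):
    """Stack (x, y) waypoints and their timestamps into (x, y, t) query points."""
    return np.concatenate([waypoints, times[:, None]], axis=1)


def max_occupancy_along_plan(waypoints, times, model=toy_occupancy_flow):
    """Query the model only at the plan's own points and return the worst case."""
    occupancy, _ = model(plan_to_queries(waypoints, times))
    return float(occupancy.max())


times = np.linspace(0.0, 5.0, 11)
# Plan that stays on top of the toy actor for the whole horizon.
plan_in_lane = np.stack([20.0 * times, np.zeros_like(times)], axis=1)
# Plan that keeps a 10 m lateral gap to the toy actor.
plan_adjacent = np.stack([20.0 * times, np.full_like(times, 10.0)], axis=1)

risk_in_lane = max_occupancy_along_plan(plan_in_lane, times)
risk_adjacent = max_occupancy_along_plan(plan_adjacent, times)
```

Each plan here costs 11 model queries regardless of the scene size, which is the property the implicit formulation is designed to exploit. A real planner would likely query the footprint of the vehicle rather than a single centre point per time step.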

Consider a self-driving vehicle observing a car travelling at 20 m/s. Five seconds into the future, this car will be 100 m from where it was observed if it maintains its speed, so the evidence for a future query point can lie far from the point itself. This motivates the architecture of ImplicitO, which uses local features at the query point to predict where to “look next” in the scene for information relevant to occupancy and flow prediction, for example, back to where the LiDAR evidence is. This improves the accuracy of ImplicitO by increasing its effective receptive field compared to fully convolutional networks.

The architecture used to implement this intuition is depicted below.

The convolutional encoder generates a feature map Z from map and LiDAR input. The decoder runs in parallel across queries. For each query, the decoder:

1. interpolates Z at the query location,
2. uses that interpolated feature vector to predict K attention offsets to other locations in the feature map,
3. interpolates Z at the offset locations for more feature vectors,
4. performs cross attention across all interpolated features to generate a final feature vector z, and
5. predicts occupancy and flow at each query point, using z.
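The decoder steps above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the paper's implementation: the weights are random rather than trained, each learned sub-network is reduced to a single linear layer, the query time t is simply concatenated to the features, the attention query is taken from the local feature, and all names (`bilinear_interpolate`, `decode`, `W_off`, `W_q`, `W_k`, `W_v`, `W_out`) and sizes are hypothetical.

```python
import numpy as np


def bilinear_interpolate(Z, points):
    """Sample feature map Z (H, W, C) at continuous (x, y) pixel coordinates (N, 2)."""
    H, W, _ = Z.shape
    x = np.clip(points[:, 0], 0.0, W - 1.0)
    y = np.clip(points[:, 1], 0.0, H - 1.0)
    x0 = np.minimum(np.floor(x).astype(int), W - 2)
    y0 = np.minimum(np.floor(y).astype(int), H - 2)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = (1 - wx) * Z[y0, x0] + wx * Z[y0, x0 + 1]
    bottom = (1 - wx) * Z[y0 + 1, x0] + wx * Z[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bottom


def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def decode(Z, queries, params, K):
    """Run the five decoder steps for a batch of (x, y, t) queries in parallel."""
    xy, t = queries[:, :2], queries[:, 2:3]
    N = len(queries)
    # 1. Interpolate Z at the query location.
    z_q = bilinear_interpolate(Z, xy)                                  # (N, C)
    # 2. Predict K attention offsets from the local feature (plus time, an assumption).
    offsets = (np.concatenate([z_q, t], axis=1) @ params["W_off"]).reshape(N, K, 2)
    # 3. Interpolate Z at the offset locations.
    targets = (xy[:, None, :] + offsets).reshape(N * K, 2)
    z_off = bilinear_interpolate(Z, targets).reshape(N, K, -1)         # (N, K, C)
    # 4. Cross attention over all interpolated features to get the final feature z.
    feats = np.concatenate([z_q[:, None, :], z_off], axis=1)           # (N, K + 1, C)
    q = z_q @ params["W_q"]                                            # (N, D)
    k = feats @ params["W_k"]                                          # (N, K + 1, D)
    v = feats @ params["W_v"]                                          # (N, K + 1, D)
    attn = softmax(np.einsum("nd,nkd->nk", q, k) / np.sqrt(q.shape[1]), axis=1)
    z = np.einsum("nk,nkd->nd", attn, v)                               # (N, D)
    # 5. Predict occupancy (sigmoid) and 2D flow from z.
    out = np.concatenate([z, t], axis=1) @ params["W_out"]             # (N, 3)
    occupancy = 1.0 / (1.0 + np.exp(-out[:, 0]))
    flow = out[:, 1:3]
    return occupancy, flow, offsets


rng = np.random.default_rng(0)
H, W, C, D, K = 16, 16, 8, 8, 4          # hypothetical sizes
Z = rng.normal(size=(H, W, C))           # stand-in for the encoder output
params = {
    "W_off": 0.1 * rng.normal(size=(C + 1, 2 * K)),
    "W_q": 0.1 * rng.normal(size=(C, D)),
    "W_k": 0.1 * rng.normal(size=(C, D)),
    "W_v": 0.1 * rng.normal(size=(C, D)),
    "W_out": 0.1 * rng.normal(size=(D + 1, 3)),
}
queries = np.array([[3.5, 4.25, 0.0], [10.0, 2.0, 2.5]])   # (x, y, t) per row
occupancy, flow, offsets = decode(Z, queries, params, K)
```

Because every step is a per-query gather followed by small matrix products, the cost of decoding grows with the number of queries rather than with the size of the scene.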

For more details, the interested reader can refer to the paper.

Results

ImplicitO can accurately predict occupancy in complex urban environments, visualized below.

Occupancy-flow outputs of ImplicitO. The alpha value represents occupancy, the hue represents the flow direction, and the saturation represents the flow magnitude.
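The colour encoding in this caption can be reproduced with Python's standard `colorsys` module. The sketch below is one plausible mapping, not the one used to render the figures: the function name `occupancy_flow_to_rgba`, the normalising speed `max_speed`, and the choice of full brightness are assumptions.

```python
import colorsys
import math


def occupancy_flow_to_rgba(occupancy, flow_x, flow_y, max_speed=30.0):
    """Map one occupancy-flow prediction to an (r, g, b, a) tuple in [0, 1].

    Hue encodes flow direction, saturation encodes flow magnitude
    (clipped at the hypothetical max_speed, in m/s), alpha encodes occupancy.
    """
    angle = math.atan2(flow_y, flow_x) % (2.0 * math.pi)   # direction in [0, 2*pi]
    hue = angle / (2.0 * math.pi)
    saturation = min(math.hypot(flow_x, flow_y) / max_speed, 1.0)
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return (r, g, b, occupancy)


fast_actor = occupancy_flow_to_rgba(0.8, 30.0, 0.0)   # saturated colour, mostly opaque
parked_actor = occupancy_flow_to_rgba(0.9, 0.0, 0.0)  # zero flow gives an unsaturated colour
```

With this mapping, stationary occupied points appear white, and faster points take on a more vivid colour that depends on their heading.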

Compared against state-of-the-art baselines, ImplicitO produces more accurate occupancy predictions, avoiding common pitfalls such as occupancy hallucination, fading occupancy over the prediction horizon, occupancy predictions that are inconsistent with the map or other actors, and missed detections.

Quantitatively, ImplicitO outperforms all state-of-the-art baselines in occupancy prediction accuracy, flow prediction accuracy, and calibration, in both urban and highway scenarios.

How does ImplicitO produce such accurate predictions? ImplicitO’s global attention mechanism is crucial for increasing the effective receptive field for occupancy prediction. Below we visualize ImplicitO’s attention offsets (right) alongside occupancy-flow predictions (left). We see that the model has learned to look back to the LiDAR evidence to predict occupancy and flow.

Left: Occupancy-flow predictions. Right: Attention offsets, where the hue is the direction and the saturation is the magnitude.

For a realistic number of query points, ImplicitO has a lower inference time than other object-free models because it only predicts occupancy and flow where it is queried.

For a more detailed comparison against baselines, below we visualize the occupancy at each time step in the prediction horizon for state-of-the-art methods and ImplicitO. ImplicitO exhibits realistic multi-modal predictions while avoiding various failure modes of the baselines.

ImplicitO produces the accurate predictions necessary for safe autonomous driving; below we visualize the predictions of ImplicitO at the present time (Δt = 0) unrolled across sequences of urban driving.

BibTeX

@inproceedings{agro2023implicito,
    title     = {Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving},
    author    = {Agro, Ben and Sykora, Quin and Casas, Sergio and Urtasun, Raquel},
    booktitle = {CVPR},
    year      = {2023},
}
