
GoRela: Go Relative for Viewpoint-Invariant Motion Forecasting
ICRA 2023 (Award Finalist)
Alexander Cui, Sergio Casas, Kelvin Wong, Simon Suo, Raquel Urtasun
Left: input scene in a global coordinate frame that is subjected to rotations. Middle: prediction outputs vary with the viewpoint when the scene is encoded in a global coordinate frame. Right: the outputs of our viewpoint-invariant motion forecasting approach, GoRela, are unaffected by SE(3) transformations of the input.
Helps the model generalize by greatly reducing the space of the problem domain
Makes learning more sample efficient, making training faster while requiring less data (e.g., removing the need for any data augmentation during training in the form of viewpoint translation and rotation)
Keeps inference highly resource-efficient as the scene encoding (which is the heaviest module) only needs to be processed once
Enables caching of any computation that depends solely on the static parts of the scene, allowing our model to compute the map embeddings offline, thus saving critical processing time on the road
Agents' history (i.e., past trajectory) is encoded into an agent embedding with a 1D CNN + GRU (see the sketch after this list)
The map is represented as a lane graph, and encoded with a graph neural network (GNN)
A heterogeneous GNN fuses the information from both sources to refine the agent and map embeddings
A goal decoder predicts a trajectory endpoint heatmap over lane graph nodes for each agent
A set of K goals is sampled greedily to trade off precision and coverage
Given the sampled goals and agent embeddings, K trajectories are completed as the future hypotheses
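To make the first step concrete, below is a minimal PyTorch sketch of a 1D CNN + GRU history encoder. The module name, input features, and channel sizes are our own assumptions rather than GoRela's exact configuration; the inputs are assumed to already be viewpoint-invariant (e.g., per-step displacements in each agent's own frame).

```python
import torch
import torch.nn as nn

class AgentHistoryEncoder(nn.Module):
    """Encodes each agent's past trajectory into a single embedding.

    Hypothetical sketch: a 1D CNN over the time axis followed by a GRU,
    as described above. Channel sizes and input features are assumptions.
    """

    def __init__(self, in_dim: int = 4, hidden_dim: int = 128):
        super().__init__()
        # Temporal convolutions over the history (kernel slides over time).
        self.cnn = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (num_agents, T, in_dim), e.g. per-step displacements
        # expressed in each agent's own frame (viewpoint invariant).
        x = self.cnn(history.transpose(1, 2))  # (num_agents, hidden_dim, T)
        _, h_n = self.gru(x.transpose(1, 2))   # h_n: (1, num_agents, hidden_dim)
        return h_n.squeeze(0)                  # (num_agents, hidden_dim)
```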
The diagram above shows how the heterogeneous message passing is carried out. Here, x are the node embeddings (nodes can represent map vertices and agents), j is the node receiving the messages from its neighbors in order to update its embedding, and i is one of those neighbors sending a message.
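Below is a minimal PyTorch sketch of one such message-passing round. The layer sizes, the max aggregation, and the single shared message network are assumptions for illustration; a heterogeneous version like the one in the paper would use dedicated parameters per edge type (lane-lane, agent-lane, agent-agent). The key property is that each message depends only on the sender embedding and the pair-wise relative edge features, so no global coordinates enter the computation.

```python
import torch
import torch.nn as nn

class RelativeMessagePassing(nn.Module):
    """One message-passing round as in the diagram above.

    Hypothetical sketch: node j aggregates messages from each neighbor i,
    where a message is computed from the sender embedding x_i and the
    pair-wise relative edge features e_ij. Sizes, the max aggregation,
    and the shared (non-heterogeneous) weights are assumptions.
    """

    def __init__(self, dim: int = 128, edge_dim: int = 4):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(dim + edge_dim, dim), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x, edge_index, edge_feat):
        # x: (num_nodes, dim) embeddings of map vertices and agents.
        # edge_index: (2, num_edges), rows are (senders i, receivers j).
        # edge_feat: (num_edges, edge_dim), relative pose of i in j's frame.
        i, j = edge_index
        messages = self.msg(torch.cat([x[i], edge_feat], dim=-1))
        # Max-aggregate incoming messages per receiving node j.
        agg = torch.zeros_like(x).scatter_reduce(
            0, j.unsqueeze(-1).expand_as(messages), messages,
            reduce="amax", include_self=False)
        return self.update(torch.cat([x, agg], dim=-1))
```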
Our method attains a strong understanding of the environment, allowing it to predict the future trajectories of a target agent and multiple interactive agents while taking into account their past trajectories, the map, and other surrounding agents.
At the time of submission, GoRela attained the best results among published methods on the Argoverse 2 Motion dataset, demonstrating the strength of our method in urban scenes.
Qualitative comparison in Argoverse 2 Motion Dataset
Benchmark against state-of-the-art in Argoverse 2 Motion Dataset
The same architecture also performs well in highway scenarios, such as those in our internal HighwaySim dataset. Below we show two examples, as well as quantitative metrics showing that it outperforms the state of the art.
Qualitative comparison in our internal HighwaySim dataset
Benchmark against state-of-the-art in our internal HighwaySim dataset
Moreover, our method is viewpoint invariant! Below we show that GoRela's outputs are not affected by rotations of the scene, while those of the other methods are. In particular, we overlay the predictions for two rotations of the input scene in red and blue (which render magenta when overlapping).
Achieving this invariance is important for deployment: (1) prediction is fast, as all agents can be encoded in parallel, and (2) prediction is robust to different poses of the ego vehicle (typically the coordinate frame chosen by methods that encode the scene in a global frame). Beyond this, we show that the invariance also brings gains in sample efficiency: the number of training samples needed to attain a given level of performance is lower for our method than for the baselines.
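The invariance follows from describing all geometry through pair-wise relative quantities. Below is a minimal sketch of PairPose-style edge features between posed nodes; the function name and the exact feature set are assumptions (the encoding in the paper may include additional terms). Translating or rotating the whole scene leaves these features unchanged.

```python
import torch

def relative_pose_features(pos: torch.Tensor, heading: torch.Tensor,
                           i: torch.Tensor, j: torch.Tensor) -> torch.Tensor:
    """Pair-wise relative positional features for edges i -> j.

    Hypothetical PairPose-style sketch: express sender i's position in
    receiver j's local frame and encode the relative heading with sin/cos.
    Applying a global rotation/translation to (pos, heading) leaves the
    output unchanged, which is where the viewpoint invariance comes from.
    """
    d = pos[i] - pos[j]                              # (E, 2) global offset
    c, s = torch.cos(heading[j]), torch.sin(heading[j])
    # Rotate the offset into j's frame (inverse rotation by j's heading).
    local = torch.stack([c * d[:, 0] + s * d[:, 1],
                         -s * d[:, 0] + c * d[:, 1]], dim=-1)
    rel = heading[i] - heading[j]
    return torch.cat([local, torch.sin(rel)[:, None],
                      torch.cos(rel)[:, None]], dim=-1)  # (E, 4)
```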
Removing the heterogeneous encoder entirely harms performance (although perhaps less than expected)
Swapping our custom HMP encoder for the HEAT graph layer increases the error almost as much as not having an actor-map encoder at all
Replacing our proposed goal-based decoder with a simple MLP that regresses the trajectory from the encoder actor embeddings makes the most difference
Our proposed custom greedy goal sampler is better than naive top-k sampling by a large margin (see the sketch after this list)
Removing the edge features (pair-wise relative information from PairPose) from the graph harms performance, which shows that viewpoint invariance helps not only sample efficiency but also performance at convergence
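For the greedy goal sampler referenced above, here is a minimal sketch of the idea: repeatedly take the most probable remaining goal and suppress goals close to it. The suppression radius and hard zeroing are assumptions; the paper's sampler may differ in its details. Compared with naive top-k, which tends to concentrate every pick on the strongest mode, the suppression step spreads the K goals across distinct modes.

```python
import torch

def greedy_goal_sampling(goal_probs: torch.Tensor,
                         goal_pos: torch.Tensor,
                         k: int = 6, radius: float = 2.0) -> torch.Tensor:
    """Greedily pick k diverse goals from a heatmap over lane-graph nodes.

    Hypothetical sketch: take the highest-probability node, zero out all
    nodes within `radius` meters of it, and repeat. The radius and the
    hard suppression are assumptions; they trade precision (taking the
    argmax) against coverage (spreading picks over distinct modes).
    """
    probs = goal_probs.clone()
    picks = []
    for _ in range(k):
        idx = int(torch.argmax(probs))
        picks.append(idx)
        # Suppress every node near the chosen goal to encourage diversity.
        near = (goal_pos - goal_pos[idx]).norm(dim=-1) < radius
        probs[near] = 0.0
    return torch.tensor(picks)
```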
We have proposed GoRela, an innovative motion forecasting model that tackles the prediction of multi-agent, multi-modal trajectories on both urban and highway roads. As shown by thorough experiments, our model achieves strong performance through several contributions: (i) a viewpoint-invariant architecture that makes learning more efficient thanks to the proposed pair-wise relative positional encoding, (ii) a versatile graph neural network that understands interactions in heterogeneous spatial graphs with agent and map nodes related by multi-edge adjacency matrices (e.g., lane-lane, agent-lane, agent-agent), and (iii) a probabilistic goal-based decoder that leverages the lane graph to propose realistic goals, paired with a simple greedy sampler that encourages diversity. We hope our work unlocks more research on efficient architectures that model geometric relationships between different entities in a graph.
@article{cui2022gorela,
  title={GoRela: Go Relative for Viewpoint-Invariant Motion Forecasting},
  author={Cui, Alexander and Casas, Sergio and Wong, Kelvin and Suo, Simon and Urtasun, Raquel},
  journal={arXiv preprint arXiv:2211.02545},
  year={2022}
}