GoRela: Go Relative for Viewpoint-Invariant Motion Forecasting
Method
We present GoRela, a viewpoint-invariant approach to motion forecasting that encodes the interactions between map entities and agents in a viewpoint-agnostic representation by modelling their pairwise relative relationships. This viewpoint invariance provides several key advantages:
- Helps the model generalize by greatly reducing the space of the problem domain
- Makes learning more sample efficient, making training faster while requiring less data (e.g., removing the need for any data augmentation during training in the form of viewpoint translation and rotation)
- Keeps inference highly resource-efficient as the scene encoding (which is the heaviest module) only needs to be processed once
- Enables caching of any computation that depends solely on the static parts of the scene, allowing our model to compute the map embeddings offline and thus save critical processing time on the road (a minimal caching sketch follows this list).
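Because the map embeddings do not depend on the ego viewpoint, they can be cached per map region. Below is a minimal caching sketch, assuming a hypothetical CachedMapEncoder wrapper around a lane-graph encoder; the names are illustrative and not part of the released code.

# Minimal caching sketch (hypothetical names, not the GoRela implementation).
# Lane-graph embeddings depend only on static map geometry expressed in relative
# coordinates, so they can be computed once (even offline) and reused for every
# subsequent forecast in that map region.
from typing import Dict

import torch
import torch.nn as nn

class CachedMapEncoder:
    def __init__(self, map_encoder: nn.Module):
        self.map_encoder = map_encoder            # lane-graph GNN, frozen at inference
        self._cache: Dict[str, torch.Tensor] = {}

    @torch.no_grad()
    def __call__(self, region_id: str, lane_graph) -> torch.Tensor:
        # Encode each map region once; later calls are a dictionary lookup.
        if region_id not in self._cache:
            self._cache[region_id] = self.map_encoder(lane_graph)
        return self._cache[region_id]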
The overall pipeline of our method works as follows (a schematic sketch is shown after the list):
- Each agent's history (i.e., its past trajectory) is encoded into an agent embedding with a 1D CNN + GRU
- The map is represented as a lane graph, and encoded with a graph neural network (GNN)
- A heterogeneous GNN fuses the information from both sources to refine the agent and map embeddings
- A goal decoder predicts a trajectory endpoint heatmap over the lane-graph nodes for each agent
- A set of K goals is sampled greedily to trade off precision and coverage
- Given the sampled goals and agent embeddings, K trajectories are completed as the future hypotheses
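The sketch below shows how these stages compose. It is schematic only: module names (agent_encoder, fusion_gnn, goal_decoder, etc.) are placeholders, and the sub-modules are passed in so the sketch fixes the data flow rather than any particular implementation.

# Schematic composition of the pipeline above (illustrative placeholders; not the
# released GoRela code). Encode agents and map, fuse them, score goals over the
# lane graph, sample K diverse goals, and complete one trajectory per goal.
import torch.nn as nn

class ForecastingPipelineSketch(nn.Module):
    def __init__(self, agent_encoder, map_encoder, fusion_gnn,
                 goal_decoder, goal_sampler, trajectory_completer, num_modes=6):
        super().__init__()
        self.agent_encoder = agent_encoder              # 1D CNN + GRU over past trajectories
        self.map_encoder = map_encoder                  # GNN over the lane graph
        self.fusion_gnn = fusion_gnn                    # heterogeneous GNN over agent + map nodes
        self.goal_decoder = goal_decoder                # endpoint heatmap over lane-graph nodes
        self.goal_sampler = goal_sampler                # greedy sampler (precision vs. coverage)
        self.trajectory_completer = trajectory_completer
        self.num_modes = num_modes                      # K futures per agent

    def forward(self, agent_history, lane_graph, edges):
        agent_emb = self.agent_encoder(agent_history)                 # [A, D]
        map_emb = self.map_encoder(lane_graph)                        # [M, D]
        agent_emb, map_emb = self.fusion_gnn(agent_emb, map_emb, edges)
        goal_heatmap = self.goal_decoder(agent_emb, map_emb, edges)   # [A, M]
        goals = self.goal_sampler(goal_heatmap, lane_graph, k=self.num_modes)   # [A, K, 2]
        return self.trajectory_completer(agent_emb, goals)            # [A, K, T, 2]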
For this method to be viewpoint invariant, we develop a specialized graph neural network that incorporates pairwise relative geometric information as "edge features" during message passing.
The diagram above shows how the heterogeneous message passing is carried out. Here, x denotes the node embeddings (nodes can represent map vertices or agents), j is the node receiving messages from its neighbors in order to update its embedding, and i is one of those neighbors sending a message.
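To make this concrete, here is a minimal sketch of one such message-passing step, assuming node positions and headings are available. The relative pose of sender i is expressed in receiver j's local frame and fed to the message MLP as edge features; names such as pairwise_relative_pose and RelativeMessagePassing are illustrative, not the paper's code.

# Minimal sketch of message passing with pairwise-relative edge features
# (illustrative names; not the paper's implementation). Following the notation
# above: x are node embeddings, j receives messages, i is a sending neighbor.
import torch
import torch.nn as nn

def pairwise_relative_pose(pos, heading, src, dst):
    # Pose of sender i relative to receiver j, expressed in j's local frame.
    dp = pos[src] - pos[dst]                                   # [E, 2] world-frame offset
    c, s = torch.cos(heading[dst]), torch.sin(heading[dst])
    local = torch.stack([c * dp[:, 0] + s * dp[:, 1],
                         -s * dp[:, 0] + c * dp[:, 1]], dim=-1)    # rotate into j's frame
    dtheta = heading[src] - heading[dst]
    return torch.cat([local, torch.cos(dtheta)[:, None], torch.sin(dtheta)[:, None]], dim=-1)

class RelativeMessagePassing(nn.Module):
    def __init__(self, dim, edge_dim=4):
        super().__init__()
        self.message_mlp = nn.Sequential(nn.Linear(2 * dim + edge_dim, dim), nn.ReLU())
        self.update_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x, pos, heading, edge_index):
        src, dst = edge_index                                  # messages flow i -> j
        e = pairwise_relative_pose(pos, heading, src, dst)     # viewpoint-invariant edge features
        m = self.message_mlp(torch.cat([x[src], x[dst], e], dim=-1))
        agg = torch.zeros_like(x).index_add_(0, dst, m)        # sum messages at each receiver j
        return self.update_mlp(torch.cat([x, agg], dim=-1))    # refined node embeddings

Because only relative quantities enter the computation, rotating or translating the whole scene leaves the edge features, and hence the messages, unchanged.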
Results
Our method builds a strong understanding of the environment, allowing it to predict the future trajectories of a target agent as well as multiple interacting agents, taking into account their past trajectories, the map, and surrounding agents.
At the time of submission, GoRela attained the best results among published methods on the Argoverse 2 Motion Forecasting dataset, demonstrating the effectiveness of our method in urban scenes.
Qualitative comparison on the Argoverse 2 Motion Forecasting dataset
Benchmark against the state of the art on the Argoverse 2 Motion Forecasting dataset
The same architecture also works well in highway scenarios, such as those in our internal HighwaySim dataset. Below we show two examples as well as quantitative results showing that it outperforms the state of the art.
Qualitative comparison on our internal HighwaySim dataset
Benchmark against the state of the art on our internal HighwaySim dataset
Moreover, our method is viewpoint invariant! Below we show that GoRela's outputs are not affected by rotations of the scene, whereas those of the other methods are. In particular, we overlay the predictions for two rotations of the input scene in red and blue (which render magenta where they overlap).
Achieving this invariance is important for deployment so that (1) prediction is fast, since all agents can be encoded in parallel, and (2) prediction is robust to different poses of the ego vehicle (which is typically the coordinate frame chosen by methods that encode the scene in a global frame). Beyond this, we show that the invariance also brings gains in sample efficiency, i.e., the number of training samples needed to attain a certain level of performance is lower for our method than for the baselines.
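A simple way to check this property is to rotate the entire scene, forecast, and verify that the outputs are exactly the rotated versions of the original forecasts. The sketch below assumes a hypothetical model that maps agent histories and lane-graph coordinates (both as [..., 2] xy tensors) to future waypoints; all names are placeholders.

# Illustrative invariance check (not part of the released code): a viewpoint-
# invariant model should produce forecasts that simply rotate with the scene.
import torch

def rotate_xy(xy, theta):
    c, s = torch.cos(theta), torch.sin(theta)
    rot = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])   # 2x2 rotation matrix
    return xy @ rot.T

def check_invariance(model, agent_history, lane_graph_xy, theta=torch.tensor(1.0), atol=1e-4):
    pred = model(agent_history, lane_graph_xy)                       # [..., 2] future waypoints
    pred_rot = model(rotate_xy(agent_history, theta), rotate_xy(lane_graph_xy, theta))
    # If the model is viewpoint invariant, rotating the inputs only rotates the outputs.
    return torch.allclose(rotate_xy(pred, theta), pred_rot, atol=atol)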
You may still be wondering which key components make GoRela work, but we have you covered. Our ablation study shows the following:
- Removing the heterogeneous encoder entirely harms performance (although perhaps less than expected)
- Swapping our custom HMP encoder for a HEAT graph layer increases the error almost as much as not having an actor-map encoder at all
- Replacing our proposed goal-based decoder with a simple MLP that regresses trajectories directly from the encoder's actor embeddings causes the largest degradation in performance
- Our proposed greedy goal sampler is better than naive top-k by a large margin (see the sketch after this list)
- Removing the edge features (pairwise relative information from PairPose) from the graph harms performance, showing that viewpoint invariance also helps convergence
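For the goal sampler, here is a hedged sketch of the greedy strategy: repeatedly take the most probable remaining goal, then suppress nearby lane-graph nodes so the next pick covers a different region. The suppression radius and the exact rule are illustrative assumptions, not the paper's sampler; naive top-k, by contrast, tends to place all K goals on neighboring nodes of the single most likely lane.

# Hedged sketch of greedy, diversity-aware goal sampling (illustrative only).
import torch

def greedy_sample_goals(scores, goal_xy, k=6, radius=3.0):
    # scores: [M] goal probabilities over lane-graph nodes; goal_xy: [M, 2] node positions.
    scores = scores.clone()
    picked = []
    for _ in range(k):
        idx = int(torch.argmax(scores))
        picked.append(idx)
        # Suppress goals close to the one just picked to encourage coverage.
        dist = torch.linalg.norm(goal_xy - goal_xy[idx], dim=-1)
        scores[dist < radius] = float("-inf")
    return goal_xy[picked]                          # [k, 2] diverse goal proposals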
Conclusion
We have proposed GoRela, an innovative motion forecasting model that tackles the prediction of multi-agent, multi-modal trajectories on both urban and highway roads. As shown by thorough experiments, our model achieves remarkable performance through several contributions: (i) a viewpoint-invariant architecture that makes learning more efficient thanks to the proposed pairwise relative positional encoding, (ii) a versatile graph neural network that understands interactions in heterogeneous spatial graphs with agent and map nodes related by multi-edge adjacency matrices (e.g., lane-lane, agent-lane, agent-agent), and (iii) a probabilistic goal-based decoder that leverages the lane graph to propose realistic goals, paired with a simple greedy sampler that encourages diversity. We hope our work unlocks more research into efficient architectures for modelling geometric relationships between different entities in a graph.
BibTeX
@article{cui2022gorela,
title={GoRela: Go Relative for Viewpoint-Invariant Motion Forecasting},
author={Cui, Alexander and Casas, Sergio and Wong, Kelvin and Suo, Simon and Urtasun, Raquel},
journal={arXiv preprint arXiv:2211.02545},
year={2022}
}