
GoRela: Go Relative for Viewpoint-Invariant Motion Forecasting
ICRA 2023 (Award Finalist)
Alexander Cui, Sergio Casas, Kelvin Wong, Simon Suo, Raquel Urtasun
Left: input scene in a global coordinate frame that is subjected to rotations. Middle: prediction outputs vary with the viewpoint when the scene is encoded in a global coordinate frame. Right: the outputs of our viewpoint-invariant motion forecasting approach, GoRela, are unaffected by SE(3) transformations of the input.
Helps the model generalize by greatly reducing the space of the problem domain
Makes learning more sample efficient, making training faster while requiring less data (e.g., removing the need for any data augmentation during training in the form of viewpoint translation and rotation)
Keeps inference highly resource-efficient as the scene encoding (which is the heaviest module) only needs to be processed once
Enables caching of any computation that depends solely on the static parts of the scene, allowing our model to compute the map embeddings offline, thus saving critical processing time on the road
Agents' history (i.e., past trajectory) is encoded into an agent embedding with a 1D CNN + GRU (see the sketch after this list)
The map is represented as a lane graph, and encoded with a graph neural network (GNN)
A heterogeneous GNN fuses the information from both sources to refine the agent and map embeddings
A goal decoder predicts a trajectory endpoint heatmap over lane graph nodes for each agent
A set of K goals is sampled greedily to trade off precision and coverage
Given the sampled goals and agent embeddings, K trajectories are completed as the future hypotheses
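To make the first step concrete, below is a minimal PyTorch sketch of a 1D CNN + GRU history encoder. The module name, input features, and channel sizes are our own assumptions rather than GoRela's exact configuration; the inputs are assumed to already be viewpoint-invariant (e.g., per-step displacements in each agent's own frame).

```python
import torch
import torch.nn as nn

class AgentHistoryEncoder(nn.Module):
    """Encodes each agent's past trajectory into a single embedding.

    Hypothetical sketch: a 1D CNN over the time axis followed by a GRU,
    as described above. Channel sizes and input features are assumptions.
    """

    def __init__(self, in_dim: int = 4, hidden_dim: int = 128):
        super().__init__()
        # Temporal convolutions over the history (kernel slides over time).
        self.cnn = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (num_agents, T, in_dim), e.g. per-step displacements
        # expressed in each agent's own frame (viewpoint invariant).
        x = self.cnn(history.transpose(1, 2))  # (num_agents, hidden_dim, T)
        _, h_n = self.gru(x.transpose(1, 2))   # h_n: (1, num_agents, hidden_dim)
        return h_n.squeeze(0)                  # (num_agents, hidden_dim)
```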
The diagram above shows how the heterogeneous message passing is carried out. Here, x are the node embeddings (nodes can represent map vertices and agents), j is the node receiving the messages from its neighbors in order to update its embedding, and i is one of those neighbors sending a message.
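Below is a minimal PyTorch sketch of one such message-passing round. The layer sizes, the max aggregation, and the single shared message network are assumptions for illustration; a heterogeneous version like the one in the paper would use dedicated parameters per edge type (lane-lane, agent-lane, agent-agent). The key property is that each message depends only on the sender embedding and the pair-wise relative edge features, so no global coordinates enter the computation.

```python
import torch
import torch.nn as nn

class RelativeMessagePassing(nn.Module):
    """One message-passing round as in the diagram above.

    Hypothetical sketch: node j aggregates messages from each neighbor i,
    where a message is computed from the sender embedding x_i and the
    pair-wise relative edge features e_ij. Sizes, the max aggregation,
    and the shared (non-heterogeneous) weights are assumptions.
    """

    def __init__(self, dim: int = 128, edge_dim: int = 4):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(dim + edge_dim, dim), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x, edge_index, edge_feat):
        # x: (num_nodes, dim) embeddings of map vertices and agents.
        # edge_index: (2, num_edges), rows are (senders i, receivers j).
        # edge_feat: (num_edges, edge_dim), relative pose of i in j's frame.
        i, j = edge_index
        messages = self.msg(torch.cat([x[i], edge_feat], dim=-1))
        # Max-aggregate incoming messages per receiving node j.
        agg = torch.zeros_like(x).scatter_reduce(
            0, j.unsqueeze(-1).expand_as(messages), messages,
            reduce="amax", include_self=False)
        return self.update(torch.cat([x, agg], dim=-1))
```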
Our method attains a strong understanding of the environment, allowing it to predict the future trajectories of a target agent and multiple interactive agents while taking into account their past trajectories, the map, and other surrounding agents.
At the time of submission, GoRela attained the best results among published methods on the Argoverse 2 Motion dataset, demonstrating the strength of our method in urban scenes.
Qualitative comparison in Argoverse 2 Motion Dataset
Benchmark against state-of-the-art in Argoverse 2 Motion Dataset
The same architecture also performs well in highway scenarios, such as those in our internal HighwaySim dataset. Below we show two examples, as well as quantitative metrics showing that it outperforms the state of the art.
Qualitative comparison in our internal HighwaySim dataset
Benchmark against state-of-the-art in our internal HighwaySim dataset
Moreover, our method is viewpoint invariant! Below we show that GoRela's outputs are not affected by rotations of the scene, while those of the other methods are. In particular, we overlay the predictions for two rotations of the input scene in red and blue (which render magenta when overlapping).
Achieving this invariance is important for deployment: (1) prediction is fast, as all agents can be encoded in parallel, and (2) prediction is robust to different poses of the ego vehicle (typically the coordinate frame chosen by methods that encode the scene in a global frame). Beyond this, we show that the invariance also brings gains in sample efficiency: the number of training samples needed to attain a given level of performance is lower for our method than for the baselines.
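The invariance follows from describing all geometry through pair-wise relative quantities. Below is a minimal sketch of PairPose-style edge features between posed nodes; the function name and the exact feature set are assumptions (the encoding in the paper may include additional terms). Translating or rotating the whole scene leaves these features unchanged.

```python
import torch

def relative_pose_features(pos: torch.Tensor, heading: torch.Tensor,
                           i: torch.Tensor, j: torch.Tensor) -> torch.Tensor:
    """Pair-wise relative positional features for edges i -> j.

    Hypothetical PairPose-style sketch: express sender i's position in
    receiver j's local frame and encode the relative heading with sin/cos.
    Applying a global rotation/translation to (pos, heading) leaves the
    output unchanged, which is where the viewpoint invariance comes from.
    """
    d = pos[i] - pos[j]                              # (E, 2) global offset
    c, s = torch.cos(heading[j]), torch.sin(heading[j])
    # Rotate the offset into j's frame (inverse rotation by j's heading).
    local = torch.stack([c * d[:, 0] + s * d[:, 1],
                         -s * d[:, 0] + c * d[:, 1]], dim=-1)
    rel = heading[i] - heading[j]
    return torch.cat([local, torch.sin(rel)[:, None],
                      torch.cos(rel)[:, None]], dim=-1)  # (E, 4)
```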
Removing the heterogeneous encoder entirely harms performance (although perhaps less than expected)
Swapping our custom HMP encoder for the HEAT graph layer increases the error almost as much as not having an actor-map encoder at all
Replacing our proposed goal-based decoder with a simple MLP that regresses the trajectory from the encoder actor embeddings makes the most difference
Our proposed custom greedy goal sampler is better than naive top-k sampling by a large margin (see the sketch after this list)
Removing the edge features (pair-wise relative information from PairPose) from the graph harms performance, which shows that viewpoint invariance helps not only sample efficiency but also performance at convergence
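For the greedy goal sampler referenced above, here is a minimal sketch of the idea: repeatedly take the most probable remaining goal and suppress goals close to it. The suppression radius and hard zeroing are assumptions; the paper's sampler may differ in its details. Compared with naive top-k, which tends to concentrate every pick on the strongest mode, the suppression step spreads the K goals across distinct modes.

```python
import torch

def greedy_goal_sampling(goal_probs: torch.Tensor,
                         goal_pos: torch.Tensor,
                         k: int = 6, radius: float = 2.0) -> torch.Tensor:
    """Greedily pick k diverse goals from a heatmap over lane-graph nodes.

    Hypothetical sketch: take the highest-probability node, zero out all
    nodes within `radius` meters of it, and repeat. The radius and the
    hard suppression are assumptions; they trade precision (taking the
    argmax) against coverage (spreading picks over distinct modes).
    """
    probs = goal_probs.clone()
    picks = []
    for _ in range(k):
        idx = int(torch.argmax(probs))
        picks.append(idx)
        # Suppress every node near the chosen goal to encourage diversity.
        near = (goal_pos - goal_pos[idx]).norm(dim=-1) < radius
        probs[near] = 0.0
    return torch.tensor(picks)
```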
We have proposed GoRela, an innovative motion forecasting model that tackles the prediction of multi-agent, multi-modal trajectories on both urban and highway roads. As shown by thorough experiments, our model achieves strong performance through several contributions: (i) a viewpoint-invariant architecture that makes learning more efficient thanks to the proposed pair-wise relative positional encoding, (ii) a versatile graph neural network that understands interactions in heterogeneous spatial graphs with agent and map nodes related by multi-edge adjacency matrices (e.g., lane-lane, agent-lane, agent-agent), and (iii) a probabilistic goal-based decoder that leverages the lane graph to propose realistic goals, paired with a simple greedy sampler that encourages diversity. We hope our work unlocks more research on efficient architectures that model geometric relationships between different entities in a graph.
@article{cui2022gorela,
  title={GoRela: Go Relative for Viewpoint-Invariant Motion Forecasting},
  author={Cui, Alexander and Casas, Sergio and Wong, Kelvin and Suo, Simon and Urtasun, Raquel},
  journal={arXiv preprint arXiv:2211.02545},
  year={2022}
}