
Learning to Drive via Asymmetric Self-Play
ECCV 2024
Chris Zhang, Sourav Biswas, Kelvin Wong, Kion Fallah, Lunjun Zhang, Dian Chen, Sergio Casas, Raquel Urtasun
Motivation

Large-scale data is crucial for learning realistic and capable driving policies. However, most driving data collected in the real world is repetitive and uninteresting, and curating or upsampling interesting cases is limited to the existing set of logs. Unlike internet data, which is plentiful and cheap, naively scaling data collection to encounter more long-tail scenarios is far more expensive and can be unsafe. In this work we ask: how can we continue to scale training data without relying solely on real-world collection?

Challenges

One existing approach is to let policies explore novel states using algorithms like reinforcement learning in simulation. However, because the other actors typically drive normally, the resulting scenarios can still be uninteresting. Another approach is to hand-design challenging synthetic scenarios, but this is difficult to scale and may lack realism. Adversarial optimization can automatically generate challenging scenarios, but it is unclear how to ensure these are actually useful for training.
We design an asymmetric self-play mechanism where challenging, solvable and realistic scenarios automatically emerge from interactions between a teacher and student.
We sample an initial scene and designate adversarial actors at random. The teacher is rewarded by having adversarial actors collide with student-controlled actors but not teacher-controlled actors. The student is rewarded for avoiding all collisions. Adversarial actions are replayed to keep the scenario the same. All policies are regularized with real data to encourage more realistic scenarios.
Both policies interact and improve together, learning to generate and solve increasingly challenging scenarios.
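To make the reward design concrete, below is a minimal Python sketch of the per-step reward assignment described above. The Actor and step_rewards names and the +/-1 reward magnitudes are our assumptions for illustration; the real-data regularization of both policies and the replaying of adversarial actions are omitted here.

# Hypothetical sketch of the asymmetric per-step rewards; names and
# reward magnitudes are illustrative assumptions, not the paper's values.
from dataclasses import dataclass

@dataclass(frozen=True)
class Actor:
    id: int
    controller: str    # "teacher" or "student"
    adversarial: bool  # only teacher-controlled actors are adversarial

def step_rewards(actors, colliding_pairs):
    """Given the set of colliding actor-id pairs at one step, reward the
    teacher when an adversarial actor hits a student-controlled actor,
    penalize it for teacher-teacher collisions, and penalize the student
    for any collision involving its actors."""
    by_id = {a.id: a for a in actors}
    teacher_r, student_r = 0.0, 0.0
    for i, j in colliding_pairs:
        a, b = by_id[i], by_id[j]
        involves_student = "student" in (a.controller, b.controller)
        involves_adversary = a.adversarial or b.adversarial
        # Student avoids all collisions.
        if involves_student:
            student_r -= 1.0
        # Teacher wants adversaries to collide with the student ...
        if involves_adversary and involves_student:
            teacher_r += 1.0
        # ... but not with other teacher-controlled actors.
        elif a.controller == "teacher" and b.controller == "teacher":
            teacher_r -= 1.0
    return teacher_r, student_r

# Example: adversarial actor 0 (teacher) collides with student actor 1.
actors = [Actor(0, "teacher", True), Actor(1, "student", False),
          Actor(2, "teacher", False)]
print(step_rewards(actors, [(0, 1)]))  # (1.0, -1.0)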
Architecture

We encode K lane-graph nodes and the state history of N actors over H history timesteps into D-dimensional features. A transformer backbone uses factorized attention to efficiently extract features before predicting steering and acceleration for all actors. The teacher policy additionally encodes whether an actor is adversarial and, if so, which actor it should target.
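One common way to factorize attention over N actors and H timesteps is to alternate attention over time within each actor with attention over actors within each timestep, reducing the cost from (N*H)^2 to roughly N*H^2 + H*N^2. Below is a minimal PyTorch sketch under that reading; the module structure and layer sizes are our assumptions, and the K lane-graph tokens are omitted for brevity.

# Minimal sketch of one factorized-attention block over actor features
# of shape (N actors, H timesteps, D channels). Layer choices are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class FactorizedAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.social = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, H, D). Temporal attention treats each of the N actors as
        # a batch element and attends over its H timesteps.
        t, _ = self.temporal(x, x, x)
        x = self.norm1(x + t)
        # Social attention: transpose so each timestep is a batch element
        # and attention runs over the N actors present at that step.
        xs = x.transpose(0, 1)            # (H, N, D)
        s, _ = self.social(xs, xs, xs)
        x = self.norm2(x + s.transpose(0, 1))
        return x

# Usage: N=16 actors, H=10 history steps, D=128 features.
feats = torch.randn(16, 10, 128)
out = FactorizedAttentionBlock()(feats)   # (16, 10, 128)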
Quantitative Results

After training, the student learns a more robust policy than the baselines while still maintaining realism.
Qualitative Examples

We show various qualitative comparisons against baselines on Argoverse 2 below. Colored actors are controlled and grey actors are replayed.
We’ve shown that self-play leads to a more robust student. Next, we evaluate the teacher’s ability to zero-shot generate scenarios for other policies. This allows us to first efficiently train the teacher at scale in lightweight simulation, before deploying it to interact with an end-to-end policy using high fidelity sensor simulation, without the need for expensive retraining.
Our results show that training using scenarios generated with asymmetric self-play leads to far more robust policies across different autonomy paradigms.
Qualitative Examples

We compare against an autonomy model that was trained only on real data.
We have presented an asymmetric self-play approach for learning to drive, where solvable and realistic scenarios naturally emerge from the interactions of a teacher and a student policy. We have shown that the resulting student policy can power more realistic and robust traffic-simulation agents across several datasets, and that the teacher policy can zero-shot generalize to generating scenarios for unseen end-to-end autonomy policies without expensive retraining. While the results are promising, we recognize some limitations. First, the types of scenarios the teacher finds are not controllable; incorporating advances in controllable traffic simulation, or exploring alternative reward designs and training schemes that encourage diversity, are interesting directions to explore. Exploring failure modes beyond collision (e.g., off-road events, unrealistic behaviors, perception failures) is another promising avenue for future work.
@inproceedings{zhang2024learning,
  title     = {Learning to Drive via Asymmetric Self-Play},
  author    = {Chris Zhang and Sourav Biswas and Kelvin Wong and Kion Fallah and Lunjun Zhang and Dian Chen and Sergio Casas and Raquel Urtasun},
  booktitle = {Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024},
  year      = {2024},
}