
DriveGATr: Efficient Equivariant Transformer for Self-Driving Agent Modeling

CVPR 2026

Scott Xu, Dian Chen, Kelvin Wong, Chris Zhang, Kion Fallah, Raquel Urtasun


Accurately modeling agent behaviors is an important task in self-driving. It is also a task with many symmetries, such as equivariance to the ordering of agents and objects in the scene, or equivariance to arbitrary roto-translations of the entire scene; i.e., SE(2)-equivariance. The transformer architecture is a ubiquitous tool for modeling these symmetries. While standard self-attention is inherently permutation equivariant, explicit pairwise relative positional encodings have been the standard for introducing SE(2)-equivariance. However, this approach introduces an additional cost that is quadratic in the number of agents, limiting its scalability to larger scenes and batch sizes. In this work, we propose DriveGATr, a novel transformer-based architecture for agent modeling that achieves SE(2)-equivariance without the computational cost of existing methods. Inspired by recent advances in geometric deep learning, DriveGATr encodes scene elements as multivectors in the 2D projective geometric algebra and processes them with a stack of equivariant transformer blocks. Crucially, DriveGATr models geometric relationships using standard attention between multivectors, eliminating the need for costly explicit pairwise relative positional encodings. Experiments on the Waymo Open Motion Dataset demonstrate that DriveGATr is comparable to the state of the art in traffic simulation and establishes a superior Pareto front for performance vs. computational cost.

Motivation

Understanding how traffic agents behave has many important applications in self-driving, from motion forecasting for autonomy to traffic modeling for simulation. One important property in this task is SE(2)-equivariance: predicted behaviors should transform consistently under rigid roto-translations of the scene. Relying on data diversity to learn an approximately equivariant function is sample inefficient and leads to poor generalization. On the other hand, approaches that directly encode SE(2)-equivariance as an inductive bias into the model typically model pairwise relationships explicitly, introducing an additional cost that scales quadratically with the number of agents. Our work seeks to guarantee SE(2)-equivariance with better scalability to larger scenes and batch sizes.

Method

We present an SE(2)-equivariant transformer grounded in the framework of geometric algebra. The 2D projective geometric algebra allows for a native representation of 2D objects and operators as 8-dimensional multivectors. We encode global poses as elements of this algebra, and apply our transformer-based architecture to produce an action for each agent in the scene.
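As a concrete (if simplified) illustration of this encoding, the sketch below builds the 8-dimensional algebra G(2,0,1) from scratch and applies an SE(2) motor to a point via the sandwich product. The blade ordering, sign conventions, and helper names are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Basis blades of the 2D projective geometric algebra G(2,0,1):
# {1, e0, e1, e2, e01, e02, e12, e012}; metric e0^2 = 0, e1^2 = e2^2 = 1.
BLADES = [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
INDEX = {b: i for i, b in enumerate(BLADES)}
METRIC = {0: 0.0, 1: 1.0, 2: 1.0}

def blade_mul(a, b):
    """Product of two basis blades -> (sign, blade), by sorting then contracting."""
    factors, sign = list(a) + list(b), 1.0
    for i in range(len(factors)):            # bubble sort into canonical order,
        for j in range(len(factors) - 1 - i):  # flipping sign per adjacent swap
            if factors[j] > factors[j + 1]:
                factors[j], factors[j + 1] = factors[j + 1], factors[j]
                sign = -sign
    out, k = [], 0
    while k < len(factors):                  # contract repeated generators
        if k + 1 < len(factors) and factors[k] == factors[k + 1]:
            sign *= METRIC[factors[k]]
            k += 2
        else:
            out.append(factors[k])
            k += 1
    return sign, tuple(out)

def gp(x, y):
    """Geometric product of multivectors stored as 8-dim coefficient arrays."""
    out = np.zeros(8)
    for i, a in enumerate(BLADES):
        for j, b in enumerate(BLADES):
            s, blade = blade_mul(a, b)
            out[INDEX[blade]] += s * x[i] * y[j]
    return out

REVERSE = np.array([1, 1, 1, 1, -1, -1, -1, -1], float)  # blade reversal signs

def point(x, y):
    """Euclidean point (x, y) as the bivector e12 + y*e01 - x*e02."""
    p = np.zeros(8)
    p[INDEX[(1, 2)]], p[INDEX[(0, 1)]], p[INDEX[(0, 2)]] = 1.0, y, -x
    return p

def motor(theta, tx, ty):
    """SE(2) element: rotate by theta about the origin, then translate by (tx, ty)."""
    r = np.zeros(8)
    r[0], r[INDEX[(1, 2)]] = np.cos(theta / 2), -np.sin(theta / 2)
    t = np.zeros(8)
    t[0], t[INDEX[(0, 1)]], t[INDEX[(0, 2)]] = 1.0, -tx / 2, -ty / 2
    return gp(t, r)

def apply(m, p):
    """Sandwich product m p m~ transforms geometric objects equivariantly."""
    return gp(gp(m, p), m * REVERSE)

# Rotate (1, 0) by 90 degrees, then translate by (2, 3): expect (2, 4).
q = apply(motor(np.pi / 2, 2.0, 3.0), point(1.0, 0.0))
x_out, y_out = -q[INDEX[(0, 2)]], q[INDEX[(0, 1)]]
print(round(x_out, 6), round(y_out, 6))  # 2.0 4.0
```

Because poses, points, and transformations all live in the same 8-dimensional space, a network can consume them as ordinary feature channels while the algebra keeps track of the geometry.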

The layers of a standard transformer are adapted to act on multivectors; for example, whereas a standard linear layer takes linear combinations of its real-valued inputs, a linear layer in DriveGATr takes linear combinations of its multivector-valued inputs. Weight tying ensures these new layers are equivariant, a property we prove algebraically. The scalability of DriveGATr comes from its ability to reason about pairwise geometric relationships through a standard dot-product attention mechanism, which avoids the additional computational cost of other equivariant methods.
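To make the weight-tying idea concrete, here is a minimal NumPy sketch of a grade-tied linear layer over multivector channels. The blade layout is assumed, and full GATr-style layers also carry extra e0-multiplication terms that this sketch omits; it is an illustration of the principle, not the paper's implementation.

```python
import numpy as np

# Assumed blade layout {1, e0, e1, e2, e01, e02, e12, e012}; grades 0,1,1,1,2,2,2,3.
GRADE = np.array([0, 1, 1, 1, 2, 2, 2, 3])

class EquiLinear:
    """Multivector channel mixing with weights tied within each grade.

    Rotor sandwich products act on multivectors grade by grade (they never mix
    grades), so taking the same linear combination of channels for every blade
    of a given grade commutes with the group action.
    """
    def __init__(self, c_in, c_out, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        # one (c_out, c_in) matrix per grade, shared by all blades of that grade
        self.w = rng.normal(size=(4, c_out, c_in)) / np.sqrt(c_in)

    def __call__(self, x):
        # x: (..., c_in, 8) -> (..., c_out, 8)
        w_full = self.w[GRADE]                       # (8, c_out, c_in)
        return np.einsum('boi,...ib->...ob', w_full, x)

# Equivariance check: any grade-preserving blade map (the rotor sandwich is one)
# is block-diagonal over grades and must commute with the layer.
rng = np.random.default_rng(1)
g = np.zeros((8, 8))
for grade in range(4):
    idx = np.where(GRADE == grade)[0]
    g[np.ix_(idx, idx)] = rng.normal(size=(len(idx), len(idx)))

layer = EquiLinear(c_in=3, c_out=5)
x = rng.normal(size=(3, 8))
lhs = layer(x @ g.T)   # transform inputs, then apply the layer
rhs = layer(x) @ g.T   # apply the layer, then transform outputs
print(np.allclose(lhs, rhs))  # True
```

The key point is that equivariance here costs nothing per pair of tokens: the constraint lives entirely in how the weights are tied, not in any pairwise computation.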

Qualitative Results

Given the same initialization, DriveGATr can generate realistic yet diverse rollouts just like other state-of-the-art methods in traffic simulation.

To demonstrate DriveGATr’s robustness to roto-translations, we overlay rollouts from the original coordinate frame vs. one rotated by 90° and translated 100 m forward. As expected, the Transformer and Transformer + DRoPE baselines are not robust to this change of coordinate frame.
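This robustness check can be phrased as a simple numerical test: transform the inputs, run the model, and compare against transforming the outputs. The `model` interface, helper names, and toy predictors below are hypothetical, for illustration only.

```python
import numpy as np

def equivariance_gap(model, poses, theta=np.pi / 2, shift=(100.0, 0.0)):
    """Largest discrepancy between predict-then-transform and
    transform-then-predict under a frame rotated by `theta` and
    translated by `shift`. `model` maps an (N, 2) array of agent
    positions to an (N, 2) array of predictions (hypothetical interface)."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    tf = lambda p: p @ rot.T + np.asarray(shift)
    return float(np.max(np.abs(model(tf(poses)) - tf(model(poses)))))

poses = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])

# A toy equivariant predictor: contract agents toward their centroid.
def contract(p):
    center = p.mean(axis=0, keepdims=True)
    return center + 0.5 * (p - center)

# A toy non-equivariant predictor: coordinate-wise saturation.
squash = np.tanh

print(equivariance_gap(contract, poses) < 1e-9)  # True
print(equivariance_gap(squash, poses) > 1.0)     # True
```

An exactly equivariant model drives this gap to numerical zero by construction, whereas models that merely learn approximate equivariance from data leave a gap that grows with how far the test frame strays from the training distribution.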

Quantitative Comparison

We compare DriveGATr against the state-of-the-art on the Waymo Open Sim Agents Challenge. We also compare against three baselines representative of how state-of-the-art methods achieve SE(2)-equivariance while sharing the same base architecture and learning algorithm as DriveGATr.

DriveGATr achieves better results than non-invariant baselines and comparable results to the invariant ones. However, where DriveGATr really shines is in its scalability. Compared to the Transformer and Transformer + RPE baselines, the loss envelope of our approach is significantly lower. In particular, for any number of training FLOPs, DriveGATr achieves the lowest training loss, demonstrating its superior scalability with training compute.

Conclusion

SE(2)-equivariance is an inherent symmetry in agent modeling problems, yet existing approaches achieve it at a high computational cost. We propose DriveGATr, a novel architecture for modeling traffic agents that guarantees SE(2)-equivariance while avoiding the high computational costs of existing approaches. By leveraging efficient 2D geometric algebra encodings and equivariant layer primitives, DriveGATr is expressive, efficient, and provably equivariant.

@inproceedings{xu2026drivegatr,
  title={DriveGATr: Efficient Equivariant Transformer for Self-Driving Agent Modeling},
  author={Scott Xu and Dian Chen and Kelvin Wong and Chris Zhang and Kion Fallah and Raquel Urtasun},
  booktitle={Conference on Computer Vision and Pattern Recognition},
  year={2026},
}