
TRAVL: Rethinking Closed-loop Training for Autonomous Vehicles

Published May 28, 2023; updated July 25, 2023


Chris Zhang*, Runsheng Guo*, Wenyuan Zeng*, Yuwen Xiong, Binbin Dai, Rui Hu, Mengye Ren, Raquel Urtasun
* denotes equal contribution
Conference: ECCV 2022
Categories: Autonomy, Simulation


Recent advances in high-fidelity simulators have enabled closed-loop training of autonomous driving agents, potentially solving the distribution shift between training and deployment and allowing training to be scaled both safely and cheaply. However, there is a lack of understanding of how to build effective training benchmarks for closed-loop training. In this work, we present the first empirical study that analyzes the effects of different training benchmark designs on the success of learning agents, such as how to design traffic scenarios and scale training environments. Furthermore, we show that many popular RL algorithms cannot achieve satisfactory performance in the context of autonomous driving, as they lack long-term planning and take an extremely long time to train. To address these issues, we propose trajectory value learning (TRAVL), an RL-based driving agent that performs planning with multistep look-ahead and exploits cheaply generated imagined data for efficient learning. Our experiments show that TRAVL learns much faster and produces safer maneuvers than all the baselines.


We find that diverse targeted scenarios are crucial in closed-loop training benchmarks and develop an efficient learning method for policies which plan and reason in trajectory space.


Trajectory Value Learning (TRAVL)

TRAVL learns to reason in trajectory space, which allows for better long-term planning and more efficient learning.

While typical approaches directly output an instantaneous control command, we learn to predict both the immediate and long-term value of following a longer horizon trajectory. At inference, we sample a set of candidate trajectories and select the best scoring trajectory, which is executed for a short period of time before replanning.
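The sample-score-execute loop described above can be sketched as a receding-horizon skeleton. This is illustrative only, not the paper's implementation: `score_fn`, `sample_candidates`, and `step_fn` are hypothetical stand-ins for the learned value network, the trajectory sampler, and the simulator.

```python
import numpy as np

def select_trajectory(candidates, score_fn):
    """Score each candidate trajectory and return the highest-scoring one.

    `score_fn` stands in for the learned value network that predicts the
    value of following a given trajectory from the current state.
    """
    scores = [score_fn(traj) for traj in candidates]
    return candidates[int(np.argmax(scores))]

def replanning_loop(state, sample_candidates, score_fn, step_fn,
                    n_cycles=3, exec_horizon=5):
    """Receding-horizon control: plan a long trajectory, execute only a
    short prefix of it, then replan from the resulting state."""
    for _ in range(n_cycles):
        best = select_trajectory(sample_candidates(state), score_fn)
        state = step_fn(state, best[:exec_horizon])  # execute a short prefix
    return state
```

Executing only a prefix before replanning lets the agent commit to a long-horizon plan while still reacting to how the scene actually evolves.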

Reasoning in trajectory space allows for efficient training with RL along with an additional model-based loss using imagined counterfactual data. We use Q-learning to supervise the long-horizon value of an executed trajectory. An approximate world model provides short-horizon supervision for counterfactual trajectories for more efficient learning.
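A minimal sketch of the two supervision signals, assuming a scalar accumulated reward and a fixed trajectory horizon: a bootstrapped Q-learning target for the executed trajectory, and a regression loss tying the predicted short-horizon values of imagined (non-executed) trajectories to approximate world-model rewards. The function names and the simplified targets are our own illustration, not the paper's exact formulation.

```python
import numpy as np

def td_target(reward, next_values, gamma=0.99, horizon=10):
    """Q-learning-style target for the executed trajectory: reward
    accumulated over the trajectory horizon plus the discounted best
    value among next candidate trajectories (bootstrapping)."""
    return reward + (gamma ** horizon) * np.max(next_values)

def counterfactual_loss(pred_short_values, model_rewards):
    """Model-based loss on imagined data: regress the predicted
    short-horizon values of counterfactual trajectories onto rewards
    computed by an approximate world model."""
    pred = np.asarray(pred_short_values)
    target = np.asarray(model_rewards)
    return float(np.mean((pred - target) ** 2))
```

Because the world model scores counterfactual trajectories without executing them in simulation, each environment step yields many supervised training signals rather than one.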

Experiment Results

We compare methods on our closed-loop scenario benchmark set by evaluating various safety and performance metrics: success rate, collision rate, progress, minimum time-to-collision and minimum distance-to-closest-actor.
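As an illustration of one of these metrics, minimum time-to-collision can be computed under a constant-velocity assumption. This simple longitudinal sketch is our own, not the benchmark's exact definition.

```python
def time_to_collision(gap_m, closing_speed_mps):
    """Constant-velocity TTC for a single lead-vehicle gap.

    Returns infinity when the gap is not closing (the vehicles are
    separating or holding distance), so only closing pairs matter.
    """
    if closing_speed_mps <= 0.0:
        return float("inf")
    return gap_m / closing_speed_mps

def min_ttc(gaps_m, closing_speeds_mps):
    """Minimum TTC over all actor pairs at one timestep; the metric
    reported over a scenario is the minimum across all timesteps."""
    return min(time_to_collision(g, v)
               for g, v in zip(gaps_m, closing_speeds_mps))
```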

TRAVL outperforms control (C) and trajectory (T) baselines. We hypothesize that trajectory-based methods outperform their control-based counterparts due to improved long-horizon reasoning, and the additional counterfactual supervision in TRAVL allows it to outperform all baselines.

Closed-loop Benchmark Design

In this work we study how to best build training benchmarks for effective closed-loop training of autonomous vehicles.

Using our Waabi World simulator, we can create both realistic free-flow and targeted scenarios. Free-flow scenarios are similar to what we observe in real-world data, where actors follow general traffic models and scenarios are generated by sampling parameters such as density and actor speed. Targeted scenarios on the other hand are generated by enacting fine-grained control on the actors to target specific traffic situations like actor cut-ins.
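The distinction can be made concrete with a toy scenario-sampling sketch: free-flow scenarios draw global traffic parameters and let actors follow a generic traffic model, while targeted scenarios script a specific actor interaction. All class names and parameter ranges below are hypothetical, not Waabi World's API.

```python
import random
from dataclasses import dataclass

@dataclass
class FreeFlowScenario:
    density_vpkm: float      # vehicles per kilometre of roadway
    mean_speed_mps: float    # mean actor speed

def sample_free_flow(rng, density_range=(5.0, 40.0),
                     speed_range=(20.0, 30.0)):
    """Free-flow: sample global parameters; individual actors then
    follow a general traffic model (ranges are illustrative)."""
    return FreeFlowScenario(rng.uniform(*density_range),
                            rng.uniform(*speed_range))

@dataclass
class CutInScenario:
    trigger_gap_m: float     # gap at which the actor begins the cut-in
    cut_in_speed_mps: float  # speed of the cutting-in actor

def sample_cut_in(rng):
    """Targeted: exert fine-grained control over one actor to force a
    specific interaction, here a cut-in in front of the ego vehicle."""
    return CutInScenario(rng.uniform(10.0, 30.0),
                         rng.uniform(15.0, 25.0))
```

Varying the trigger gap and cut-in speed yields a family of related but distinct interactions, which is the kind of behavioral variation the scaling study below examines.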

Targeted vs Free-flow Scenarios

We find that for both TRAVL and the next strongest baseline Rainbow + Trajectory (RB+T), training on targeted scenarios performs better than training on free-flow scenarios, even when evaluated on free-flow scenarios. This suggests that because targeted scenarios are designed to contain more interaction, they better capture a basis of ubiquitous driving skills.

Behavioral Scale and Diversity

We find that increasing scenario diversity and scale is crucial for improving safety metrics for many models: metrics improve as we train on a larger fraction of our available scenarios.

In particular, behavioral variation in our scenarios is important. Previous approaches typically rely primarily on map variation (i.e., geolocation), which we find to be less effective.

Qualitative Results

Free flow: In this first example, we see TRAVL driving in a free-flow scenario where it must merge onto the highway. We see TRAVL can execute the merge smoothly.

Actor cut-in: Next is a targeted scenario which tests a model’s ability to handle an actor cutting in. We see TRAVL reacts quickly and brakes accordingly.

Lane change: In this targeted scenario, the task is to lane change between the two actors. We see TRAVL has learned to slow down and make the lane change.

Merge: Finally, this targeted scenario initiates the agent at a very slow speed before a merge. We see TRAVL has learned to speed up to match the traffic flow before merging.


We have studied how to design traffic scenarios and scale training environments in order to create an effective closed-loop benchmark for autonomous driving. We have proposed a new method to efficiently learn driving policies which can perform long-term reasoning and planning. Our method reasons in trajectory space and can efficiently learn in closed-loop by leveraging additional imagined experiences. We provide theoretical analysis in the full paper and empirically demonstrate the advantages of our method over the baselines on our new benchmark.


@inproceedings{zhang2022rethinking,
  title        = {Rethinking Closed-loop Training for Autonomous Driving},
  author       = {Zhang, Chris and Guo, Runsheng and Zeng, Wenyuan and Xiong, Yuwen and Dai, Binbin and Hu, Rui and Ren, Mengye and Urtasun, Raquel},
  booktitle    = {Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXXIX},
  pages        = {264--282},
  year         = {2022},
  organization = {Springer}
}