Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation

ICRA 2026

Zimu Gong, Brian Zhaoning Zhang, Chris Zhang, Kelvin Wong, and Raquel Urtasun

by Waabi

Safety-critical scenarios are essential for the development of autonomous vehicles (AVs) but are rare in real-world driving data. While simulation offers a way to generate such scenarios, manually designed test cases lack scalability, and adversarial optimization often produces unrealistic behaviors. In this work, we introduce a conditional latent flow matching approach for scalable and realistic safety-critical scenario generation. Our method uses distribution matching to transform nominal scenes into safety-critical rollouts. Furthermore, we demonstrate that incorporating both simulation and real-world data enables our framework to efficiently generate diverse, data-driven scenarios. Experimental results highlight that our approach is able to more consistently and realistically generate novel safety-critical scenarios, making it a valuable tool for training and benchmarking AV systems.

Motivation

Rare events such as sudden cut-ins, near-miss interactions, and unexpected hard braking are precisely the situations where an AV's planning and decision-making policies are most challenged. Robust performance under these conditions is essential, but exposing AV systems to them in the real world is costly and dangerous. Simulation is therefore critical as it lets us evaluate behavior under safety-critical conditions before deployment.

The challenge is acquiring a sufficiently diverse and realistic set of safety-critical scenarios in the first place. Existing approaches all have shortcomings:

Hand-crafted scenarios are tedious to create and difficult to scale.
Adversarial optimization scales more easily, but real traffic participants aren't inherently adversarial. Thus the resulting scenarios often look unnatural and over-represent collisions while ignoring the rich space of near-misses.
Distribution matching / imitation learning is the right framing for realism, but it's data-hungry, and real safety-critical scenarios are rare. Naively upsampling the few we have leads to overfitting.

We want a data-driven method that produces realistic, distributionally faithful safety-critical scenarios by taking advantage of both rare real data and cheap simulated data.

Method

We propose Flow VAE, a generative framework that learns a flow in the latent space of a conditional VAE, mapping nominal scene latents to safety-critical ones.

Latent flow matching. We train a conditional VAE on a mixture of nominal and safety-critical scenarios. This gives us a stable latent space that encodes the full breadth of driving behaviors. We then freeze the VAE and train a flow-matching transformer that learns a transport between two distributions in this latent space:

Source: the prior distribution of latents (conditioned only on the scene initialization — past states and map).
Target: the safety-critical aggregate posterior — i.e., the distribution of latents the VAE encoder produces when it sees safety-critical futures.

At inference time, the prior encoder produces a latent for any nominal initialization, the flow transformer transports it toward the safety-critical region of latent space, and the VAE decoder rolls out the resulting actor states.
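The training signal for such a transport can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a straight-line probability path between a source latent z0 and a target latent z1 (a common choice in flow matching; the text above does not say which path is used), it omits the scene conditioning, and the names `interpolate`, `target_velocity`, `flow_matching_loss`, and `velocity_fn` are hypothetical.

```python
def interpolate(z0, z1, t):
    """Point at time t on the straight line from z0 (prior latent) to z1 (target latent)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def target_velocity(z0, z1):
    """Velocity of the straight-line path; it is constant in t."""
    return [b - a for a, b in zip(z0, z1)]


def flow_matching_loss(velocity_fn, z0, z1, t):
    """Mean squared error between predicted and target velocity for one (z0, z1, t) sample.

    velocity_fn stands in for the flow transformer (hypothetical signature:
    velocity_fn(z_t, t) -> list of floats). Scene conditioning is omitted here.
    """
    z_t = interpolate(z0, z1, t)
    predicted = velocity_fn(z_t, t)
    target = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


# Toy example: a "model" that already outputs the exact velocity has zero loss.
z0 = [0.0, 1.0]    # stand-in for a prior latent
z1 = [2.0, -1.0]   # stand-in for a safety-critical posterior latent
exact_model = lambda z_t, t: [2.0, -2.0]
loss = flow_matching_loss(exact_model, z0, z1, t=0.3)
```

In practice the loss would be averaged over many sampled (z0, z1, t) triples, with t drawn uniformly from [0, 1].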

Why latent space? Flowing in latent space (rather than directly in trajectory space) is what gives us realism and controllability. The VAE decoder maps any intermediate point along the flow trajectory to a plausible future, so doing a partial flow (e.g., stopping at t = 0.5) interpolates smoothly between nominal and safety-critical behavior.
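A partial flow can be sketched as a truncated ODE solve. The snippet below is an illustration under assumptions: it uses a fixed-step Euler integrator (the text does not specify the solver), and `integrate_flow` and `velocity_fn` are hypothetical names. The decoding step that turns the resulting latent into a rollout is omitted.

```python
def integrate_flow(velocity_fn, z0, t_stop=1.0, num_steps=10):
    """Euler-integrate dz/dt = velocity_fn(z, t) from t = 0 up to t = t_stop.

    t_stop = 0 returns the nominal latent unchanged, t_stop = 1 runs the full
    flow, and values in between give an intermediate latent.
    """
    if not 0.0 <= t_stop <= 1.0:
        raise ValueError("t_stop must lie in [0, 1]")
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    z = list(z0)
    dt = t_stop / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Toy example: a constant velocity field that moves [0, 0] to [2, -4] over t in [0, 1].
toy_velocity = lambda z, t: [2.0, -4.0]
halfway = integrate_flow(toy_velocity, [0.0, 0.0], t_stop=0.5, num_steps=8)
full = integrate_flow(toy_velocity, [0.0, 0.0], t_stop=1.0, num_steps=8)
```

With this toy field, `halfway` lands midway between the start and end latents, which is the latent-space analogue of the interpolation described above.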

Why two stages? If the VAE and the flow model are trained jointly, the VAE prior and posterior shift rapidly in early iterations and the flow model ends up chasing a moving target. Empirically, we found that decoupling the two stages results in more stable training.

Conditioning. The flow transformer is also conditioned on a discrete maneuver label (nominal / safety-critical / very safety-critical), automatically computed from heuristics on vehicle kinematics and time-to-collision. This gives users an explicit knob for scenario difficulty.
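As a rough illustration of how such a label could be derived, the sketch below bins a scenario by its minimum time-to-collision. Everything specific here is an assumption: the one-dimensional gap / closing-speed definition of time-to-collision, the 3.0 s and 1.5 s thresholds, and the function names are hypothetical, and the kinematic heuristics mentioned above are left out.

```python
import math

LABELS = ("nominal", "safety-critical", "very safety-critical")


def time_to_collision(gap_m, closing_speed_mps):
    """Seconds until the gap closes at the current closing speed; infinite if not closing."""
    if closing_speed_mps <= 0.0:
        return math.inf
    return gap_m / closing_speed_mps


def maneuver_label(gaps_m, closing_speeds_mps, critical_ttc_s=3.0, severe_ttc_s=1.5):
    """Assign a discrete label from the minimum time-to-collision over a rollout.

    Both thresholds are hypothetical placeholders, not values from the paper.
    """
    min_ttc = min(
        (time_to_collision(g, c) for g, c in zip(gaps_m, closing_speeds_mps)),
        default=math.inf,
    )
    if min_ttc < severe_ttc_s:
        return LABELS[2]
    if min_ttc < critical_ttc_s:
        return LABELS[1]
    return LABELS[0]


# Per-timestep gap (m) and closing speed (m/s) between the ego and one other actor.
label = maneuver_label(gaps_m=[30.0, 20.0, 10.0], closing_speeds_mps=[0.0, 5.0, 5.0])
```

Here the smallest time-to-collision is 10 / 5 = 2 s, so under the assumed thresholds the rollout is labeled "safety-critical".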

Architecture. Both the VAE and flow transformer share a common backbone of interleaved actor-to-map, actor-to-actor, and actor-to-time attention layers with PairPose relative positional encodings, so the model is viewpoint-invariant. We predict a separate latent per actor and decode steering and acceleration per actor.

Data: Mixing Real and Simulated

We build a mixture of nominal data, real safety-critical scenarios, and simulated safety-critical scenarios:

Real Safety-Critical: ~500 scenarios mined from real driving logs, featuring abrupt cut-ins, aggressive braking, and other challenging situations. High fidelity, but smaller scale.
Simulated Safety-Critical: ~10,000 simulated scenarios, generated by augmenting nominal traffic with a "hero actor" parameterized by an Intelligent Driver Model and scripted to perform cut-ins or hard brakes. We use rejection sampling to discard simulations that fail basic sanity checks.
Real Nominal: ~10,000 nominal real driving scenarios. These capture standard driving behaviors and situations.
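For readers unfamiliar with the Intelligent Driver Model mentioned above, its standard longitudinal acceleration law can be sketched as below. The parameter defaults are illustrative values, not the ones used to build the simulated set, and the scripting of cut-ins and hard brakes on top of the model is not shown.

```python
import math


def idm_acceleration(v, v_lead, gap, v0=30.0, T=1.5, s0=2.0, a_max=1.5, b=2.0, delta=4.0):
    """Standard Intelligent Driver Model acceleration (m/s^2).

    v: follower speed (m/s), v_lead: leader speed (m/s), gap: bumper-to-bumper gap (m).
    v0: desired speed, T: desired time headway, s0: minimum gap,
    a_max: maximum acceleration, b: comfortable deceleration, delta: exponent.
    All defaults are illustrative assumptions.
    """
    if gap <= 0.0:
        raise ValueError("gap must be positive")
    closing_speed = v - v_lead
    # Desired dynamic gap; the speed-dependent part is clipped at zero.
    s_star = s0 + max(0.0, v * T + v * closing_speed / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)


# A follower at 20 m/s approaching a stopped leader 10 m ahead brakes hard.
braking = idm_acceleration(v=20.0, v_lead=0.0, gap=10.0)
# On an effectively empty road from standstill, acceleration approaches a_max.
free_road = idm_acceleration(v=0.0, v_lead=0.0, gap=1e9)
```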

During training, we draw each safety-critical scenario from the real set with probability α_real and from the simulated set with probability (1 − α_real). As we'll show, this simple knob lets us trade off scale against fidelity, and the sweet spot lies somewhere in the middle.
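The sampling rule can be sketched in a few lines. This assumes α_real is the probability of picking the real safety-critical set and (1 − α_real) that of picking the simulated one; the function name and the scenario identifiers are hypothetical.

```python
import random


def sample_safety_critical(real, simulated, alpha_real, rng):
    """Pick the real set with probability alpha_real, else the simulated set,
    then draw one scenario uniformly from the chosen set."""
    if not 0.0 <= alpha_real <= 1.0:
        raise ValueError("alpha_real must lie in [0, 1]")
    source = real if rng.random() < alpha_real else simulated
    return rng.choice(source)


rng = random.Random(0)
real_ids = [f"real_{i}" for i in range(500)]        # stand-ins for mined real scenarios
sim_ids = [f"sim_{i}" for i in range(10_000)]       # stand-ins for simulated scenarios
batch = [sample_safety_critical(real_ids, sim_ids, alpha_real=0.5, rng=rng) for _ in range(10_000)]
real_fraction = sum(s.startswith("real_") for s in batch) / len(batch)
```

Note that sampling the source first, rather than pooling the two sets, keeps the real fraction at α_real even though the simulated set is roughly 20 times larger.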

Results

We evaluate Flow VAE on a held-out set of real safety-critical scenarios using a suite of metrics:

minSTTC (minimum scenario time-to-collision) — lower means more challenging scenarios.
Near-Miss % — fraction of scenarios with minSTTC < 3s but no collision. Higher is better.
SCR (scenario collision rate) — should be low; collisions are not the goal.
Distribution JSD on velocity, acceleration, jerk — measures realism of agent kinematics.
Displacement Error — L2 reconstruction error to ground truth, as an additional realism check.
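Three of these metrics are simple enough to sketch directly. The code below is illustrative only: it takes per-scenario minSTTC values and collision flags as given, and it assumes a base-2 Jensen-Shannon divergence over pre-binned, normalized histograms, since the binning and log base are not stated above. All function names are hypothetical.

```python
import math


def near_miss_rate(min_sttc_s, collided, threshold_s=3.0):
    """Fraction of scenarios with minSTTC below the threshold and no collision."""
    hits = sum(1 for ttc, hit in zip(min_sttc_s, collided) if ttc < threshold_s and not hit)
    return hits / len(min_sttc_s)


def collision_rate(collided):
    """Fraction of scenarios that contain a collision."""
    return sum(1 for hit in collided if hit) / len(collided)


def jensen_shannon_divergence(p, q):
    """Base-2 JSD between two normalized histograms with identical bins; lies in [0, 1]."""
    def kl(a, m):
        return sum(ai * math.log2(ai / mi) for ai, mi in zip(a, m) if ai > 0.0)

    mixture = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Four toy scenarios: two near-misses, one collision, one uneventful rollout.
min_sttc = [2.0, 4.0, 1.0, 2.5]
collided = [False, False, True, False]
nm = near_miss_rate(min_sttc, collided)
scr = collision_rate(collided)
same = jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
```

In this toy example the near-miss rate is 0.5 and the collision rate is 0.25; identical histograms give a divergence of 0 and fully disjoint ones give 1.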

Generating Realistic Safety-Critical Scenarios

Compared to a base CVAE, a CVAE trained only on curated safety-critical data, and STRIVE (a state-of-the-art adversarial-optimization baseline), Flow VAE produces the highest near-miss rate while keeping the collision rate low and matching the kinematic distribution of real driving more closely than STRIVE. STRIVE achieves a comparable near-miss rate but at the cost of a much higher collision rate (7.0%) and worse displacement error, reflecting the unrealistic-trajectory failure mode of adversarial methods.

Qualitatively, Flow VAE causes a leading actor to hard-brake on a highway, or makes a neighboring-lane actor aggressively cut in on the ego vehicle. A nice side benefit of the flow approach is that the model automatically picks which actor should interact with the ego and what maneuver to execute, and there's no need to design the interaction by hand.

Conditional Flow Ablation

Both the flow transformer and the maneuver conditioning matter. Conditioning alone is ineffective — we hypothesize it provides too much of a shortcut during VAE training and harms representation learning. Flow alone already gets us most of the way there; adding conditioning on top of flow yields the best overall model.

Sim-Real Data Composition

Sim-only data lacks fidelity; real-only data lacks scale. Mixing the two is dramatically better than either extreme, and empirically we find performance peaks around 40–60% real, where we get the highest near-miss rate (58.3%) along with the best displacement error. This validates that the simulated scenarios, despite their limited diversity, transfer meaningfully to the real evaluation distribution.

Controllability

The maneuver-label conditioning behaves as expected: as we move from "nominal" to "challenging," minSTTC drops from 3.26s to 1.97s and near-miss rate rises from 28.1% to 50.0%. The unconditional model sits in between.

We also get a second, finer-grained controllability axis for free: the flow sampling timestep. Decoding intermediate latents along the flow trajectory shows the rollout is nominal at t = 0 and becomes safety-critical at t = 1, with a smooth progression in between. Combined with the discrete maneuver label, this gives users a continuous dial for scenario difficulty.

Conclusion

We presented Flow VAE, a flow-matching transformer that learns to transport latents from nominal scene initializations to safety-critical ones. By training the underlying VAE on a mixture of nominal and safety-critical data — and the flow model on a curated mixture of real and simulated safety-critical scenarios — we generate realistic, diverse, and controllable safety-critical rollouts that outperform both imitation-only and adversarial baselines.

Our real-data curation and synthetic generation are proof-of-concept and have natural extensions: higher-fidelity vehicle models (e.g. with friction) could close the sim-to-real gap and enable richer scenarios such as slipping; scaling up data collection and synthetic generation would broaden the diversity of safety-critical events; and finer control over which maneuver is generated — not just how aggressive it is — is a promising direction for future work.

@inproceedings{gong2026conditional,
  title     = {Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation},
  author    = {Gong, Zimu and Zhang, Brian Zhaoning and Zhang, Chris and Wong, Kelvin and Urtasun, Raquel},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  eprint    = {2605.04366},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO}
}