Waabi World achieves an unprecedented 99.7% realism score, setting the bar for the AV industry.
One of the biggest challenges in developing autonomous vehicles (AVs) is gathering enough high-quality data to test the system so that we can be confident it can operate safely in the real world. Historically, this has involved extensive real-world driving to test whether the vehicle can handle a wide range of complex interactions. Humans, typically triage engineers, then meticulously analyze the data to identify interesting events, such as interventions where the autonomous system faltered, and use these instances to refine the system and improve its decision-making capabilities. The process is resource-intensive, inefficient, and ultimately insufficient as a solution.
The unpredictable nature of the real world means that, no matter how many miles an AV drives, it is essentially impossible to guarantee exposure to every potential real-world situation, especially safety-critical, low-frequency events. Life-threatening accidents happen roughly once for every 10 million miles driven by humans; a fatality occurs roughly once every 100 million miles. Considering that a human-driven truck covers at best 100,000 miles annually, it would take a fleet of 1,000 trucks an entire year, covering 100 million miles collectively, to potentially encounter just one fatal event. Recreating dangerous scenarios for testing purposes is not a viable alternative either: it is ethically problematic and often practically impossible.
Simulation technology has emerged as the key solution for addressing these challenges. Importantly, the utility of simulators scales across three distinct levels as they become more sophisticated. Standard simulators can support software testing, helping catch bugs and regressions introduced with new versions of the software. Advanced simulators can be used to understand how the system will perform and in which specific situations it will fail, exposing the deficiencies of the system. The final level is the holy grail: simulators that can provide robust evidence for scientifically sound safety cases. But relying on simulation for these high-stakes tests introduces a new, critical hurdle: ensuring the realism of the simulator.
Crucially, realism here refers not to visual appearance, but to the simulator's ability to provide such a faithful representation of the physical world that the autonomous system drives in the virtual environment the same way it would in the real world when presented with the exact same situation.
If the simulator does not meet this realism criterion, it can still serve as a test ground for basic software testing, but it is useless for assessing a system's performance, understanding where it might fail, or building an informed safety case.
If simulator realism can be proven across a comprehensive range of situations, then driving in simulation can dramatically reduce the volume of real-world driving required to test the system, providing a safer and more scalable testing framework for AVs and significantly raising the safety standards of the industry.
Measuring Realism: Outcome Over Visuals
While the industry has eagerly embraced simulation, it has largely neglected the crucial step of rigorously measuring and quantifying simulation realism. If we want to unlock AVs at scale using simulation, then it is essential that we establish a clear realism metric to ensure safety and accountability across the industry.
But how do you actually measure realism? Let's look at what we're trying to achieve with simulation in the first place. Simulation is designed to mimic the real world so that we can confidently predict how the system will perform in reality based on its virtual performance. It follows that the most direct way to measure realism is to compare how AVs drive in simulated scenarios to how they drive in identical real-world scenarios.
Researchers have attempted to assess simulator realism by analyzing the similarity between the distribution of behaviors in simulation versus the real world. For example, one can compare the distributions of AV speeds, accelerations, hard brakes, safety buffer violations, etc., computed over all scenarios driven in simulation against the distributions measured in the real world. The assumption is that if these statistical distributions are similar, then the simulator can be considered realistic. While this approach can provide a general indication of similarity, it falls short: it cannot assess whether the simulator is realistic enough to expose the specific deficiencies of the system, or whether it can be used for validation. This is because two systems could have similar aggregate distributions yet react very differently to each unique situation. For example, a hard brake in simulation could be caused by falsely detecting pedestrians in crowded scenes, while in the real world it could be caused by falsely detecting pieces of discarded tire, known as tire shreds, on a highway.
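To make this limitation concrete, here is a minimal sketch of the aggregate-distribution approach in Python. The speed samples and hard-brake tallies are hypothetical; the point is that a near-zero distributional distance can coexist with completely different per-scenario causes.

```python
# Minimal sketch: aggregate-distribution comparison (hypothetical data).
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical per-timestep AV speeds (m/s), pooled over all scenarios.
real_speeds = rng.normal(loc=25.0, scale=4.0, size=10_000)
sim_speeds = rng.normal(loc=25.1, scale=4.1, size=10_000)

# A small distance means the *aggregate* distributions look alike...
print(f"Speed distribution distance: {wasserstein_distance(real_speeds, sim_speeds):.3f}")

# ...but aggregates say nothing about *why* each behavior occurred.
# The same overall count of hard brakes can arise from different causes:
real_hard_brake_causes = {"false pedestrian in crowded scene": 12, "tire shred on highway": 3}
sim_hard_brake_causes = {"false pedestrian in crowded scene": 3, "tire shred on highway": 12}
assert sum(real_hard_brake_causes.values()) == sum(sim_hard_brake_causes.values())
```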

A far superior and more trustworthy approach to measuring realism is pair-setting. This involves meticulously recreating a set of real-world scenarios within the simulator (matching the actors, their appearance, their precise behavior, the weather, illumination, and road conditions) and then measuring the difference in the AV's trajectories across the two environments. This outcome-focused approach provides a concrete, measurable basis for evaluating realism.
Leveraging Digital Twins for Pair-Setting
But how do you create these meticulously matched scenarios, and how do you do it at scale so that the evidence is statistically significant? The key lies in creating rich, data-driven digital twins of real-world scenarios, which can then serve as precise, comprehensive inputs that seed the simulation, enabling the systematic generation of paired, identical tests. Recent breakthroughs in neural simulation, such as NeRF, 3D Gaussian Splatting, and UniSim, have made this kind of digital twin generation not only possible, but achievable at scale.
Importantly, these digital twins encapsulate all elements of the real world, including the precise behaviors of actors (vehicles, pedestrians, cyclists, animals, etc.), their appearance, environmental conditions like weather and illumination, infrastructure such as cones and traffic lights, and even subtle variations in road surface characteristics such as ice.
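As a rough illustration, a digital twin record might be structured along the following lines. This is a hypothetical sketch; the field names are illustrative, not Waabi World's actual schema.

```python
# Hypothetical sketch of a digital twin record (illustrative fields only).
from dataclasses import dataclass, field


@dataclass
class ActorTrack:
    actor_type: str                                # "vehicle", "pedestrian", "cyclist", "animal", ...
    appearance_id: str                             # reference to the reconstructed 3D asset
    waypoints: list[tuple[float, float, float]]    # observed (t, x, y) behavior


@dataclass
class DigitalTwin:
    scenario_id: str
    actors: list[ActorTrack] = field(default_factory=list)
    weather: str = "clear"                         # e.g. "rain", "fog"
    illumination: str = "day"                      # e.g. "dusk", "night"
    infrastructure: list[str] = field(default_factory=list)  # cones, traffic lights, ...
    road_surface: str = "dry"                      # e.g. "wet", "ice"
```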
Each digital twin is then used once per scenario as the preroll content that seeds the simulator under evaluation, which is then unrolled iteratively in closed loop. The AV receives the simulated sensor data at every time stamp and generates the steering and acceleration actuations that the simulator uses to update the position of the self-driving vehicle in the virtual world. The simulator then moves the AV through the world according to its decisions and the vehicle dynamics, and the other actors in the scene react to it. This loop runs for a fixed duration, typically tens of seconds, which is sufficient to capture all key aspects of any scenario.
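The control flow of this closed-loop unroll can be sketched as follows. `Simulator` and `AVStack` are hypothetical interfaces standing in for the simulator under evaluation and the autonomy software; only the loop structure reflects the procedure described above.

```python
# Minimal sketch of the closed-loop unroll (hypothetical interfaces).
from typing import Protocol


class Simulator(Protocol):
    def seed(self, digital_twin) -> None: ...                    # preroll with the digital twin
    def render_sensors(self) -> dict: ...                        # simulated sensor data at this step
    def step(self, steering: float, accel: float) -> None: ...   # move the AV; actors react
    def av_pose(self) -> tuple[float, float]: ...                # current (x, y) of the AV


class AVStack(Protocol):
    def act(self, sensors: dict) -> tuple[float, float]: ...     # -> (steering, accel)


def unroll_closed_loop(sim: Simulator, av: AVStack, digital_twin, steps: int) -> list[tuple[float, float]]:
    """Run one paired test: seed the simulator with the digital twin, then
    iterate for a fixed duration, recording the AV's trajectory."""
    sim.seed(digital_twin)
    trajectory = []
    for _ in range(steps):
        sensors = sim.render_sensors()     # simulator -> AV
        steering, accel = av.act(sensors)  # AV decides its actuations
        sim.step(steering, accel)          # simulator applies vehicle dynamics; actors react
        trajectory.append(sim.av_pose())
    return trajectory
```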

This procedure can be used to evaluate a wide range of closed-loop simulators, including blackbox neural simulators that directly generate sensor observations, as well as controllable simulators that explicitly model many aspects of simulation, such as actor behaviors, sensor simulation, and system latency. Neural simulators such as Waabi World, as well as traditional video game and physics engine simulators, are all examples of controllable simulators.
The result per scenario is two detailed trajectories of the self-driving vehicle, one from the real-world test and one from the simulation, capturing the AV's behavior in each environment. Because both trajectories are based on the exact same underlying scenario, we can directly compute realism metrics by measuring the differences between the trajectory generated by executing the full autonomy system in closed-loop simulation and the trajectory executed in the real world with the same autonomy software release.
Realism Score = (1 − average relative distance) × 100
The equation for computing realism metrics.
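As a rough illustration, the score could be computed from a pair of time-aligned trajectories as below. This is a minimal sketch that assumes "relative distance" means the pointwise gap between the paired trajectories normalized by the distance traveled so far; the exact normalization Waabi uses is not spelled out here, so treat it as an assumption.

```python
# Minimal sketch of the realism score (the normalization is an assumption).
import numpy as np


def realism_score(real_traj: np.ndarray, sim_traj: np.ndarray) -> float:
    """real_traj, sim_traj: (T, 2) arrays of time-aligned (x, y) positions
    of the AV in the same scenario, driven in reality and in simulation."""
    gaps = np.linalg.norm(real_traj - sim_traj, axis=1)       # per-step gap (m)
    segments = np.linalg.norm(np.diff(real_traj, axis=0), axis=1)
    traveled = np.maximum(np.cumsum(segments), 1e-6)          # distance traveled so far (m)
    relative_distance = gaps[1:] / traveled                   # gap relative to distance traveled
    return (1.0 - relative_distance.mean()) * 100.0           # e.g. 99.7 for near-identical runs


# The fleet-level score would then average over all paired scenarios:
# scores = [realism_score(real, sim) for real, sim in paired_trajectories]
```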
Several factors can contribute to the differences observed between these two trajectories. Each module in the simulator can produce outputs that deviate from what happened in the real world, and these deviations can compound through the system and over time, resulting in the AV exhibiting very different behavior. Variations in actor behavior, differences in simulated sensor data, differences in system latency compared to the actual hardware, and vehicle dynamics approximations (such as not accurately modeling gear shifts) can all change how autonomy perceives and reacts to its surroundings. For example, failing to replicate a false positive that produced a hard brake in the real world, or a vehicle encroaching on the AV's lane, will result in a very different trajectory.
Setting an Industry Standard with Waabi World
At Waabi, we’ve leveraged this approach to rigorously validate the realism of our neural simulator, Waabi World. We’ve conducted extensive paired tests and achieved an unprecedented 99.7% realism score.
This astonishingly high degree of realism provides irrefutable evidence that our simulator faithfully replicates the real-world driving experience and is a monumental breakthrough for the industry. This result not only validates Waabi’s approach to safety but also establishes a transformative standard for the industry. Going forward, all AV developers using simulation need to be able to publicly demonstrate the quantified realism of their simulators. Just as we have safety standards for vehicles themselves, we must establish clear and measurable standards for the simulators on which their safety depends.
This transparency and accountability are absolutely paramount for building public trust in AV technology, and I call on all AV companies to embrace this new standard and prioritize simulator realism as a collective responsibility.