Towards Scalable Coverage-Based Testing of Autonomous Vehicles

CoRL 2023

James Tu, Simon Suo, Chris Zhang, Kelvin Wong, Raquel Urtasun


To deploy autonomous vehicles (AVs) in the real world, developers must understand the conditions in which the system can operate safely. To do this in a scalable manner, AVs are often tested in simulation on parameterized scenarios. In this context, it is important to build a testing framework that partitions the scenario parameter space into safe, unsafe, and unknown regions. Existing approaches rely on discretizing continuous parameter spaces into bins, which scales poorly to high-dimensional spaces and cannot describe regions with arbitrary shape. In this work, we introduce a problem formulation which avoids discretization: by modeling the probability of meeting safety requirements everywhere, the parameter space can be partitioned using a probability threshold. Based on our formulation, we propose GUARD as a testing framework which leverages Gaussian Processes to model probability and levelset algorithms to efficiently generate tests. Moreover, we introduce a set of novel evaluation metrics for coverage-based testing frameworks to capture the key objectives of testing. In our evaluation suite of diverse high-dimensional scenarios, GUARD significantly outperforms existing approaches. By proposing an efficient, accurate, and scalable testing framework, our work is a step towards safely deploying autonomous vehicles at scale.

Overview

This paper tackles the problem of parameterized scenario testing for autonomous vehicles (AVs). Every point in the parameter space corresponds to a concrete scenario where the AV will either pass or fail according to safety requirements. The goal of testing is to understand the AV's performance across the parameter space.


Motivation

Autonomous vehicles (AVs) can revolutionize the way we live by drastically reducing accidents, relieving traffic congestion, and providing mobility for those who cannot drive. To realize this future, developers must first ask the question: "Is the AV safe enough to be deployed in the real world?" To answer this question, we must understand in which scenarios the AV can meet safety requirements. Towards this goal, it is important to build a testing framework which covers the wide range of scenarios in the AV's operational domain and further identifies whether the AV is safe or unsafe in these scenarios.


Parameterized Scenario Testing

To cover the wide range of real-world scenarios in a scalable manner, the self-driving industry often relies on simulation, where the traffic environment is fully controllable and long-tail events can be synthesized. A popular approach to describing real-world events in simulation is through Logical Scenarios. Each Logical Scenario is designed to follow a high-level description (e.g. SDV follows its lane, actor cuts in), and exposes configurable parameters that detail the low-level characteristics of the scenario (e.g. velocity of actors, road curvature). A specific combination of parameter values then corresponds to a Concrete Scenario which can be executed in simulation to determine if the AV complies with functional safety requirements. Compliance is typically captured as a binary pass or fail determined by regulatory demands. For example, an AV could fail if it violates a safety distance threshold.
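To make the distinction between Logical and Concrete Scenarios more tangible, here is a minimal Python sketch; the class and parameter names are illustrative and not an actual scenario description format.

from dataclasses import dataclass

@dataclass
class LogicalScenario:
    # High-level description plus the ranges of its configurable parameters.
    description: str
    parameter_ranges: dict  # parameter name -> (low, high)

    def concretize(self, values: dict) -> "ConcreteScenario":
        # Fixing every parameter yields an executable Concrete Scenario.
        assert values.keys() == self.parameter_ranges.keys()
        return ConcreteScenario(self.description, values)

@dataclass
class ConcreteScenario:
    description: str
    parameters: dict  # parameter name -> value

cut_in = LogicalScenario(
    description="actor cuts in ahead of the SDV",
    parameter_ranges={"actor_speed": (5.0, 30.0), "cut_in_gap": (2.0, 20.0)},
)
test = cut_in.concretize({"actor_speed": 12.0, "cut_in_gap": 6.0})
# Executing `test` in simulation returns a binary pass/fail against the safety requirements.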


Efficient and Accurate Testing

We cannot directly test every single parameter combination, since there can be infinitely many possibilities when the parameters are continuous. Instead, testing involves executing a finite set of concrete scenarios and estimating whether the AV will pass or fail on unseen tests. In our testing framework GUARD, we use a Gaussian Process (GP) to leverage executed tests and estimate the probability of passing or failing across the parameter space. The parameter space can then be partitioned into pass, fail, and unknown regions using a probability threshold.
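As a minimal sketch of this partitioning step, the snippet below uses scikit-learn's GP classifier as a stand-in for GUARD's GP model; the threshold value and helper names are assumptions for illustration, not the paper's implementation.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

def partition_space(X_executed, y_passfail, X_query, threshold=0.9):
    # X_executed: (n, d) parameters of executed concrete scenarios.
    # y_passfail: (n,) binary outcomes (1 = pass, 0 = fail).
    # X_query:    (m, d) unseen parameter combinations to label.
    gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
    gp.fit(X_executed, y_passfail)
    p_pass = gp.predict_proba(X_query)[:, 1]  # probability of meeting the requirement

    labels = np.full(len(X_query), "unknown", dtype=object)
    labels[p_pass >= threshold] = "pass"
    labels[p_pass <= 1.0 - threshold] = "fail"
    return labels, p_pass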

We efficiently sample concrete tests according to two criteria. Intuitively, samples on the boundary between passing and failing are the most informative for partitioning the parameter space. On the other hand, it is also beneficial to sample tests where the GP is uncertain about the outcome. In the testing process, we repeatedly sample a test according to these criteria, observe the performance metric, and update the GP.
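The snippet below sketches these two criteria with a straddle-style acquisition, a standard heuristic from level-set estimation used here as a stand-in for GUARD's exact sampling rule; it scores candidates higher when they sit near the pass/fail boundary and where the GP is still uncertain.

import numpy as np

def straddle_score(mu, sigma, boundary=0.5, kappa=1.96):
    # mu, sigma: GP posterior mean and standard deviation at candidate tests.
    # The first term rewards uncertainty; the second penalizes distance from the boundary.
    return kappa * sigma - np.abs(mu - boundary)

def select_next_test(candidates, gp_mean, gp_std):
    # Pick the candidate that best trades off boundary proximity and uncertainty.
    # Testing loop: execute it in simulation, observe pass/fail, update the GP, repeat.
    return candidates[np.argmax(straddle_score(gp_mean, gp_std))]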

GPs model the relationships between similar tests using kernels. Selecting good kernel hyperparameters is crucial for the GP to accurately model the pass/fail landscape. Tuning hyperparameters by hand is tedious and does not scale across many different logical scenarios. We make GUARD scalable by automatically learning the kernel hyperparameters that best fit the observed data.
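A minimal sketch of this automatic tuning, assuming a scikit-learn-style GP whose fit() maximizes the log marginal likelihood of the observed outcomes; the ARD kernel assigns one learned length scale per scenario parameter, and the dimensionality shown is only an example.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

d = 4  # number of scenario parameters in this logical scenario (example value)
kernel = 1.0 * RBF(length_scale=np.ones(d))  # ARD kernel: one length scale per parameter
gp = GaussianProcessClassifier(kernel=kernel, n_restarts_optimizer=5)
# gp.fit(X_executed, y_passfail) tunes the length scales by maximizing the
# log marginal likelihood, so no per-scenario hand-tuning is needed.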

Compared to existing approaches, GUARD achieves higher test coverage and more accurately estimates whether the AV will pass or fail. In the plots below we evaluate four metrics (a sketch computing them follows the list):

  • Coverage measures the percent of the parameter space that was covered.

  • Balanced Accuracy measures the accuracy of the pass/fail predictions the testing framework makes across the space that is covered.

  • Error Recall measures the percentage of failures across the space that the framework is able to discover. Identifying these failures is especially important for autonomy development.

  • False Positive Rate measures the percentage of predicted passes that are actually failures. It is crucial to keep this rate low, as failing to catch failures can have severe consequences.
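The sketch below computes these four metrics exactly as defined above, assuming the "pass"/"fail"/"unknown" labeling from the earlier sketch and binary ground truth (1 = pass, 0 = fail); the helper names are illustrative.

import numpy as np
from sklearn.metrics import balanced_accuracy_score

def testing_metrics(pred_labels, gt_pass):
    # pred_labels: per-point "pass" / "fail" / "unknown" predictions from the framework.
    # gt_pass:     per-point ground-truth outcomes (1 = pass, 0 = fail).
    pred_labels = np.asarray(pred_labels)
    gt_pass = np.asarray(gt_pass)
    covered = pred_labels != "unknown"
    pred_pass = pred_labels == "pass"
    gt_fail = gt_pass == 0

    coverage = covered.mean()  # fraction of the space given a pass/fail prediction
    bal_acc = balanced_accuracy_score(gt_pass[covered], pred_pass[covered].astype(int))
    error_recall = ((pred_labels == "fail") & gt_fail).sum() / max(gt_fail.sum(), 1)
    # predicted passes that are actually failures
    false_positive_rate = (pred_pass & gt_fail).sum() / max(pred_pass.sum(), 1)
    return {"coverage": coverage, "balanced_accuracy": bal_acc,
            "error_recall": error_recall, "false_positive_rate": false_positive_rate}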

To visually demonstrate how GUARD is able to achieve superior testing performance, we visualize the parameter space. Since a high-dimensional space is difficult to visualize, we select a 2D slice. On this slice we show the ground truth pass and fail regions, as well as those predicted by different testing frameworks. GUARD is able to more accurately model the pass/fail regions since it does not discretize the parameter space.
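A sketch of how such a slice can be rendered, assuming two parameters are swept on a dense grid with all others held fixed; the plotting details are illustrative and not the paper's figure code.

import numpy as np
import matplotlib.pyplot as plt

def plot_slice(grid_x, grid_y, gt_pass_grid, p_pass_grid, threshold=0.9):
    # grid_x, grid_y: meshgrids over the two sliced parameters (all others held fixed).
    # gt_pass_grid:   ground-truth pass (1) / fail (0) on the grid.
    # p_pass_grid:    GP probability of passing on the grid.
    pred = np.full(p_pass_grid.shape, 0.5)        # unknown
    pred[p_pass_grid >= threshold] = 1.0          # predicted pass
    pred[p_pass_grid <= 1.0 - threshold] = 0.0    # predicted fail

    fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharex=True, sharey=True)
    axes[0].pcolormesh(grid_x, grid_y, gt_pass_grid, shading="auto")
    axes[0].set_title("ground truth pass/fail")
    axes[1].pcolormesh(grid_x, grid_y, pred, shading="auto")
    axes[1].set_title("predicted pass/fail/unknown")
    plt.show()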

GUARD In Practice

In practice, GUARD is also a useful tool for benchmarking different iterations of the AV. Between two versions of the AV, GUARD can identify which one is less likely to violate a safety requirement. Furthermore, we can triage specific instances of regression. Here we show the pass/fail landscape of the two versions of the autonomy system and highlight the region in the parameter space where the AV went from passing to failing.
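A minimal sketch of this triage step under the labeling convention from the earlier sketches: compare the per-point predictions for the two AV versions and surface the parameter combinations that went from passing to failing. The helper name is illustrative.

import numpy as np

def find_regressions(X_query, labels_old, labels_new):
    # Return query points predicted "pass" for the old version but "fail" for the new one.
    labels_old = np.asarray(labels_old)
    labels_new = np.asarray(labels_new)
    regressed = (labels_old == "pass") & (labels_new == "fail")
    return X_query[regressed]  # concrete scenarios to triage in simulation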


Conclusion

We propose an efficient and scalable framework for coverage-based testing. Our approach achieves higher test coverage and evaluation accuracy compared to other approaches common in the industry. In practice, GUARD can serve as a valuable tool for benchmarking different versions of the AV and identifying specific cases of regression. The framework can be used in practice with functional safety experts defining a comprehensive set of safety requirements and a parameterized operational design domain (ODD). Our work is ultimately a step towards safely deploying AVs at scale.

BibTeX

@inproceedings{tu2023towards,
  title     = {Towards Scalable Coverage-Based Testing of Autonomous Vehicles},
  author    = {James Tu and Simon Suo and Chris Zhang and Kelvin Wong and Raquel Urtasun},
  booktitle = {Conference on Robot Learning (CoRL)},
  year      = {2023},
}