In the second of our series on test case reduction we’ll look at using test case reducers for more than just debugging.
Test case reducers are tools for automatically transforming a test case demonstrating a bug in some piece of software into a smaller, simpler, variant that demonstrates the same bug. Check out our prior post on an overview of test case reduction, which centred around this algorithm:
bestTestCase = initialTestCase;
while (keepTrying()) {
for (evenSimplerTestCase : simpler(bestTestCase)) {
if (triggersBug(evenSimplerTestCase)) {
bestTestCase = evenSimplerTestCase;
break;
}
}
}
return bestTestCase;
As a reminder, keepTrying() and simpler() are built into the test case reducer: keepTrying() determines when the reducer should give up on minimising the test case further, and simpler() yields one or more test cases that are smaller and simpler than the given test case. A domain-specific reducer will implement simpler() based on baked in knowledge of the input format of the test cases it is designed to reduce, while in a domain-agnostic reducer, simpler() will make more generic simplifications, such as deleting lines from the test case.
In contrast, triggersBug() is a user-supplied procedure. It should return true if and only if the test case triggers the bug of interest, and it is up to the user to encode what it means for the bug to be triggered – this often boils down to running the system under test on the test case and checking for a particular return code or error message.
From bugs to “interestingness”
The name triggersBug() suggests that test reduction is necessarily tied to reducing tests that trigger bugs.
But actually, there is nothing special about bugs: triggersBug() is an arbitrary user-provided test, which could check for any property of interest. For example it could:
- Check that a video player consumes more than 10Gb of RAM when run on an input file
- Check that a JavaScript validator outputs some specific warning message when applied to a piece of code
- Check that a PDF viewer does not crash when used to open a PDF document
- Check that two different web browsers both accept an input as a valid html file, but render it differently
Performing test case reduction with each of these example triggersBug() functions might be useful in shedding light on properties of the system under test and/or its space of inputs. In the first case, we may discover a simple video that consumes a lot of memory when played. In the second case we’ll end up with a small piece of JavaScript that teases out the warning. In the third, somewhat trivial case, we’ll get insight into the smallest PDF document that can be considered valid. In the final case we’ll obtain a simple example that exposes a (possibly acceptable) difference in behaviour between browsers.
It seems wrong to call the procedure triggersBug() if we’re going to use it for these completely different purposes. Rather, we’re using triggersBug() to tell us whether a test case is interesting, for whatever notion of “interesting” we care about. So let’s change its name to isInteresting(), and refer to it as the “interestingness test” (a term used commonly in the test case reduction literature).
By setting up the interestingness test to check other things, test case reducers can be used for tasks such as generating high coverage test suites, program slicing, code optimisation, and even fuzzing, as we shall see.
As far as we know, the idea of using test case reduction for more than just debugging was first formally studied in the paper “Cause reduction: delta debugging, even without bugs” (author version, publisher version), by Alex Groce et al. The following examples can all be seen as instances of cause reduction.
Test case reduction for increasing test suite coverage
Suppose we’re developing some software and our regression test suite doesn’t currently cover a particular line of code L. We come across a large input that does cover L – perhaps an input from real-world usage of the system, or an input generated by a fuzzer. Thus we know that L can be covered, it’s just that we don’t have a suitably small test case in our regression test suite that covers it.
We can obtain such a test case by writing an interestingness test that returns true if and only if the line L is covered when the system under test is run on the test case. We can then run test case reduction on the large test case. The end result will be a reduced version of this test case that still covers L. The reducer will have stripped away as much of the test case as possible, so long as L is still covered.
Can we add the resulting reduced test case directly to our regression test suite? Probably not: it likely doesn’t do anything meaningful. But we could either:
- Spend a few minutes tweaking the test case to make it meaningful, equipping it with a suitable oracle, at the same time ensuring that L is still covered, or
- Use the test case to characterise the way the system under test currently behaves, so that changes in this behaviour in the future can be flagged up for investigation. This is known as “Golden Master Testing”, and can be particularly useful when working with legacy code bases – see this article and this post for more details.
The idea of reducing tests based on coverage was proposed and investigated in the “Cause Reduction” paper. More recently, Ally and his former colleagues at Google used the idea to generate new conformance test suites for Vulkan, the modern open standard for 3D graphics (a successor to OpenGL), by using the GraphicsFuzz fuzzer to generate graphics workloads that cover parts of open source graphics drivers that are otherwise uncovered by the Vulkan conformance test suite, and then reducing them down to small workloads that (after some manual tweaking) could be used to enhance the test suite. This has led to hundreds of new tests being added to the Vulkan conformance test suite – this experience report contains the technical details.
Test case reduction for program optimisation
Precimonious is a precision tuning tool that aims to improve the performance of programs that use floating-point operations. This is achieved by changing the types of floating-point variables to use lower precision, so long as certain accuracy constraints are still met and the performance is improved. This is useful in cases where the developer has cautiously used expensive double-precision floating-point operations throughout their code, to ensure high accuracy, but where in fact high precision is only required in certain key calculations, and faster, lower-precision operations can be used elsewhere in the code without negatively impacting accuracy.
This may seem unrelated to test case reduction, but the approach is based on a modified version of delta debugging (see the “Domain-agnostic reducers” part of our earlier post for details on delta debugging) where instead of deleting lines, the simpler() procedure (see the algorithm at the start of the post) involves changing the type of a variable to a lower-precision type, e.g., replacing a 64-bit double with a 32-bit float. The interestingness test, isInteresting(), checks whether the reduced-precision program still meets required accuracy constraints and also exhibits better performance. Thus this is a nice example where test case reduction can be used for program optimisation.
See the paper “Precimonious: Tuning Assistant for Floating-Point Precision” (author version, publisher version, tool), by Cindy Rubio-González et al., for more details.
Test case reduction for program understanding
Program slicing techniques aim to identify the parts of a program’s source code that may influence whether the program exhibits a particular behaviour. Traditionally, a program slicing tool would be designed for a particular programming language, with the slicing process being driven using baked-in knowledge of the syntax and semantics of the language.
A 2014 paper, “ORBS: language-independent program slicing” (author version, publisher version), by David W. Binkley et al., takes a very different approach to slicing, which is language independent: it does not require knowledge of a language’s syntax and semantics and can thus be applied in the context of essentially any language, as well as to software written in multiple programming languages.
The idea is to slice a program by deleting lines of source code that turn out to be irrelevant to a given slicing criterion. If P is the program being sliced then, roughly speaking, a slicing criterion says that:
- The program S obtained by deleting lines from P must compile and run successfully.
- The results of running P on a specified set of test vectors must be identical to the results of running S on the same set of test vectors.
When iterated, slicing in this manner ideally leaves behind only those parts of the code base that are required for the program to build, and for the given set of tests to pass. The quality of this kind of slicing depends heavily on the quality of the set of test vectors that are used: they serve as a proxy for the semantics of the language or combination of languages in which the program is written. The tests should be chosen so that they do a good job of characterising the desired behaviour that one wishes to slice with respect to. Under-specifying this behaviour will lead to relevant parts of the code being removed, while including tests that check irrelevant behaviours will lead to parts of the code being kept unnecessarily.
Again, this can be viewed as test case reduction. The program P is the test case, simpler() involves deleting lines from the program’s source files, and isInteresting() involves checking that the program still builds and that it is observationally-equivalent to the original program with respect to the test vectors of interest.
The ORBS paper provides a lot more technical detail; the definition of a slicing criterion is more nuanced than our sketch above, and the authors show that a custom deletion operator is necessary to make the approach practical.
Test case reduction for fuzzing
In a 2015 blog post, John Regehr demonstrated how a test case reducer can be used as a fuzzer.
He took this trivial C++ program:
#include <iostream> int main() { std::cout << "Hello World!++" << std::endl; }
and pushed it through the C preprocessor. Because this program includes the <iostream> header, after preprocessing it turns into a walloping 500+ Kb of code! (Yes, C++ compilers have to work hard!)
He then ran C-Reduce, a test case reducer that specifically targets C and C++ programs (discussed in our previous blog post), on the preprocessed program, with isInteresting() configured to return true if and only if the reduced program (a) compiles, and (b) prints output containing “Hello” on execution.
(Incidentally, this can be seen as an instance of observation-based program slicing, discussed above.)
After many successful and unsuccessful reduction attempts, C-Reduce was eventually able to shrink the hundreds of kilobytes of standard library code down to a very small program including just the minimum code required to print “Hello”.
What’s interesting about this is that during its journey from the large, preprocessed program to the fully minimised “Hello” program, C-Reduce produced lots and lots of gnarly, twisted C++-like strings which it fed to the C++ compiler under test as part of isInteresting(). These by-products of the reduction process turned out to trigger a large number of segfaults, internal errors and assertion failures in the GCC and Clang compilers!
This idea – which to our knowledge has remained largely unexplored beyond John’s blog post – shows that a test case reducer can be turned into a fuzzer by giving the reducer:
- An artificial but non-trivial notion of what constitutes an interesting test case
- A large initial interesting test case
and then seeing whether, among the stream of candidate reduced test cases that the reducer creates, there are tests that trigger crashes or vulnerabilities when fed directly to the system under test.
Summary
Test case reducers were originally designed for aiding the debugging process by producing small test cases demonstrating specific bugs. But actually, they can be parameterised to yield a fully general tool for program understanding, that can be used to simplify a test case subject to any criterion of interest. In this post, we have provided a brief overview of some useful criteria that go beyond bugs – coverage, performance optimisation, program slicing and fuzzing – and shown how they can allow test case reducers to be useful in practice in a variety of ways.
Bio: Ally Donaldson is a Professor at Imperial College London, and is the main author of GraphicsFuzz and spirv-fuzz, two metamorphic fuzzing tools for graphics shader compilers, each with an associated test case reducer. David MacIver is the main author of Hypothesis, a popular property-based testing tool for Python that makes heavy use of test case reduction.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.