PL Perspectives

Perspectives on computing and technology from and for those with an interest in programming languages.

Modern compilers are expected to efficiently translate code written in constantly-evolving source languages into optimized code in increasingly sophisticated and expanding machine languages; it should be no surprise that they are wont to do so imperfectly. Thus, an important question is: How can we increase our confidence that a compiler is correct? Typically, we do so using testing: we compile a large (manually curated) collection of code, and if everything works as expected, the compiler is probably doing its job. There are, however, a few flaws in this basic approach.

This blog post describes a better approach, random compiler testing, which has proved highly effective at uncovering compiler bugs. The post’s focus is on research carried out on random compiler testing for C and C++ compilers, first reported in our SPLASH/OOPSLA 2020 Distinguished paper “Random Testing for C and C++ Compilers with YARPGen.”

Random Compiler Testing and YARPGen

What is wrong with just testing a compiler by using it to compile a large collection of code? First, it is not always clear what that code should do when the compiler is correct. Suppose you compile a very large software system, such as Android or Windows, and running it produces a crash. Was a compiler bug responsible? Or was this just a bug in Android? Second, even if a compiler is capable of correctly compiling Android, this doesn’t necessarily mean it will compile the next collection of code correctly. We’d like to do a better job testing compilers.

One approach is to use random compiler testing. The idea is simple: Generate a program randomly (i.e., using randomly generated numbers to determine what sorts of expressions and statements it contains), then compile it with a compiler and see what it does. To know whether it’s doing the right thing, compile the same program with another compiler and then see if it behaves the same way as when compiled with the first. If so, that’s good; generate a new program and try again. If not, then one of the compilers must be wrong, and we can investigate the reasons why. This technique has been used to test compilers since at least the 1960s, and it is — for reasons we do not completely understand — often extremely effective in finding bugs. For example, a recent research effort by Dr. Zhendong Su’s group at ETH Zurich used random testing to find more than 1,600 compiler bugs over a period of a few years.

YARPGen

The goal of our research project was to explore two new ideas in random compiler testing, to see if they are effective. Both ideas are implemented in a tool that we created called YARPGen, and both ideas are pretty simple! We’ll describe them, but first let’s take a look at the big picture.

YARPGen merely generates random programs. To make it usable in practice, we wrote a program that runs YARPGen, runs several compilers on programs that it generates, monitors the compilers for crashes, and several other tasks. The big picture looks like this:

Random Compiler Testing with YARPGen

An important point is that we look for two very different kinds of compiler bug. First, those that cause the compiler to crash or otherwise fail to produce an executable program. Second, those where the compiler produces an executable, but that executable is somehow defective: it performs the wrong computation. In either case, our framework takes a program that triggers a compiler bug and runs C-Reduce on it. C-Reduce is an external program that takes a large C or C++ program that triggers a compiler bug and turns it into a small program that still triggers a compiler bug. Thus, our framework can produce output like this without any human intervention:

These two pieces of C code, compiled together, reveal a bug in LLVM at the -O2 optimization level. Next we’ll look at the two main research ideas that we wished to explore when we started working on YARPGen.

Generation Policies

Our first idea is motivated by the observation that, perhaps paradoxically, it can be the case that random programs, even if they are all different from each other, can end up being similar, in an important sense. How? Consider a random image generator that we’re going to use to test a new cat detection algorithm. We generate random images by assigning a random color to each pixel in the image. Unfortunately, almost every such image looks like a blotchy grey field. It might take many years (or many millenia) of testing before we randomly generate a collection of pixels that is sufficiently cat-like to trigger our algorithm. This kind of testing is probably just a waste of time. Our images lack any structure that is coordinated across multiple pixels.

YARPGen does not generate programs as sequences of random bytes: this would work every bit as poorly as the image example we just discussed. Rather, it knows how to generate code that is syntactically and semantically valid C and C++. However, even after doing this, our random programs would lack some kinds of high-level structure that we believed was needed to trigger compiler bugs. We designed a mechanism called generation policies to help add this kind of structure. 

Each generation policy applies to some part of the program being generated, and has the effect of making it more likely, or less likely, to generate certain constructs. For example, one generation policy is designed to stress-test a compiler optimization called “common subexpression elimination”, which looks for repeated occurrences of the same computation in the code being compiled, and computes it only once. So this generation policy maintains a buffer of previously generated subexpressions, and injects these into subsequent expressions.

Numeric Constants

We have several generation policies for dealing with numeric constants. Why? Consider this compiler optimization, which results in the generated code being slightly smaller:

(x + -1) – y → ~y + x

This transformation, and many similar ones, requires a specific value, -1, for it to work. If we generate numeric constants uniformly across the range of -2^31..2^31-1 or -2^63..2^63-1, then -1 will almost never appear, and we will fail to find defects in the implementations of these optimizations. Thus, we have a policy that generates numeric constants that are commonly scanned for by optimizations, such as 0, 1, -1, the largest signed integer value, etc. Other optimizations don’t require a specific constant, but rather require multiple constant values that are related to each other. For example, if the constant c1 is the bitwise complement of constant c2, then:

(( x | y) & c1 ) | (y & c2)

can be rewritten as:

x & c1

To ensure that this kind of optimization has the opportunity to happen, YARPGen ensures that some constant values that it generates are closely related to previous constants, such as being their complement, their predecessor or successor, etc.

Code Contexts

Application code is often idiomatic. For example, cryptographic code tends to have densely-packed bitwise operations. Compilers are equipped to optimize such code, but we can only expose bugs in these optimizations with reasonably high probability if we generate similarly idiomatic code. YARPGen, therefore, has a notion of “code contexts” that preferentially generate subsets of the language’s operators:

  • additive: addition, subtraction, negation
  • bitwise: binary negation, and, or, xor
  • logical: and, or, not
  • multiplicative: multiplication, division
  • bitwise-shift: binary negation, and, or, xor, right/left shifts
  • additive-multiplicative: addition, subtraction, negation, multiplication, division

A fragment of code that was generated in bitwise-shift context might look like this:

a = ~b & ((c | d) ^ (e << 4));

Code contexts are applied to random sub-regions of the generated program and can overlap each other.

Program-Wide Parameters

A final kind of generation policy that YARPGen supports is to randomly set certain parameters that apply to an entire generated program. So, for example, one program generated by YARPGen might have 80% of its variables have the type “char,” and the next one might have only 3% of these. There are a total of 23 different parameters whose values are randomly chosen at the start of program generation.

To test whether generation policies are beneficial, we conducted a study where we attempted to rediscover some known compiler bugs using both YARPGen and also a modified YARPGen that does not use generation policies at all. We found that generation policies are highly effective in helping discover compiler bugs that are triggered with low probability. The generation policies that we built into YARPGen were developed by hand, based on our observations about program characteristics. Automatically devising generation policies, perhaps using a machine-learning-based approach to detect interesting characteristics of real C and C++ programs, would be an interesting future project.

Undefined Behavior Avoidance

When a Java program attempts to access an array element that is out of bounds, the program error is caught and an exception is thrown. C and C++ take a different approach to such errors: they completely ignore them. If your C++ program attempts to access an array outside its bounds, your program might immediately crash, it might keep executing in a corrupted state, or it might continue to execute as if nothing bad had happened. These languages contain hundreds of “undefined behaviors,” which is just a fancy way of saying “program errors that probably cause something very bad to happen, but we’re not saying what it is — so good luck with that!” A partial list of undefined behaviors in C can be found in Annex J of the C99 specification.

For people testing compilers, the problem with undefined behaviors is that different compilers, or even different optimization levels of the same compiler, take programs that execute undefined behaviors and make them have different results. These different results are not because of compiler bugs, but rather because of program bugs! Here is a simple example where the GNU C compiler turns an undefined program (which violates a somewhat obscure rule about when it is legal to modify a variable) into an executable printing “3” and the Intel C compiler turns the same program into an executable printing “4.” Neither compiler is wrong, and if we try to convince the developers of either compiler that this discrepancy is evidence of a bug, they will get annoyed with us for wasting their time.

To avoid this problem, YARPGen creates programs without undefined behaviors. It does this by embedding an interpreter for C and C++ inside the generator, which looks for undefined behaviors in the randomly generated code. When undefined behavior is found, it rewrites the code in such a way that the problem is avoided. For example, the subexpression on the left, which is undefined because it contains a signed integer overflow (see Section 6.5, paragraph 5 of the C99 standard), can be rewritten as the subexpression on the right:

Avoiding Undefined Behavior by Eliminating Signed Overflow

This process is repeated in a bottom-up fashion on expression trees, and across multiple expressions in a program, giving the final invariant that the program is free of all undefined behaviors:

Iteratively Eliminating Undefined Behavior

These rewrite rules simplify generator structure because they are applied locally and guaranteed to succeed. YARPGen’s strategy for avoiding undefined behaviors was largely motivated by previous random program generators, such as Csmith, which inserted runtime checks into generated code to avoid undefined behavior. Our belief is that those runtime checks sometimes make it harder to detect compiler bugs, so we wanted to completely avoid using them.

What Bugs Does YARPGen Find?

Over a two year period we were able to find and report more than 220 bugs in GCC, LLVM, and the Intel® C++ Compiler. The majority of bugs that we reported have been fixed by compiler developers. Most of the bugs were in the compilers’ optimizers. Broken down by compiler, the results are: 

YARPGen Finds Bugs

About 10% of the LLVM bugs that we found were independently rediscovered and reported by application developers.

Conclusion

YARPGen is effective in finding bugs in C and C++ compilers that have been missed by other testing efforts, and it is open source software. We’re currently working on a new version of YARPGen that focuses on finding bugs in loop optimizations.

Bio: Vsevolod Livinskii is a PhD student in computer science at the University of Utah, Dmitry Babokin is a Staff Software Engineer at Intel, and John Regehr is a professor of computer science at the University of Utah.

Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.