Compiling a program is a fascinating, complex journey through which a high-level program becomes lower-level executable code. How is such a journey planned and realized? After developing the techniques, how to teach them to others?
There is lately much discussion of the pedagogy of compiler classes. In this blog post we would like to spotlight one design of compiler construction classes: the presentation order of the material. Usually, compiler classes present a compiler’s phases in the order they’re performed. We call this the “forward” direction. At Tel Aviv University, we’ve experimented with teaching compilers “backward”. This post presents this approach and describes what we see as its key benefits.
Compiler Phases, Taught in Reverse
Let’s first briefly describe, in the “forward” fashion, how a compiler operates. First, the input program’s text is converted, through lexical and syntactic analysis, to an abstract syntax tree (AST), a representation that’s easy to manipulate. Second, the compiler performs a series of semantic checks, ensuring that the program “makes sense” – variables are defined before they’re used, types add up, etc. In the process the compiler constructs a symbol table, and attributes a type to every expression, ingredients which are also used in later stages. The compiler then proceeds to the third stage of generating code in a low-level intermediate representation (IR), using the symbol table and typing information from before. Finally, the IR code is translated to the target language (say, x86 assembly).
The backward order teaches these phases in reverse: first, code generation; then semantic analysis; and then syntactic analysis. (Ours is not the first class to attempt the backward order – see the acknowledgments for an elaboration.)
Why teach the course backwards?
The first and foremost conceptual challenge in a compiler class is to understand how high-level code can be mapped to low-level code. The backward order first clarifies the compiler’s goal, the “what” of compiled programs before the “how” of the various compilation phases; and it’s useful to know where you’re going to understand how to get there.
After setting the goal, the backward order works back from this less-familiar end to the more-familiar means. This is a common problem-solving strategy. Some studies have shown that a backward heuristic is preferred by novices when they solve problems which experts solve forward (Larkin et al. 1980, Simon & Simon 1978). Crucially, we feel that studying the compiler’s stages backward better motivates each of them. (This is similar to the approach of the famous Nand to Tetris class.)
A Backward Structure of a Compiler Project
How does the backward order work? In our class, the students develop a compiler that translates from MiniJava – a subset of Java – to the LLVM low-level intermediate representation (LLVM-IR), based on materials kindly provided by Jens Palsberg, Hal Perkins, and Yannis Smaragdakis.
We structured the project as four independent exercises (see the figure):
- Variable and method renaming: The input is a program’s AST, and the task is to rename a variable or method; that is, to produce a new AST with the new name in the declaration and uses. The renaming has to take into account scoping, variable hiding, the class hierarchy, and inheritance. This exercise is not part of the compiler per se, but develops basic skills that are necessary for the next exercises (rather than bundling them with the later exercises): implementing passes over the AST using visitors, and constructing a symbol table. The analysis assumes that the input AST represents a program that is semantically valid, including typing rules and rules for valid inheritance.
- LLVM-IR code generation: The input is a program’s AST, and the task is to produce equivalent code in LLVM-IR. Students generate low-level translations of high-level constructs while employing the symbol table they implemented in the previous exercise. All this under the assumption – to be enforced in the next exercise – that the input AST represents a program that’s semantically valid, including typing rules; the translation relies on the declared static types (without additional checking) to choose storage type and low-level instructions. Thus, if the input program doesn’t conform to the semantic rules, the translation may fail to generate code, or generate incorrect code. This observation leads to the next exercise.
- Semantic analysis: The input is a program’s AST, and the task is to decide whether it conforms to a list of semantic rules. We derived the list from the assumptions we made during code generation (including those we at first forgot we’d assumed…).
- Syntactic analysis: Finally, the input is a program code in text, and the task is to generate an AST that captures it, or report an error if the program doesn’t match the language’s grammar. This accounts to both lexical analysis and parsing, using lexer and parser generators.
This order of implementation is enabled by:
- AST input: In all but the last exercise, the input is not a usual textual representation of programs, but ASTs. We used an XML encoding of a specific AST structure we chose, and supplied code to decode XML files into a Java class hierarchy which students’ code manipulates. This way we could defer parsing until the last exercise, but it meant that students had to directly encode the ASTs of their test programs. Students weren’t exactly thrilled about the prospect of writing ASTs by hand, but it mostly worked out OK.
- Simplified typing: The first two exercises were assigned before we discussed type analysis, which is part of the third exercise. But type information is essential for both these tasks (as outlined above). We found that relying on the static type definitions – of which Java has plenty – suffices, if we adopt another restriction: that method calls are disallowed on the result of a method call (x.f().g() is an error). We included this restriction in the semantic checks students had to verify. (In principle code can be automatically transformed to obey this rule via variable extraction.) Overall, in this way, types can be used before they’re validated.
Listed backward, of course.
There’s more to a program than its syntax
This is a mental barrier to overcome. Working solely with an AST representation of programs until very late helps with that.
Lowering in the spotlight
In the backward approach, the correspondence between the high-level code students write and a low-level code implementation is already the topic of the second exercise, which is the first exercise that is part of the compiler itself. We think that this is a good focus.
“Well-typed programs cannot go wrong”
The way we pitched the exercise on semantic checks earlier, the role of the semantic checks is to ensure that the input program satisfies all the assumptions employed in the code generation phase. Since students have already implemented code generation and used the assumptions, this ascribes meaning to the long and seemingly-pointless list of semantic checks. This is because we expect from the compiler to generate correct code for every program that passes the semantic checking phase – a very tangible manifestation of the type safety principle that well-typed programs cannot go wrong.
Parsing is well-motivated
Parsing constructs an AST. How ASTs facilitate the compiler’s analysis, and that they are preferable to manipulating the program text, should be well understood at this point, after implementing multiple analyses in the various exercises.
There and back again
One of our takeaways from this experience is that compilation is best understood holistically; the interfaces between compilation phases and their interdependencies are truly understood only when taken as a whole. It’s an interesting thought experiment, then, to consider these dependencies in both directions, rather than only in the usual, forward, way.
We enjoyed teaching the class backward, but also found that some aspects were harder to explain in this workflow.
Where does assembly generation fit in?
We were uncertain about when to teach the translation from IR to assembly, with the special challenge of activation records. Asking students to tackle these first, as a “pure” backward approach mandates, is to start with what may be considered less-central themes. (We taught assembly generation but didn’t ask students to implement it, due to time constraints.) An alternative is to pose assembly generation as a concluding project of yet another full compiler, this time of IR to assembly. Similar questions arise for when to teach compiler optimizations.
What are static types?
Code generation in an object-oriented language builds on the difference between static and dynamic types. We assumed familiarity with these concepts would come from a prior class on object-oriented programming, but for students who struggled with this we found the distinction hard to explain before discussing type checking.
The other role of semantic checks?
Semantic checks are not only important for the compiler, to guarantee the validity of code generation; they have another role: alerting the programmer to nonsensical code (even if the compiler can somehow “overcome” the error and generate code!). This other role may seem secondary when teaching backward.
Why runtime checks?
When implementing code generation, students must generate code to perform runtime checks, but other checks are “deferred” to the earlier stage of semantic checks. The reasoning behind the stage at which each check is performed doesn’t become clear until the discussion of semantic checks and static analysis.
Why this specific AST structure?
In our current structure, “programs” are synonymous with “ASTs” until very late, in the parsing exercise. This may give the illusion that our choice of AST is intrinsic to the language. We feel that students cannot be expected to make informed choices when designing the AST before seeing a full compiler, whichever direction used, but alluding to a fixed AST structure is sometimes too easy (we were sorely tempted when we explained the difference between semantic and syntactic errors). Also, not constantly referring to the language’s grammar made it harder for students to become accustomed to the syntactic subset of the language they were compiling.
An altogether different course structure is when the compiler is built incrementally to support more and more language features; where applicable, it solves the question of order by working on all phases together. In a more traditional structure, a “backward approach” to teaching compilers is just one possible answer to several questions, most prominent of which are how to make the compiler’s end-goal tangible, and how to motivate parsing and semantic checks as effectively as possible. The experience of implementing a code analysis outside the traditional compiler setting, in the exercise on variable renaming, is a cool bonus. The backward approach also raises questions of its own, particularly along the axis of how the compiler filters invalid programs, which may benefit from the forward, operational direction. We hope to see further ideas on achieving a synthesis of these approaches.
Bio: Yotam Feldman is a PhD student at Tel Aviv University, working on formal methods, hence talking about forward and backward (reachability) on most days. Mooly Sagiv is a professor of Computer Science at Tel Aviv University and the CEO of Certora. Mooly’s hobbies include Program Analysis, Program Verification, and Running.
Acknowledgments: Our class was inspired by Steve Zdancewic’s class at UPenn (which was inspired by Greg Morrisett’s class at Harvard and Andrew Myers’s class at Cornell), and drew on materials for compiling MiniJava graciously provided by Jens Palsberg, Hal Perkins, and Yannis Smaragdakis. A back-to-front approach to compiler classes was employed in other classes as well, and also discussed in a previous PL Perspective by Lindsey Kuper.
We thank Oren Ish-Shalom, Jens Palsberg, Hila Peleg, Hal Perkins, and Steve Zdancewic for graciously lending us their materials, Oren Ish-Shalom and Hila Peleg for consulting on project design, and Oren Ish-Shalom, Jens Palsberg, James R. Wilcox, and Steve Zdancewic for comments and suggestions on a draft of this post.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.