This is the third part of our series on using deep learning for programming-related tasks.
In our first post, we discussed a number of challenging tasks that can be automated by using machine-learning models to predict properties of programs. In the second post, we elaborated on some of these tasks.
In this post, we will discuss the task of automatic code completion. The goal of this task is to automatically complete partial code by predicting code based on context. This form of code completion may seem unorthodox to a PL synthesis audience, as it does not take into account any form of explicit specification. Indeed, the only form of conditioning for this task is on the context in which completion is performed. Conditioning on context only, without additional specification, represents the common use case of writing code in an editor.
Consider the two code snippets below. At the top, a user has written a Java method and has left some “holes” that she wasn’t sure about. At the bottom, our model has made complex predictions (full expressions, i.e., going well beyond a single token) to fill in the missing holes. In contrast to approaches like Sketching, which try to find an assignment for holes that makes the program satisfy a given specification (often in the form of a simpler / higher-level program), our completion task does not take any additional specification. It also provides multiple assignments that correspond to the most likely completions as learned by our neural model.
Our neural model does not require a partial program with holes, specifically; it can generate code completions based on any partial context. Furthermore, the model can be used for likelihood-based validation. To perform such validation, the trained model hides some of the code, compares the hidden code with its own predictions, and highlights parts of the code that do not match the predictions. These mismatches between the actual code to the likely predictions signal that code might be unlikely or unconventional (which could, and often does, correspond to buggy code).
Trivially, this problem can be modeled as predicting the next token conditioned on a sequence of existing tokens, or conditioned on a sequence of existing API calls (see early non-neural work such as Hindle et al., 2012 and Raychev et al., 2014). Alternatively, we can let a learning model leverage the (known) syntax of the language, by phrasing this problem as predicting the next node in a partial abstract syntax tree (Bielik et al., 2016; Brockschmidt et al., 2019).
Code Completion – Structural Language Models of Code
Standard language models define the probability of a natural language sequence by decomposing the probability of a sentence into a product of conditional word probabilities:
That is, the probability of the entire sentence is decomposed to the product of the probabilities of every word given the previous words.
But how can we decompose the probability of a code snippet, ?
The paper by Alon et al. (demo: http://www.AnyCodeGen.org) addressed this question using a Structural Language Model (SLM). Given a program and its AST , the main idea is to induce an ordering over the nodes in the AST: , and decompose the probability of the tree into a product of conditional node probabilities:
The paper proposes to compute each conditional node probability using the set of partial paths in the AST, originating from each existing tree leaf into the predicted node (partial paths were explained in part 1 of this series).
By conditioning on part of the program and predicting a missing subtree, the model was demonstrated on the task of generating missing subtrees in existing programs. For example, in the following example, the expression x > 1 is generated in depth-first order, conditioned on the rest of the code; at each step, the model generates the next node given all “incoming” AST paths:
In the figure, dashed lines denote the AST structure, and solid lines denote the AST paths. At each step, the probability of the next node is computed given the path from the root (shown in red) and the path from every given leaf up to the current node to expand (partially shown as blue paths in the figure). After the terminal node with the value x is predicted ((d)), originating from this leaf is also used to compute the probability of the next nodes.
Since there are multiple incoming paths and only a single predicted node at each step, all paths go through multiple layers of self-attention. This allows all paths to contextualize and interact with each other, before jointly predicting the next node that is mutual to all paths.
Applied to a dataset of real open-source code, the model was shown to predict code completions better than both structural models and state-of-the-art neural machine translation (NMT) models such as Transformers and LSTMs. The following example is a method from the hadoop-common repository on GitHub. The highlighted expression was correctly generated by the model given the surrounding code.
In contrast to code2vec and code2seq that use AST paths to read programs, SLMs use AST paths to generate programs, by predicting the next node in all paths and generating the target AST node-by-node. For additional examples, see the online demo: http://www.AnyCodeGen.org
The neural model presented here predicts code completions based on a (structured) context. The problem of combining such models with symbolic specifications remains an interesting open challenge.
Bio: Eran Yahav is an associate professor of Computer Science at the Technion, Israel, and the CTO of Codota. Eran loves program synthesis, machine learning on code, and static analysis.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.