The vast amount of code available on the web is increasing on a daily basis. Hosting sites such as GitHub contain billions of lines of source code. Community question-answering sites, such as Stack Overflow, provide millions of code snippets with corresponding text and metadata. Executable binaries provide an even greater repository of code. This is rich, often high-quality data. It can be used to support automating many of our daily programming tasks. By automating the mundane and repetitive parts of programming, developers can focus on the more creative aspects of work.
In recent years, we have seen a lot of exciting work on using machine-learning models for predicting program properties. Specifically, deep neural models have been shown to be very useful for a variety of programming-related tasks (labeling, captioning, summarization, retrieval). In this post, we informally introduce some of the challenge problems in this line of work, and discuss recent research that addresses them. It is the first in a series of posts that will discuss different aspects of deep learning over code.
What programming problems are amenable to machine learning? Several have seen recent interest.
Semantic Labeling of Code Snippets (Code -> Label)
Given a code snippet, this task attempts to predict a descriptive label. In fact, this task is more general, and can be used for prediction of any program element given a full or partial program (as done, for example here and here). For the purpose of this post, we focus on the case in which we predict a single descriptive label, similar to predicting descriptive method names or variable names (PLDI’18, ICLR’18).
Code Captioning (Code -> Text)
Given a code snippet, this task attempts to predict a natural language caption that describes the code snippet. Early solutions for this task (e.g., CodeNN, ConvAttention) were based on a textual representation of the code snippet and often relied on variants of sequence-to-sequence (seq2seq) architecture borrowed from machine translation.
Code Generation (Code -> Code)
Given a code snippet with a “hole”, this task attempts to generate a completion (that is larger and more complex than a single token). The main challenge here is that the space of possible completions is infinite; additionally, considering all elements in the pre-existing program is very difficult. Some works made simplifying assumptions and limited the prediction space to simple expressions, API-heavy completions, or allowed a budget of compiler runs. The generation task can also be defined with additional context: for example, a natural language description given instead of a partial program (ACL’17, ACL’18), or given together with a partial program and class members, or APIs.
Bug detection is actually not a single task, but a wide variety of tasks with different characteristics. Common targets for deep-learning-based detection include variable-misuse bugs (ICLR’18, ICLR’19) and the bugs flagged by name-based bug detectors.
Semantic Code Search (Text -> Code)
Programmers rely heavily on code search, and even experienced programmers look for code examples all the time. There has been a large body of work trying to map natural language queries to corresponding code snippets (see FSE’19 for evaluation of some approaches). Most of these approaches embed natural language and code into the same vector space, and find code for a textual query by ranking snippets according to their distance from the query in this shared space.
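To make the shared-space idea concrete, here is a minimal sketch of the retrieval step only. The three-dimensional vectors are hand-written stand-ins (an assumption for illustration); in a real system they would come from trained text and code encoders, and the names `CODE_EMBEDDINGS` and `search` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: in practice a code encoder would produce these.
CODE_EMBEDDINGS = {
    "def read_lines(path): return open(path).read().splitlines()": [0.9, 0.1, 0.0],
    "def sort_desc(xs): return sorted(xs, reverse=True)": [0.1, 0.9, 0.1],
    "def http_get(url): return urllib.request.urlopen(url).read()": [0.0, 0.2, 0.9],
}


def search(query_embedding, code_embeddings, top_k=1):
    """Return the top_k snippets closest to the query in the shared space."""
    ranked = sorted(
        code_embeddings.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [snippet for snippet, _ in ranked[:top_k]]


# Stand-in for the embedding of the query "sort a list in descending order".
query_embedding = [0.2, 0.8, 0.0]
print(search(query_embedding, CODE_EMBEDDINGS))
```

With these toy vectors the query lands closest to the `sort_desc` snippet. The hard part, which this sketch skips entirely, is training the two encoders so that related text and code end up near each other.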
Neural techniques for tackling these tasks are based on learning a (latent) statistical model from a large amount of code and using the model to make predictions in new programs. A major challenge in these techniques is how to represent instances of the input space to facilitate learning.
There is a wide range of different representations, some of which are outlined in the figure above. While there are representations that are based on dynamic and symbolic information, here we focus on representations that are based on static information. On the left side of the figure, a direct approach to learning is using the surface text of the program as its representation. While simple, learning directly from the token stream places a heavy burden on the learning model, which needs to re-learn the (known) language syntax from scratch. As we go towards the right-hand side of the figure, the static analysis becomes deeper, and analysis effort increases. This allows the learning to leverage more semantic information about the programs, and therefore (often) decreases the learning effort. Of course, going too deep with the analysis might make the analysis itself too task-specific, too language-specific, or just too expensive to apply to a large corpus of code.
We conclude this post by elaborating on a few popular representations.
Recent years have seen tremendous progress on learning from textual representations due to progress in NLP. Increasingly sophisticated architectures, backed by overwhelming computation power (e.g., 512 TPUs), are used to tackle various language modeling tasks. These can be directly applied to PL tasks, and indeed are often used as baselines in PL papers. With this approach, however, the learning effort is often prohibitively expensive in terms of training time and resources, or in the amount of training data required to feed the model. Depending on the specific task, this approach may not be feasible. For example, in the semantic-labeling task, state-of-the-art text-based approaches required significantly more resources to obtain inferior results when compared to models that leveraged the program syntax. The intuitive explanation is that when learning from the surface text, the model has to have sufficient capacity to (re-)learn the programming language syntax.
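As a rough illustration of what a purely textual model receives, the sketch below flattens a snippet into its token stream using Python's standard `tokenize` module. The snippet and the helper name `token_stream` are chosen here for illustration and do not come from any of the cited systems.

```python
import io
import tokenize

SOURCE = "total = price * (1 + tax)\n"


def token_stream(source):
    """Flatten source text into the list of token strings a text model would see."""
    layout_tokens = {
        tokenize.NEWLINE,
        tokenize.NL,
        tokenize.INDENT,
        tokenize.DEDENT,
        tokenize.COMMENT,
        tokenize.ENDMARKER,
    }
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in layout_tokens]


print(token_stream(SOURCE))
# ['total', '=', 'price', '*', '(', '1', '+', 'tax', ')']
```

Nothing in this flat list says that the parentheses delimit a sub-expression or that `*` applies to the whole parenthesized sum. A model trained on such sequences has to infer that structure from data, even though a parser could supply it for free.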
Graph Based Representations
One elegant family of representations is based on graphs. These representations rely on a local static analysis to extract control-flow and data-flow information and provide it as part of the input to the neural architecture. Gated graph neural networks (GGNNs) have been used for a wide range of tasks outside PL, and have been successfully used at Microsoft Research for code completion, finding var-misuse bugs, and code summarization.
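The sketch below shows, under heavy simplification, what turning a program into a graph can look like. It uses Python's standard `ast` module, keeps the syntax tree as `Child` edges, and adds a `NextOccurrence` edge between consecutive textual occurrences of the same identifier. That second edge type is only a crude stand-in for real data-flow edges, and both edge names and the `build_graph` helper are assumptions made for this example rather than the edge set used in the cited work.

```python
import ast

SOURCE = """
x = 1
y = x + 2
x = y
"""


def build_graph(source):
    """Return (labels, edges): node labels and typed edges over AST nodes."""
    tree = ast.parse(source)
    # Keep modules, statements and expressions; skip operator/context helpers.
    nodes = [n for n in ast.walk(tree)
             if isinstance(n, (ast.mod, ast.stmt, ast.expr))]
    index = {id(n): i for i, n in enumerate(nodes)}
    labels = [type(n).__name__ for n in nodes]

    edges = []
    # Syntactic edges: parent -> child in the AST.
    for node in nodes:
        for child in ast.iter_child_nodes(node):
            if id(child) in index:
                edges.append(("Child", index[id(node)], index[id(child)]))

    # Simplified flow-like edges: link each identifier occurrence to the
    # next textual occurrence of the same name.
    names = sorted((n for n in nodes if isinstance(n, ast.Name)),
                   key=lambda n: (n.lineno, n.col_offset))
    last_seen = {}
    for name in names:
        if name.id in last_seen:
            edges.append(("NextOccurrence", last_seen[name.id], index[id(name)]))
        last_seen[name.id] = index[id(name)]
    return labels, edges


labels, edges = build_graph(SOURCE)
for kind, src, dst in edges:
    if kind == "NextOccurrence":
        print(kind, labels[src], "->", labels[dst])
```

A graph neural network would then propagate information along these typed edges. A faithful data-flow analysis would also need to account for branches, loops and evaluation order, which this textual ordering ignores.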
Representations based on AST-Paths
The approach presented in Alon et al.’s PLDI’18 paper uses different path-based abstractions of the program’s abstract syntax tree. In analogy to learning natural language as a sequence of words and n-grams, the main idea in AST paths is to use sequences of nodes, as they exist in the AST, as the basic representation for code in a learning model. This family of path-based representations is natural, general, fully automatic, and works well across different tasks and programming languages.
The following is an illustration of the program’s abstract syntax tree (AST) and a path in the AST, connecting the two occurrences of the variable d.
The path from the first occurrence of the variable d to its second occurrence can be represented as:
SymbolRef ↑ UnaryPrefix! ↑ While ↓ If ↓ Assign= ↓ SymbolRef
This is an example of a pairwise path between leaves in the AST, but in general, the family of path-based representations contains n-wise paths, which do not necessarily span between leaves and do not necessarily contain all the nodes in between.
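A pairwise leaf-to-leaf path like the one above can be extracted mechanically. The following is a minimal sketch using Python's standard `ast` module rather than the JavaScript-style AST from the illustration, so the node names differ (`Name` instead of `SymbolRef`, `UnaryOp` instead of `UnaryPrefix!`). The input snippet is an assumed Python analogue of the illustrated program, and the helper names are hypothetical.

```python
import ast

# Assumed Python analogue of the illustrated program (not taken from the paper).
SOURCE = """
while not d:
    if x():
        d = True
"""


def parent_map(tree):
    """Map every AST node to its parent."""
    parents = {}
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            parents[child] = node
    return parents


def ancestors(node, parents):
    """Return the chain [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


def ast_path(start, end, parents):
    """Path from start up to the lowest common ancestor, then down to end."""
    up = ancestors(start, parents)
    down = ancestors(end, parents)
    down_set = set(down)
    lca_index = next(i for i, n in enumerate(up) if n in down_set)
    lca = up[lca_index]
    rising = up[:lca_index + 1]               # start ... common ancestor
    falling = down[:down.index(lca)][::-1]    # below common ancestor ... end
    text = " ↑ ".join(type(n).__name__ for n in rising)
    for node in falling:
        text += " ↓ " + type(node).__name__
    return text


tree = ast.parse(SOURCE)
parents = parent_map(tree)
occurrences = sorted(
    (n for n in ast.walk(tree) if isinstance(n, ast.Name) and n.id == "d"),
    key=lambda n: (n.lineno, n.col_offset),
)
first, second = occurrences
print(ast_path(first, second, parents))
# Name ↑ UnaryOp ↑ While ↓ If ↓ Assign ↓ Name
```

The printed path has the same shape as the one shown above: it climbs from the first occurrence of `d` to the enclosing loop and descends to the assignment. This sketch records only node types; operator details such as the `!` and `=` suffixes that appear in the path above are not included.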
Using a path-based representation has several major advantages:
- Paths are generated automatically: there is no need for manual design of features aiming to capture potentially interesting relationships between program elements. This approach extracts unexpectedly useful paths, without the need for an expert to design features. The researcher is only required to choose a subset of our proposed family of path-based representations.
- This representation is useful for any programming language, without the need to identify common patterns and nuances in each language.
- The same representation is useful for a variety of prediction tasks, by using it with off-the-shelf learning algorithms or by simply replacing the representation of program elements in existing models.
- The features capture long distance syntactic relationships between program elements, and may thus provide rich context for making a prediction.
- AST paths are purely syntactic, and do not require any semantic analysis.
- The method is local, and works on a small code fragment such as a single function, without requiring additional dependencies from the whole project as input.
An increasing body of work has explored machine learning to solve the above-listed programming problems using these various representations. More details are contained in the linked papers, and we will elaborate on them in the next post in this series.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.