Artificial intelligence and machine learning technologies are being used to automate many difficult tasks, such as image recognition, and are producing remarkable results. In the last few years, researchers and entrepreneurs have turned their attention to the task of programming. The hope is that by automating tedious and repetitive programming tasks, developers can focus on the more creative aspects of writing software.
In a previous post, we discussed a number of challenging tasks that can be automated by using machine-learning models to predict properties of programs. These tasks include semantic labeling of code snippets, captioning a block of code, generating code to complete a missing piece of a larger program, detecting mistakes or vulnerabilities in code, and searching for code based on a textual description. In that post we also considered the key decision of how to represent instances of the input space (i.e., code) to facilitate learning. In this post, we take the next step and explore how these representations are used in support of two of the above-listed tasks: semantic labeling of code and code captioning. The final post in the series will consider the remaining tasks.
Semantic Labeling of Code Snippets
Consider the following code snippet. This snippet only contains low-level assignments to arrays, but a human reading the code may (correctly) label it as performing the reverse operation. Our goal is to be able to predict such labels automatically.
The right-hand side of the figure shows the labels predicted automatically using code2vec. The most likely prediction (77.34%) is reverseArray. You can play with additional examples on the code2vec website and find more details about the approach in the POPL’19 paper. While the approach is general, as we explain later, here we demonstrate it on the problem of inferring meaningful method names, originally defined by Allamanis et al. in 2015 and extended by Allamanis et al. in 2016.
Intuitively, this problem is hard because it requires learning a correspondence between the entire content of a code snippet and a semantic label. That is, it requires aggregating possibly hundreds of expressions and statements from the snippet into a single, descriptive label.
Captioning Code Snippets
Consider the short code snippet below. The goal of code captioning is to assign a natural language caption that captures the task performed by the snippet.
For this example, the code2seq approach automatically predicts the caption “save bitmap to file”. You can play with additional examples on the code2seq website and find more details about the approach in the paper.
Intuitively, this task is harder than semantic labeling, as it requires the generation of a natural language sentence in addition to capturing (something about) the meaning of the code snippet.
Semantic Labeling of Code Snippets – code2vec: Learning Distributed Representations of Code
The POPL’19 paper by Alon et al. presented a framework for predicting program properties using neural networks. The main idea is a neural network that learns code embeddings – continuous distributed vector representations for code. The code embeddings allow us to model the correspondence between code snippets and labels in a natural and effective manner. By learning code embeddings, our long-term goal is to enable the application of neural techniques to a wide range of programming-language tasks. A live demo of the framework is available at https://code2vec.org.
The neural network architecture uses a representation of code snippets that leverages the structured nature of source code, and learns to aggregate multiple syntactic paths into a single vector. This ability is fundamental for the application of deep learning in programming languages. By analogy, word embeddings in natural language processing (NLP) sparked a revolution in the application of deep learning to NLP tasks.
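To make "syntactic paths" more concrete, here is a minimal sketch that enumerates leaf-to-leaf paths in an abstract syntax tree. It is an illustration under stated assumptions, not the code2vec extractor: it uses Python's standard `ast` module on a Python snippet (the work discussed here targets Java), the function name `extract_path_contexts` and the `^` / `_` notation for the upward and downward parts of a path are our own, and it applies no limit on path length or width.

```python
import ast
from itertools import combinations

def extract_path_contexts(source):
    """Return (start_token, path, end_token) triples for every pair of leaves."""
    tree = ast.parse(source)
    leaves = []  # each entry: (token text, list of AST nodes from root to leaf)

    def visit(node, trail):
        trail = trail + [node]
        if isinstance(node, ast.Name):          # identifier leaf
            leaves.append((node.id, trail))
        elif isinstance(node, ast.Constant):    # literal leaf
            leaves.append((repr(node.value), trail))
        else:
            for child in ast.iter_child_nodes(node):
                visit(child, trail)

    visit(tree, [])

    contexts = []
    for (tok_a, path_a), (tok_b, path_b) in combinations(leaves, 2):
        # Length of the shared root prefix; the last shared node is the
        # lowest common ancestor of the two leaves.
        k = 0
        while k < min(len(path_a), len(path_b)) and path_a[k] is path_b[k]:
            k += 1
        up = [type(n).__name__ for n in reversed(path_a[k:])]
        top = type(path_a[k - 1]).__name__
        down = [type(n).__name__ for n in path_b[k:]]
        path = "^".join(up + [top]) + "".join("_" + d for d in down)
        contexts.append((tok_a, path, tok_b))
    return contexts

for context in extract_path_contexts("x = y + 1"):
    print(context)
```

Each triple pairs two tokens with the chain of syntax nodes that connects them; a bag of such triples is the kind of input that the model learns to aggregate.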
The input to the model is a code snippet and a corresponding tag, label, caption, or name. This tag expresses the semantic property that we wish the network to model, for example, a tag or name that should be assigned to the snippet, or the name of the method, class, or project that the snippet was taken from.
Let C be the code snippet and L be the corresponding label or tag. The underlying hypothesis is that the distribution of labels can be inferred from syntactic paths in C. The model therefore attempts to learn the tag distribution, conditioned on the code: P(L|C).
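As a rough sketch of what modeling P(L|C) can look like in practice, assume the snippet C has already been summarized as a single vector and that every candidate label has a vector of its own. A softmax over the dot products then gives a distribution over labels. The vectors and the names `softmax` and `label_distribution` below are hypothetical and chosen by hand for illustration; in a trained model these values are learned.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_distribution(code_vector, label_vectors):
    """Return a dict mapping each label to its probability given the code vector."""
    labels = list(label_vectors)
    scores = [
        sum(c * w for c, w in zip(code_vector, label_vectors[label]))
        for label in labels
    ]
    return dict(zip(labels, softmax(scores)))

# Hypothetical 2-dimensional vectors, for illustration only.
code_vector = [1.0, 0.0]
label_vectors = {"reverseArray": [2.0, 0.0], "sort": [0.0, 2.0]}
probabilities = label_distribution(code_vector, label_vectors)
best_label = max(probabilities, key=probabilities.get)
print(best_label, probabilities)
```

The label whose vector is best aligned with the code vector receives the highest probability, which is how a ranked list of predictions with percentages can be produced.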
The problem is as follows: given an arbitrarily large number of context vectors, we need to aggregate them into a single vector. Two trivial approaches would be to pick only the single most important vector, or to use all of them by taking their plain average. Both alternatives have been shown to yield poor results.
For the full details of the model, see the POPL’19 paper (or better yet, the open-source implementation). At a high level, the key point is that a code snippet is composed of a bag of contexts, and each context is represented by a vector whose values are learned. The values of this vector serve two distinct goals:
- the semantic meaning of this context, and
- the amount of attention this context should get.
Our main observation is that all context vectors need to be used, but the model should learn how much focus to give each vector. This is done by learning how to average context vectors in a weighted manner. The weight of each vector is a normalized score computed from its dot product with a single global attention vector. The vector of each context and the attention vector are learned simultaneously, using the standard neural approach of backpropagation.
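The weighted averaging described above can be sketched in a few lines. This is a simplified illustration rather than the released implementation: the vectors are hypothetical and fixed by hand instead of being learned, scores are normalized with a softmax, and any learned transformation of the context vectors as well as the training loop are left out.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_contexts(context_vectors, attention_vector):
    """Return (code_vector, weights) for a bag of context vectors."""
    # Score each context by its dot product with the global attention vector.
    scores = [
        sum(c_i * a_i for c_i, a_i in zip(context, attention_vector))
        for context in context_vectors
    ]
    weights = softmax(scores)
    dim = len(attention_vector)
    # The code vector is the attention-weighted average of all contexts.
    code_vector = [
        sum(w * context[i] for w, context in zip(weights, context_vectors))
        for i in range(dim)
    ]
    return code_vector, weights

# Hypothetical 2-dimensional context vectors for three path-contexts.
contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention = [2.0, -1.0]
code_vector, weights = aggregate_contexts(contexts, attention)
print(weights, code_vector)
```

In this sketch, an all-zero attention vector gives every context the same weight, which reduces to plain averaging, while a very large attention vector pushes almost all of the weight onto one context. Learning the attention vector lets the model settle somewhere between these two extremes.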
Despite the “black-box” reputation of neural networks, our model is partially interpretable thanks to the attention mechanism, which allows us to visualize the distribution of weights over the bag of path-contexts. The following figure illustrates a few predictions, along with the paths that were given the most attention in each method. The width of each of the visualized paths is proportional to the attention weight that it was allocated. We note that in these figures, the path is represented only as a connecting line between tokens, while in fact it contains rich syntactic information which is not expressed properly in the figures.
Captioning Code Snippets – code2seq: Generating Sequences from Structured Representations of Code
The ICLR’19 paper by Alon et al. presented a framework for generating sequences, such as natural language descriptions, from code. The main idea is to break each representation used in code2vec into more fine-grained building blocks, which allows the model to generalize much better:
- AST paths, which were represented as monolithic symbols in code2vec, are broken into a sequence of nodes. The full path representation is obtained by reading these nodes one by one, using LSTMs.
- The target sequences are broken into words, rather than monolithic labels as in code2vec, and these words are predicted one-by-one using another LSTM. At each such decoding step, the network computes a different weighted average (attention) of the input paths, allowing it to “focus” on different aspects of the code while predicting every output word.
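The two decompositions in the list above can be illustrated with a small sketch. The helper names `split_subtokens` and `split_path` are hypothetical, as is the path notation that uses `^` and `_` as separators between node names. The LSTMs that would read and produce these sequences are not shown.

```python
import re

def split_subtokens(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]

def split_path(path):
    """Split a path written as 'Name^Assign_BinOp_Name' into its node names."""
    return re.split(r"[\^_]", path)

print(split_subtokens("saveBitmapToFile"))
print(split_path("Name^Assign_BinOp_Name"))
```

Working with words and nodes instead of whole labels and whole paths means that a word or path never seen as a single unit during training can still be assembled from familiar pieces.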
The code2seq architecture is based on the standard seq2seq architecture with attention, with the main difference that instead of attending to the source words, code2seq attends to the source paths. In the following example, the most-attended path in each prediction step is marked in the same color and number as the corresponding predicted word in the output sentence.
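The following sketch shows the idea of attending to paths separately at every decoding step. It is a simplification under stated assumptions: the path vectors and decoder states are hypothetical hand-picked numbers, whereas in the real model both come from learned LSTMs, and the plain dot-product score stands in for whatever learned scoring function the model uses.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(path_vectors, decoder_state):
    """Return (context_vector, weights) for one decoding step."""
    scores = [
        sum(p_i * s_i for p_i, s_i in zip(path, decoder_state))
        for path in path_vectors
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * path[i] for w, path in zip(weights, path_vectors))
        for i in range(dim)
    ]
    return context, weights

# Hypothetical encoded paths and one decoder state per output word.
path_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_states = [[4.0, 0.0], [0.0, 4.0]]

most_attended = []
for state in decoder_states:
    _, weights = attend(path_vectors, state)
    most_attended.append(weights.index(max(weights)))
print(most_attended)
```

Because the decoder state changes from one output word to the next, the weights change as well, so a different path can be the most attended one at each step.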
The code2seq model was demonstrated on the task of method name prediction in Java (in which it performed significantly better than code2vec); on the task of predicting StackOverflow natural language questions given their source code answers (first presented by Iyer et al. in 2016); and on the task of predicting documentation sentences (JavaDocs) given their Java methods. In all tasks, code2seq was shown to perform much better than strong Neural Machine Translation (NMT) models that treat these tasks as “translating” code, viewed as plain text, into natural language.
For additional details, see the online demo and the open-source repository.
In the final post in the series, we’ll consider the remaining tasks we introduced in the first post: generating code to complete a missing piece of a larger program, detecting mistakes or vulnerabilities in code, and searching for code based on a textual description.
Bio: Eran Yahav is an associate professor of Computer Science at the Technion, Israel, and the CTO of Codota. Eran loves program synthesis, machine learning on code, and static analysis.