There’s a new ecosystem of deep-learning-driven applications, sometimes called Software 2.0, that integrates neural networks into a variety of computational tasks. Such applications include image recognition, natural language processing, and other traditional machine learning tasks. These techniques have also grown to include more structured domains that programming languages researchers are familiar with, such as program analysis and program optimization.
With the success of these systems, a natural question is: what are the software development challenges for Software 2.0? Along with the many serious questions about interpreting, debugging, validating, and verifying these new systems is a major concern about their energy efficiency and performance. The techniques taken up by the ML community to address this concern bear a surprising similarity to those of the Approximate Computing community, a community with roots in programming languages that shares the aspiration of developing principled techniques for reasoning about otherwise uninterpretable computations.
Overparameterization
A key component of the success of deep learning techniques is overparameterization: this happens when a machine learning model — such as a neural network — has more parameters than data points.
Contemporary wisdom is that relatively simple optimization methods, like gradient descent, work for training neural networks on large, complicated datasets because overparameterization improves learning, both in the total accuracy achieved and in accuracy as a function of the number of training examples seen. For example, see work by Li and Liang, or Arora et al.
While overparameterization may be good for learning, it reduces performance, increases energy consumption, and decreases ubiquity (requiring significant resources from the user to execute the software). Therefore an ongoing systems research question is, how do we reduce overparameterization? While the machine learning community has developed a number of techniques to reduce overparameterization, many of these techniques are related to those in the Approximate Computing community.
Reducing Overparameterization
The ML community has developed a number of techniques that reduce parameter counts after training by more than 90% while still maintaining the accuracy of the neural network on the end task. The benefits are multi-fold: 1) reducing the representation size of the network reduces storage and communication costs, thus “distilling its knowledge”; 2) reducing parameters eliminates computation; and 3) the combination of effects together improves performance and energy consumption.
For example, the following code implements a neural network layer that computes a matrix-vector product of its internal weights (weights) with its m-dimensional input (x). The result is an n-dimensional vector passed through a Rectified Linear Unit activation function (max(0, output)). The n-dimensional result denotes the output of n neurons.
float x[] = { ... };           /* m-dimensional input vector */
float weights[n][m] = { ... }; /* n x m weight matrix */
float output[n] = { 0 };       /* n-dimensional output, zero-initialized */
for (int i = 0; i < n; ++i)
{
  for (int j = 0; j < m; ++j)
  {
    output[i] += weights[i][j] * x[j]; /* accumulate neuron i's dot product */
  }
}
return max(0, output); /* ReLU, applied elementwise to the output vector */
Quantization reduces the number of bits used to represent each weight and to compute each operation within the neural network’s computation. As an example, this program could be written to use 16-bit floating-point precision instead of 32-bit float types.
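As a minimal sketch of that idea (assuming a compiler and target that support the 16-bit _Float16 type; otherwise the weights could be stored in 16 bits and converted to float on load), the layer above might become:

/* Half-precision variant of the layer; assumes _Float16 support. */
_Float16 x[] = { ... };
_Float16 weights[n][m] = { ... };
_Float16 output[n] = { 0 };
for (int i = 0; i < n; ++i)
{
  for (int j = 0; j < m; ++j)
  {
    output[i] += weights[i][j] * x[j]; /* multiply-accumulate in 16-bit precision */
  }
}
return max(0, output); /* ReLU, as before */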
Pruning takes a trained large model and eliminates weights by, for example, removing the weights of the smallest magnitude. For this example program, this is equivalent to eliding a subset of the loop iterations based upon a pre-determined mask. Eliding iterations of the loop over j elides individual weights in each of the n neurons, while eliding iterations of the loop over i elides neurons in their entirety. Both options have been explored in the literature.
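A minimal sketch of the first option, assuming a hypothetical pre-computed array mask in which mask[i][j] == 0 marks a pruned weight:

/* Weight-level pruning: skip the multiply-accumulate for pruned weights. */
for (int i = 0; i < n; ++i)
{
  for (int j = 0; j < m; ++j)
  {
    if (mask[i][j])
    {
      output[i] += weights[i][j] * x[j];
    }
  }
}
return max(0, output); /* ReLU, as before */

In practice, the pruned weights would also be removed from storage (e.g., via a sparse representation) rather than merely skipped.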
Distillation takes a large, trained model and trains a smaller network to mimic the outputs of this model. In this example, this entire layer could be substituted with an alternative implementation. For example, this computation is no more than a matrix-vector product, and therefore it is possible to accelerate it by instead learning a fast, low-rank approximation of weights and computing with that.
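As a rough sketch of such a substitution, suppose weights is replaced offline by the product of two hypothetical low-rank factors U (n x k) and V (k x m) for some small rank k; the layer then costs roughly k*(n + m) multiply-accumulates instead of n*m:

/* Low-rank substitute for the layer: weights is approximated by U * V. */
float t[k] = { 0 };              /* temporary: t = V * x */
for (int i = 0; i < k; ++i)
{
  for (int j = 0; j < m; ++j)
  {
    t[i] += V[i][j] * x[j];
  }
}
for (int i = 0; i < n; ++i)
{
  for (int j = 0; j < k; ++j)
  {
    output[i] += U[i][j] * t[j]; /* output approximates weights * x */
  }
}
return max(0, output); /* ReLU, as before */

The factors U and V could come from, e.g., a truncated SVD of weights or from training the smaller model directly.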
Approximate Computing
Although these techniques may seem restricted to the soft, approximate nature of deep learning, the programming languages, computer architecture, and systems communities have for some time been exploring these ideas for more traditional programs. These lines of work are collectively referred to as approximate computing, and the researchers behind them make up a thriving community. For example, the SIGPLAN Workshop on Approximate Computing (WAX) has been running since 2014 and draws participation from SIGPLAN and SIGARCH members.
Each of the techniques mentioned above, though specialized to deep learning, has a direct analogue to a technique in approximate computing developed by researchers in the PL/SE/Architecture communities:
Quantization is simply floating-point precision tuning, i.e., choosing less precise arithmetic as long as the desired accuracy bound permits it. Some exemplars include work by Darulova and Kuncak and Chiang et al (both appearing at POPL), Rubio-González et al (appearing at SC) and Schkufza et al (appearing at PLDI).
Pruning is simply loop perforation, which involves eliding computations entirely, e.g., by skipping loop iterations (see the sketch after this list). A fair amount of research has considered the performance/accuracy tradeoff of such approaches, including work by Misailovic et al (from ICSE), and Sidiroglou et al (from FSE).
Distillation is function substitution, i.e., replacing entire sub-computations whole cloth with less expensive, approximate implementations. Works by Esmaeilzadeh et al (from MICRO) and Hoffman et al (from ASPLOS) consider this technique.
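For a flavor of loop perforation outside of neural networks, here is a minimal sketch on a generic reduction (f, data, and N are hypothetical): the loop executes only every other iteration and then rescales the result to compensate.

float sum = 0.0f;
for (int i = 0; i < N; i += 2) /* perforation: skip every other iteration */
{
  sum += f(data[i]);
}
sum *= 2.0f; /* extrapolate to account for the skipped iterations */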
Shared Questions
A deep challenge with approximate computing in traditional programs is that these techniques change the semantics of programs, which raises a number of new and fundamental questions. For example, what is the probability that the resulting program will produce the same result as the original program? How much do the results differ from those produced by the original program? And is the resulting program safe and secure?
These questions certainly must be considered for quantization, pruning, and distillation. What is interesting, though, is that many of these questions must also be asked of neural networks to begin with, even in their overparameterized form. Once again, the Approximate Computing community’s techniques for reasoning about approximate programs may have analogues in the ML community’s techniques for reasoning about neural networks. Of note are the Approximate Computing community’s efforts on reasoning about Reliability, Robustness, Accuracy and Safety, and Generalization.
This post is excerpted and edited from the author’s SNAPL 2019 paper, which contains more discussion, examples, and references. Readers can also find a comprehensive survey of Approximate Computing techniques, challenges, and future directions here.
Bio: Michael Carbin is an Assistant Professor of Electrical Engineering and Computer Science at MIT. He is interested in the semantics, design, and implementation of programming systems that operate in the presence of uncertainty: unreliable hardware, approximating compilers and runtimes, and nondeterministic/probabilistic environments.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.