PL Perspectives

Perspectives on computing and technology from and for those with an interest in programming languages.

Probabilistic modeling is a cornerstone of modern machine learning and artificial intelligence. Currently probabilistic models are typically maintained by a small number of domain experts with deep knowledge in how the model works. As probabilistic modeling proliferates in applications, a natural question arises: how can we better integrate probabilistic models into the modern software-engineering stack

Probabilistic programming languages (PPLs) give an answer to this question: they turn a programming language into a probabilistic modeling language. This lets programmers use their well-honed programming skills and intuitions to develop and maintain probabilistic models, expanding the domain of model builders and maintainers.

The central challenge for any probabilistic programming system is scaling up probabilistic inference: computing the probability distribution of the value returned by the program. Probabilistic inference is a fantastically powerful general-purpose reasoning tool with countless applications in machine learning and probabilistic verification, including verification of probabilistic networks, natural language processing, and probabilistic graphical models. Dice is our approach to scaling (exact) probabilistic inference. In this blog post, we describe how it works; it is a summary of our SPLASH/OOPSLA 2020 Distinguished Paper, Scaling Exact Inference for Discrete Probabilistic Programs.

Dice

We approach the problem of scaling probabilistic inference in part as one of language design: dice focuses on a particular class of probabilistic programs, programs that are discrete and finite. Why this focus? There are two reasons. 

First, in practice, there are many interesting finite and discrete probabilistic programs: examples that we discuss in the main paper include text analysis such as automatic frequency analysis of simple ciphers, verification of probabilistic networks, and examples from discrete graphical models. The second reason is that many existing and widely used PPLs struggle with discreteness: discreteness breaks many of the assumptions that are required by powerful approximate inference algorithms like Hamiltonian Monte-Carlo and variational inference.

Example

What does a dice program look like? Let’s work through a simple example. The dice language has a few different sources of probabilistic uncertainty: for instance, the syntax “flip 0.4” represents a coin flip that is true with probability 0.4 and false with probability 0.6. We can use this “flip” syntax naturally in the following dice program:

let x = flip 0.1 in
let y = if x then flip 0.2 else flip 0.3 in
let z = if y then flip 0.4 else flip 0.5 in
z

For the above program, we can compute the probability that the value “true” is returned as the following sum of products:

\displaystyle 0.1 \times 0.2 \times 0.4 + 0.1 \times 0.8 \times 0.5 + 0.9 \times 0.3 \times 0.4 + 0.9 \times 0.7 \times 0.5 = 0.471

The above sum simply enumerates all possible assignments to the random variables in the program: for instance, \displaystyle 0.1 \times 0.2 \times 0.4 represents the probability that x is true, y is true, and z is true. This exhaustive enumeration approach to inference is simple, but its fatal flaw is that it cannot scale to large programs: the number of terms in the sum is exponential in the number of flips. In general, scaling inference is the gatekeeper for the broad application of PPLs: finding new and faster approaches to probabilistic inference is a grand challenge for the field.

Factorized Inference

In order to scale inference to large and interesting programs, we will need to do better than exhaustive enumeration. Dice takes a new approach to exact inference in discrete probabilistic programs based on factorization

Avoiding Redundant Computation

The key idea behind dice inference begins with an observation about the sum-of-all-paths expression in the first section: it has lots of redundant computation. Concretely, we can refactorize this sum into a different equivalent form:

\displaystyle 0.1 \times (0.2 \times 0.4 + 0.8 \times 0.5) + 0.9 \times (0.3 \times 0.4 + 0.7 \times 0.5) = 0.471

For this small example, the savings may seem modest. But as the program grows in size, these savings add up to orders-of-magnitude improvements

In the main paper we show that exploiting factorization gives asymptotic improvements on examples from text analysis and network verification, from exponential to linear time. This factorization occurs because the program has a particular data-flow property: the variable z does not depend on the variable x if one knows y’s value. This is known as conditional independence and is one form of structure that dice will exploit in order to factorize the inference computation. The question here is: how does dice find and exploit such factorizations?

Binary Decision Diagrams (BDDs)

At a high level, dice compiles probabilistic programs to binary decision diagram (BDD) in order to perform inference by weighted model counting. This compilation is performed modularly — each function is compiled to a BDD separately, and subsequently reused at call-site — and exploits the compositional construction of BDDs to remain compact and efficiently manipulable. Once this compilation is complete, inference is linear time in the size of the BDD. Ultimately, dice strikes a unique balance between a rich variety of program features — including functions, integer operations, control-flow, and bounded iteration, and Bayesian conditioning — and scalable inference. We show that dice can scale to extremely large probabilistic programs with hundreds of thousands of random variables, and can even compete with specialized Bayesian network inference algorithms on their own benchmarks. In addition, we prove this inference algorithm correct with respect to a denotational semantics.

A binary decision diagram (BDD) and associated weighted model counts for performing inference on the example program.

Using BDDs for Inference

Now we will expand on these high-level ideas. The main idea is to represent the state space of the program as a BDD, which makes subsequent inference computations fast. The figure shows the compiled BDD for the running example program. BDDs are read top-down. Each logical variable in the BDD corresponds with an assignment to a flip in the original program: f_1 is true if and only if “flip 0.1” is true, and so on.  A solid edge denotes the case where the parent variable is true and a dotted edge denotes the case where the parent variable is false. Then, a model of this BDD — a set of assignments that cause it to evaluate to true — directly corresponds with a set of assignments to flips that cause the program to return true.

This BDD exploits the program’s conditional independence to efficiently produce a compact canonical representation of the state space. Specifically, there is a single subtree for f_4, which is shared by both the path coming from f_2 and the path coming from f_3, and similarly for f_5. These shared sub-trees are induced by conditional independence: fixing y to the value true — and hence guaranteeing that a path to f_4 is taken in the BDD — screens off the effect of x on z, and hence reduces both the size of the final BDD and the cost of constructing it. The BDD construction process automatically finds and exploits such factorization opportunities by caching and reusing repetitious logical sub-functions.

Once the BDD is constructed we can perform inference in linear time in its size. First, we associate a weight to each assignment to each logical variable in the BDD: for instance, w(f_1) = 0.1, and w(\bar{f}_1) = 0.9. The weight of a model is the product of the weights of the variables that constitute it: for instance, w(f_1f_2f_4) = 0.1 \times 0.2 \times 0.4, which is the first term in the unfactorized sum above. Then, the probability that the program returns true is equal to the weighted model count (WMC): the sum total of the weights of all models. 

Now, inference has been reduced to WMC. By virtue of the shared sub-functions, the BDD in Figure 1 directly describes how to compute the WMC in the factorized manner. Observe that each node is annotated with the weighted model count, which is computed in linear time in a single bottom-up pass of the BDD. For instance, the WMC at node f_2 is given by taking the weighted sum of the WMC of its children, $0.2 \times 0.4 + 0.8 \times 0.5$. Finally, the sum taken at the root of the BDD (the node f_1) is exactly the factorized sum.

Perspectives & Challenges

Probabilistic inference, at its core, is a challenging and increasingly important program analysis task. We want PPLs to be expressive, but we also want inference to be fast.

Dice’s design struck a specific tradeoff, using finite discreteness to enable a fast inference algorithm despite it limiting expressiveness. Long term, making progress on automated probabilistic program inference will involve a delicate interplay between the modeling language, the inference algorithm, and the programmer. Our job as probabilistic programming language designers is to make this interplay possible. Doing so involves answering some big questions:

  • How can we design languages that adequately trade off expressiveness and tractability for inference?
  • What are new inference algorithms that broaden the landscape of languages that afford tractable inference?
  • What are new and interesting queries that we can ask of probabilistic programs, beyond probabilistic inference?

Ultimately answers to these questions will push PPLs towards a future where they are more automated and more user-friendly, enhancing the ability of society at large to make rational scientific decisions.

Bio: Steven Holtzen is a PhD. candidate at the University of California, Los Angeles (UCLA). Guy Van den Broeck and Todd Millstein are professors at UCLA. 

Acknowledgments: The authors would like to thank Mike Hicks for helpful comments on this post. 

Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.