Select Page

# PL Perspectives

Perspectives on computing and technology from and for those with an interest in programming languages.

Probabilistic programs are ubiquitous in modern computing, with applications such as randomized algorithms, machine learning and statistical modelling. Techniques and principles to formally verify classical (deterministic) programs often fail in counterintuitive ways when applied directly to probabilistic programs. This motivates the need for specialized techniques to reason about these programs.

Relational program properties, as opposed to classical properties, refer to pairs of executions of a program. Sensitivity properties are a particular class of relational properties. Sensitivity answers a very natural question: if I make a small change to the input of my program, how much can the output change? Sensitivity properties can be defined by choosing a distance $d_I$ over the input space and a distance $d_O$ over the output space such that the program, such that for any pair of inputs $i_1, i_2$ and their respective outputs $o_1, o_2$ we have $d_O(o_1,o_2) \leq d_I(i_1,i_2)$.

Our distinguished POPL 2021 paper, A Pre-expectation Calculus for Probabilistic Sensitivity explores the problem of reasoning about the sensitivity of probabilistic programs, whose behavior may depend on randomly sampled variables, and therefore their observable output defines a probability distribution over final states. In this post we will first begin by justifying the notion of sensitivity we use, then discuss how we reason about it and finally conclude by presenting some applications.

# Notions of probabilistic sensitivity

While it is clear how to define sensitivity for deterministic programs, defining it for probabilistic programs takes more care. As is traditional in probabilistic program semantics, we can model the output of a probabilistic program as a probability distribution. Then, sensitivity properties can be expressed with respect to a distance between probability distributions. However, in practice there are several probabilistic distances that are used.

Consider for instance the following program:

int foo(p):
x = 0;
t = 0;
while (t < T){
b = flip(p);
x = x + b;
t = t + 1;
}
return x;


This program samples a random bit T times and counts the total number of 1s. The probabilistic behavior is induced by the command flip(p), which returns 1 with probability p and 0 with probability 1-p. We are interested in studying the sensitivity of the output x with respect to the parameter p. For simplicity, we assume that the number of iterations T is a global constant.

Recall that a discrete probability distribution over a countable set $X$ is a function $f \colon X \to [0,1]$ such that $\sum_{x\in X} f(x) = 1$. For our program we could define sensitivity using for instance the following distances:

• Total Variation distance, or TV distance. Given two distributions, the TV distance measures how much the probability of any event can differ between them. It is defined by the sum below: $TV(f_1,f_2) = \frac{1}{2} \sum_{ x \in X} | f_1(x) - f_2(x) |$

The TV distance can be defined on every output space $X$, since the structure of $X$ is not used in the definition. As a downside, it is often complicated to compute the exact TV distance, since one needs to evaluate the sum above.

• Absolute expected difference. For programs with numeric output, as our example, sensitivity can also be defined from the absolute difference between the expected outputs: $AED(f_1,f_2) = |\sum_{ x \in X} x \cdot (f_1(x)-f_2(x)) |$

(Note: This is actually not a distance, it is a pseudodistance, but it also can be used to define sensitivity). For instance, in the example above, the absolute expected difference between the outputs of foo(p1) and foo(p2) is T * |p1 - p2|, since the loop runs T times, and the probability of increasing the counter is p1 on the first run and p2 on the second. For a more general output space $X$, we may want to reason about the absolute difference between the expected values of a function $g: X \to [0,\infty]$ on the two output distributions of the programs. These families of distances are used for instance to define stability of machine learning algorithms (see below).

While sensitivity properties for probabilistic programs are defined in various manners, it would not be convenient to have different methods to verify different notions of sensitivity. Instead, we would like to have a more general notion of distance that we can reason about, and that can then be instantiated to particular distances between probability distributions.

# The Kantorovich distance

With the goal of generality in mind, we develop a framework to reason about a distance known as the Kantorovich distance (sometimes also referred to as the Wasserstein distance). This distance constitutes a canonical lifting from a base distance $d \colon X \times X \to [0,\infty]$ to a distance $K(d): {\rm Distr(X)}\times {\rm Distr}(X) \to [0,\infty]$, where ${\rm Distr}(X)$ denotes the space of discrete probability distributions over $X$.

The definition of the Kantorovich distance is somewhat technical, so instead we give an intuition and explain how it can be used. The Kantorovich distance is sometimes known as the earth-mover distance: if we visualize a probability distribution $f$ by placing a pile of sand of size $f(x_1)$ on top of each point $x_1$, then $K(d)(f_1,f_2)$ corresponds to the cost of the optimal way of transforming $f_1$ into $f_2$ by moving sand, where the cost of moving $a$ units of sand from $x_1$ to $x_2$ is $a \cdot d(x_1,x_2)$.

The Kantorovich distance is fairly versatile. For example, the TV distance can be reconstructed as a particular case of the Kantorovich distance built from the discrete distance, which assigns a distance of 1 to all pairs of distinct points. And while in general it is not possible to define a distance on the base space that lifts to the absolute expected difference, the Kantorovich distance built from $d(x_1,x_2) = |g(x_1) - g(x_2)|$ provides an upper bound on the absolute expected difference defined from $g$. Moreover, in the degenerate case of distributions that assign all their mass to a single point, the Kantorovich distance $K(d)$ coincides with $d$. This allows us to cast sensitivity properties of deterministic programs as particular cases of sensitivity of probabilistic programs.

# A relational pre-expectation transformer

Sensitivity is not only a relational property, but it is also a quantitative property. There is already significant previous work in verifying quantitative properties of probabilistic programs. The most related to our framework is the work on weakest pre-expectations by Morgan and McIver, which uses a predicate transformer to compute the expected value of some real-valued function $f$ (called an expectation) over the output distribution of an imperative probabilistic program $C$. This transformer can be seen as a quantitative generalization of Dijkstra’s weakest precondition transformer.

The initial observation behind our work is that predicate transformers can also be generalized to reason about sensitivity of probabilistic programs using the Kantorovich distance. More precisely, we can construct a transformer that receives as an input a distance $d$ and an imperative probabilistic program $C$ and two initial memories and it provides an upper bound of the Kantorovich distance $K(d)$ between the two output distributions of $C$.

Let us be more formal. A relational expectation is a map $E \colon M \times M \to [0,\infty]$ from pairs of memories to the positive reals (note that distances are particular cases of relational expectations). In our paper we introduce a novel relational pre-expectation transformer, ${\rm rpe}$. It receives a program $C$, and a relational expectation $d$, and it returns a relational expectation ${\rm rpe}(C,d)$ such that, for any pair of input memories $m_1,m_2$, it satisfies the inequality $K(d)(C(m_1),C(m_2)) \leq {\rm rpe}(C,d)(m_1,m_2)$, where $C$ on the left side is interpreted as a map $C \colon M \to {\rm Distr}(M)$.

This transformer is a direct analogue of other predicate transformers. It is defined by induction on the structure of the program, i.e., we define ${\rm rpe}(C,d)$ for every atomic command $C$, and for composition we have ${\rm rpe}(C_1; C_2, e) = {\rm rpe}(C_1,{\rm rpe}(C_2,e))$ . The ${\rm rpe}$ for a while loop is defined by a fixed point equation and can be approximated by using a generalized notion of invariants.

As with other transformers, such as the weakest pre-expectation or the expected runtime transformer, most of the explicit reasoning about probabilities is abstracted away, and is only needed for sampling commands. By choosing an appropriate distance $d$ between memories, our transformer can be used to reason about several notions of probabilistic sensitivity.

Finally, readers may have noticed that we have not claimed anything about the precision of the sensitivity bounds our transformer computes. Indeed, ${\rm rpe}(C,f)$ does not always provide the tightest bound possible. This is due to some technicalities about the way the Kantorovich distance composes. Nonetheless, we have not observed significant gaps in practice between our analysis and other analyses of probabilistic sensitivity.

# Applications: Stability of stochastic gradient descent and convergence of Markov chains

One of the core algorithms in machine learning is the gradient descent algorithm. This algorithm uses a data set of labelled examples to train a classifier to label new, unseen examples. The classical deterministic algorithm iterates over all examples in the training set and adjusts the parameters of the classifier so that they minimize a loss function, which measures how good the classifier is at labeling examples. These datasets, however, are often massive, and training a classifier can be computationally expensive.

The idea behind the stochastic gradient descent algorithm (SGD) is that we can instead choose a few examples at random, train the classifier only with these examples, and still obtain a classifier that performs similarly as another trained with the full data set.

The SGD algorithm satisfies a property called stability: when running SGD on two adjacent training sets (meaning that they differ in a single data point), the resulting classifiers have a similar expected loss in new examples. This means that it generalizes: it should perform well on unseen examples and it limits how much it can overfit the training set.

This is a sensitivity property: it can be specified by two distances. On the input space, we can use an adjacency distance measuring the number of differing examples between two data sets. Between the outputs, we define a distance measuring the absolute difference between expected losses. This property can then be verified by our expectation transformer, as we show in our paper.

Another interesting application of our technique is verifying the rate of convergence of Markov chains. This is also a case of sensitivity, since it depends on the distance to a stationary distribution. This has several interesting applications, such as determining the efficiency of algorithms based on Markov Chain Montecarlo

If you are interested in seeing more details and applications, please read our distinguished POPL 2021 paper.

Bio: Alejandro Aguirre is a postdoc at Aarhus University in the Logic and Semantics group. He works on developing logics to reason about probabilistic programs, with an emphasis on relational properties.

Acknowledgements: The author would like to thank Gilles Barthe, Justin Hsu and Todd Millstein for the helpful feedback on preliminary versions of this post. This post is based on joint work with Gilles Barthe, Justin Hsu, Benjamin Kaminski, Joost-Pieter Katoen, and Christoph Matheja.

Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.