Hardware is more parallel than ever. For example, processors today have as many as 64 cores. To write parallel programs for such hardware, researchers have converged on using implicit parallelism, where the programmer expresses all opportunities for parallelism, and the compiler and run-time system then work together to manage parallel execution, relieving the programmer from an otherwise tedious and difficult task. Using functional languages also helps the programmer with an important safety concern—data races—by allowing greater control over effects, and even hiding effects behind pure interfaces that are safe for parallelism.
Implicitly-parallel functional programming can therefore be a game changer for parallelism, but there is the elephant in the room: poor performance. The primary reason for poor performance is memory. In particular, functional programs allocate memory at a high rate, and this rate is further increased by parallelism, where multiple cores can allocate memory at the same time. This high allocation rate is beyond what can be handled efficiently by traditional memory managers.
We believe that it is possible to solve this problem. We want to make the elephant vanish, but with no tricks like Houdini: our goal is to make the elephant vanish for real.
We have been working on this problem by utilizing a memory property of parallel functional programs called disentanglement. (It appears that we are the first to discover this property.) Using disentanglement, we partition memory into heaps and distribute heaps among threads with a heap scheduler. Each thread manages its assigned heap independently, in parallel, and without communicating with other threads. In our POPL 2020 paper, we formulated disentanglement and proposed a memory partitioning strategy. In our POPL 2021 paper (which won a Distinguished Paper award), we proposed a specific heap scheduler and showed that it yields provable work and space bounds. We implemented these techniques in the MPL compiler for parallel ML and were able to obtain impressive speedups: up to 50x on 70 cores with respect to MLton (our sequential baseline), as shown in the figure on the right. MPL also appears to compete well with procedural languages such as Java, Go, and even C/C++, in terms of both time and space. In comparison to other functional languages, a recent paper from ICFP 2021 measures that MPL’s parallel performance is on average 2x faster than both multicore OCaml and parallel Haskell (GHC).
The Vanishing Elephant
What is disentanglement?
Disentanglement is a memory property that enforces a disciplined use of effects. It holds for an execution if concurrently executing threads are (roughly speaking) “oblivious” to each other’s allocations. More specifically, for any two concurrent threads A and B, if thread A acquires a reference to data allocated by thread B, then we say the two threads are entangled. This motivates the name: disentanglement ensures freedom from entanglement.
Many parallel programs are naturally disentangled, or can be made so with only small changes to allocation patterns, partly because entanglement is often indicative of a data race. Note however that disentanglement does not outlaw data races—it only restricts a thread to access objects allocated before (as defined by sequential dependencies) the thread started, in addition to those objects allocated by the thread itself. For example, two concurrently executing threads can simultaneously inspect and update objects allocated by the parent thread that spawned them. Disentanglement differs from separation logic in this respect: it does not insist on partitioning objects into disjoint heaps, and allows access to any previously allocated object (as determined by sequential and parallel dependencies). We view disentanglement as generalizing the notion of “time” as it relates to memory for parallel programs: in both sequential and disentangled parallel programs, an instruction can only access objects that are allocated before it.
Disentanglement opens a new perspective on parallel memory management: rather than viewing memory as a global shared resource, we can instead organize memory as a distributed structure that mirrors (i.e., corresponds with) the structure of parallelism in the program. In a fork-join program, this structure is a dynamic tree or “hierarchy”: tasks may spawn two or more children to execute in parallel, and must wait for all children to complete before continuing.
As we discuss next in more detail, disentangling memory enables naturally parallel memory management. For example, the leaves of the tree are independent of each other, allowing for allocations and garbage collections to proceed entirely in parallel.
Hierarchical memory management
Hierarchical memory management takes advantage of disentanglement by organizing memory into a tree. The idea is to create dynamic tree of heaps called the heap hierarchy, where we give each task its own local heap for allocations. As the computation progresses, the heap hierarchy folds and unfolds, mirroring the tree of tasks: when a task is forked, it is given a fresh, empty heap; when a task joins its parent, its heap is also merged into its parent.
In the heap hierarchy, disentanglement guarantees a crucial property: absence of cross-pointers. That is, every object in memory might have pointers to objects above or below in the tree, but never across.
The absence of cross-pointers unlocks parallel garbage collection: leaf heaps can be collected completely independently in parallel, without any synchronization, using either moving or non-moving techniques. Internal heaps can be collected independently, too, although with a bit of bookkeeping. The idea is to “take a snapshot” at the moment a heap becomes internal (which occurs when its corresponding task spawns new subtasks) and use this snapshot as a conservative root-set for garbage collection.
With disentanglement and hierarchical memory management, the run-time system can reclaim objects from any heap independently with little-to-no synchronization. But, now that the memory is organized into separate heaps, reclaiming all garbage precisely and efficiently becomes more difficult. Intuitively, there are two reasons why. First, the reclamation algorithm might work with outdated information (i.e., root set), which leads to uncollected garbage. Second, because the reclamation algorithm operates on a small region of memory (a heap), it might waste work by triggering too frequently.
To solve these problems we developed a heap scheduler which clusters heaps so that larger amounts of memory can be collected precisely and efficiently. The idea is to assign a cluster of heaps to each processor dynamically based on task scheduling decisions, and allow each processor to garbage-collect its cluster. When clustering, size matters. On the one hand, we could have lots of small clusters, which would increase parallelism, but lead to wasted work and space. On the other hand, we could use a few large clusters, which would improve work and space but lead to less parallelism. We aim for the sweet spot: less total work and space, and lots of parallelism.
Showing some spine. Our heap scheduling algorithm clusters the “spines” of the hierarchy and assigns each spine to a processor: in each spine, one child of every parent is always on the same processor, so the pointers within spines are precise. Because each processor owns its own spine, it can operate on the spine independently. The heap scheduler creates spines by “listening” to the task scheduler, e.g., a work-stealing scheduler. When a processor (worker) steals a child, a new heap is created and assigned to the thief. When tasks have completed and are ready to join, the heap scheduler joins the corresponding heaps and assigns the resulting heap to whichever processor executes the continuation. This way, tasks and their corresponding heaps are always scheduled in tandem.
We analyze work and space of parallel runs by relating them to sequential executions. The key insight is that each spine corresponds exactly to a segment of a sequential execution. We can therefore bound space usage by allowing each processor to collect its own spine using a standard amortization-based policy (e.g., after performing GC on a spine, a processor waits until its spine is at least twice as large before triggering another GC). On P processors, this leads to a worst-case bound of approximately P times the sequential space overall.
How well does MPL work outside the research lab? At Carnegie Mellon University, the introductory data structures and algorithms course (with over 500 students per year, mostly sophomore undergraduates) covers parallel and sequential algorithms together. In this course, we use Parallel ML with our MPL compiler as the implementation infrastructure. We provide students with their own dedicated 64-core machines for experimentation (via a cloud platform) and many students observe significant speedups in practice. The programming assignments include parallel algorithms with divide-and-conquer, dynamic programming, and graphs. Because the students write functional programs, they usually do not have to worry about race-conditions at all.
Using functional programming with our MPL compiler, undergraduate students learn parallel algorithms and implement them in practice without experiencing additional difficulties due to parallelism. This contrasts with our prior experience using procedural languages. In particular, when we used C and C++ for some assignments, we found that a majority of our students spent many hours debugging race conditions, which does not make meaningful contributions to the learning goals of an algorithms class.
Researchers have for decades extolled the benefits of functional programming for parallelism, which has become an important and relevant problem. Functional parallel programs, however, have traditionally suffered from poor performance in practice, primarily because automatic memory management is difficult to scale. Our research has made significant progress on this challenge by identifying and exploiting a memory property of parallel programs, called disentanglement, to support parallel memory management by partitioning the memory into smaller heaps that operate independently. By using an insight on clustering heaps, we are able to establish work and space bounds. Our implementation experience with the MPL compiler shows that these techniques are also highly practical, yielding good performance and scalability. Research thus far shows that disentanglement is a promising approach to managing memory but we seem to have barely scratched the surface. Future research directions include generalizing the technique beyond fork-join program, allowing and managing entanglement without harming the benefits of disentanglement, and improving resource bounds.
Bios: Umut Acar is an associate professor at Carnegie Mellon University; Jatin Arora and Sam Westrick are his PhD students there. Their research focuses on the theory and practice of programming, including the design, analysis, and implementation of programming languages and algorithms.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.