In his POPL 2021 keynote, Rachid Guerraoui tells the story of “the lost universality”, which happened when computing had to “go big from single to several machines that communicate via message passing” in order to tolerate failures via replication. The move brought up the need for many machines to reach consensus. But systems that rely on consensus must take care: reaching consensus can have prohibitive costs in the presence of partition failures. Yet, ad-hoc dropping of consensus is not the answer either; there are other ways to mitigate the downsides. Sometimes consensus can be avoided entirely, e.g., for asynchronous problems such as the payment problem in blockchains and the learning problem. Other times, the risk of partition failures is low, e.g., in data centers with reliable networks, so inexpensive algorithms can yield almost-always consistent and available results. Finally, in mission-critical parts of a system, consensus simply cannot be circumvented.
There is a programming/design question here: how to decide whether consensus is needed and, if yes, which kind? Guerraoui closes the keynote with a call to the PL community to help answer this question “to write better applications for the internet machine”.
In this blog post, I want to convey the message that PL research indeed has to be in the driver’s seat on the road to the “universal internet machine”. No matter how many clever distributed protocols there are, we probably won’t get there safely and productively unless we offer application developers abstractions that let them expose “just enough” domain knowledge to the language machinery, so that the latter can automate much of the low-level reasoning about complex operation interleavings that is hard to do and easy to get wrong, thus providing the convenience of automatically maintained correctness. The post aims to emphasize the importance of PL research for distributed systems and to encourage more PL research in this domain. I presented some of its content recently in a keynote at DisCoTec 2021.
The Challenge of Maintaining Distributed Application Consistency
Today’s internet machines have grown in complexity compared to traditional machines, the result of “going big for increasing robustness and scalability”. Guerraoui presents a scenario of distributed clouds and clusters under the control of systems experts in tech companies. But today’s internet machine consists of millions of devices in the wild, connected to each other and to web, mobile, and cloud platforms via programmable networks, and it runs massively distributed applications such as social apps (e.g., instant messengers, multiplayer games), collaborative workflows (e.g., Google Docs, Figma, Trello), real-time businesses (e.g., cross-organizational workflows, collaborative condition monitoring), decentralized data ecosystems processing data owned by different stakeholders, and more.
These machines are not only “big”: they are also “out of control” in two ways. First, they are out of the control of distributed-systems experts, woven into the physical and social worlds. Distribution is not only a technical means to an end (availability, scalability); it is an inherent feature of the world, as people, devices, and data sources are distributed. Likewise, concurrency arises because people and devices act inherently autonomously in time and space. Second, computation flows are driven by external events and interactions, not controlled by the applications. The open, dynamic, and interactive nature of massively distributed applications makes consensus even more difficult. Today, the problem of achieving distributed consensus efficiently is often circumvented by employing centralized state management: apps are typically just thin interfaces to processing logic managed in the central servers/clouds of platform providers. But centralization is not a good fit for offline availability, low latency, proper use of communication resources, user-controlled data privacy, and more. We need to replace centralized state management with decentralized consistency management as much as possible, employing no consistency or eventual consistency whenever possible for performance, and relying on consensus only where necessary for universality.
Data storage systems exist that offer multiple consistency levels, such as Gemini or Pileus. But here is the problem: they leave it to developers to select and integrate the right consistency level for each specific part of the application. Deciding where to employ which consistency level is challenging even for small, self-contained applications, and composing application parts with different guarantees can have unanticipated results. Instead, developers should declare the safety properties to uphold, from which decisions about consistency levels can be automatically derived to optimize for correctness, performance, and productivity. I argue that PLs can and should come to the rescue here and perform this automation.
The quest is for language machinery that is guided by application semantics to automatically reason about where to employ consensus, thereby providing well-defined and formally proven consistency guarantees “out of the box”. What programming abstractions should this machinery offer to developers to express the application semantics? Several proposals for programming languages and frameworks target this question, differing in how they achieve consistency and in the level of developer involvement they expect.
Programmatic Choice of Consistency Levels
Approaches in this category expose consistency levels as language abstractions. With RedBlue consistency, programmers manually label their operations as either blue (eventually consistent) or red (strongly consistent); Myter et al., Milano and Myers, and Köhler et al. let developers choose between consistency and availability. Overall, these approaches expect developers to decide which operation needs which consistency level, a non-trivial and error-prone decision. A language-based solution should relieve developers of this burden and automatically infer the necessary consensus, which the approaches in the next two categories seek to achieve.
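To make the labeling burden concrete, here is a minimal Python sketch of the RedBlue idea, assuming a bank-account example with a non-negative-balance invariant; the class and method names are hypothetical and do not reflect the actual RedBlue API:

```python
# Hypothetical sketch of RedBlue-style operation labeling (illustrative
# only, not the actual RedBlue API).

class Replica:
    def __init__(self):
        self.balance = 0
        self.log = []  # (color, op) pairs applied locally

    def deposit(self, amount):
        # Blue: deposits commute with each other, so they may execute
        # without any cross-replica coordination.
        self._apply(("deposit", amount), color="blue")

    def withdraw(self, amount):
        # Red: concurrent withdrawals can drive the balance below zero,
        # so they must run under strong (totally ordered) consistency.
        self._apply(("withdraw", amount), color="red")

    def _apply(self, op, color):
        if color == "red":
            self._coordinate(op)  # placeholder for a consensus round
        kind, amount = op
        self.balance += amount if kind == "deposit" else -amount
        self.log.append((color, op))

    def _coordinate(self, op):
        pass  # a real system would establish a global order here

r = Replica()
r.deposit(100)  # blue: applied immediately, no coordination
r.withdraw(30)  # red: would first be coordinated across replicas
```

The point the sketch makes is exactly the problem named above: it is the developer who must decide, per operation, whether blue suffices or red is required, and a wrong choice silently sacrifices either safety or availability.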
Automating Consistency by Prescribing the Programming Model
Approaches in this category prescribe programming models that avoid consistency problems. Lasp, e.g., forces programmers to explicitly model an application’s data flow via combinator functions on conflict-free replicated data types (CRDTs) and, in turn, ensures eventual consistency without synchronization. As the name suggests, replicas of a CRDT can be merged without consensus with strong eventual consistency semantics, facilitating the implementation of systems for asynchronous problems that are solvable without consensus. Bloom’s model also facilitates the implementation of solutions for asynchronous problems: following the CALM theorem, which states that a problem has a consistent, consensus-free distributed implementation iff it is monotonic, Bloom permits only logically monotonic programs, and it automatically detects and prevents non-monotonicity.
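To see why CRDT replicas can merge without consensus, consider a grow-only counter (G-Counter), one of the simplest state-based CRDTs; the Python below is a generic textbook sketch, not Lasp’s or Bloom’s implementation:

```python
# G-Counter: each replica increments only its own slot; merge takes the
# pointwise maximum. Because merge is commutative, associative, and
# idempotent, replicas converge to the same value no matter in which
# order (or how often) they exchange state: strong eventual consistency
# without any consensus round.

class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()  # two increments at replica 0
b.increment()                 # one concurrent increment at replica 1
a.merge(b); b.merge(a)        # gossip in either direction
assert a.value() == b.value() == 3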
Another line of work relies on a reactive programming (RP) model, the one implemented in REScala, a library-based extension to Scala. Reactive programming is a good match for massively distributed interactive applications as it enables direct-style encoding of asynchronous interactions. The work on thread-safe RP exploits RP’s “glitch-freedom” property (ensuring that the propagation of updates through affected computations happens in one logical step without observable intermediate results) and extends it to concurrent and distributed settings: reactive updates by default execute as if they were concurrent transactions with abort-free strict serializability.
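Glitch-freedom is easiest to see on a tiny dependency graph. The following Python sketch (illustrative only, not REScala’s propagation algorithm) uses the classic example x → a = x + 1, x → b = x * 2, (a, b) → c = a + b:

```python
# Glitch-free propagation re-evaluates derived signals in topological
# order of the dependency graph, so c always sees a consistent pair
# (new a, new b):
def propagate_glitch_free(x):
    a = x + 1
    b = x * 2
    return a + b  # evaluated once, after both inputs are up to date

# A naive push-based scheme may recompute c as soon as a changes,
# exposing an intermediate result that mixes new a with stale b:
def propagate_naive(x_old, x_new):
    b_old = x_old * 2
    a_new, b_new = x_new + 1, x_new * 2
    glitch = a_new + b_old  # observable intermediate value ("glitch")
    final = a_new + b_new
    return glitch, final

glitch, final = propagate_naive(1, 2)
assert glitch != final               # 5 != 7: the glitch is observable
assert propagate_glitch_free(2) == final
```

The thread-safe RP work mentioned above generalizes exactly this guarantee: even with concurrent updates from multiple threads or nodes, no observer ever sees such a mixed intermediate state.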
This is great, and local performance was shown to be reasonable, but strong consistency does not tolerate partitions. To embrace them, fault-tolerant RP encapsulates the mergeable semantics of CRDTs into reactive abstractions, making them subject to REScala’s composable reactive updates, and guarantees consensus-free strong eventual consistency for the entire application. This language design retains modularity and composability, which are challenging when application developers use CRDTs directly.
In summary, Lasp and Bloom automate consensus-free strong eventual consistency at the price of limited expressiveness, disallowing arbitrary compositions of distributed data types. Also, they do not offer stronger consistency as a built-in feature, i.e., they lack automation for applications that demand it. REScala relies on a less restrictive programming style. Its semantics (a) is compatible with any consensus-free mergeable data type or protocol with mergeable semantics, and (b) allows arbitrary compositions thereof through composable reactive updates. By exchanging REScala’s update propagation, developers can programmatically choose strong or strong eventual consistency for parts of their application, but the selection is not automated.
Automatically Deriving Hybrid Consistency Levels from Safety Invariants
Methods for formally reasoning about consistency exist, e.g., Blazes, explicit consistency, reasoning for replicated databases, Q9, SIEVE, Lucy, and more. The goal is to integrate them, together with hybrid-consistency data systems, into programming languages. CAROL comes close to this idea. It uses CRDTs to replicate data and a refinement-typing discipline for expressing safety properties. The type system uses pre-defined data types with consistency guards to spot invariant violations. The Z3 SMT solver verifies the compatibility of data type operations and consistency guards. CAROL hides much of the complexity of reasoning about hybrid consistency levels. But the abstraction breaks once we need functionality not covered by the predefined data types. I believe the composability of reactive updates to be an enabler for alleviating these limitations. It remains to be explored how to extend (a) the reactive programming model with means to express safety properties to be maintained during the execution, and (b) the language machinery with automated reasoning to detect places where consensus is needed to uphold invariants, along with automatic integration of consensus levels.
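The core question such reasoning answers can be illustrated with a toy classifier. The Python sketch below is my own simplification of the idea behind explicit consistency and CAROL, not their actual algorithms: an operation needs coordination if two concurrent executions of it, each valid in isolation, can jointly violate the declared invariant once both effects apply. Real systems answer this with an SMT solver over symbolic states; the sketch brute-forces a small finite state space instead:

```python
# Toy invariant-driven classification (hand-rolled simplification of the
# explicit-consistency idea; real tools discharge this with SMT solving).
def needs_coordination(op, invariant, states):
    for s in states:
        if invariant(op(s)):              # op alone is safe from state s...
            if not invariant(op(op(s))):  # ...but two concurrent ops are not
                return True
    return False

invariant = lambda balance: balance >= 0   # declared safety property
deposit = lambda b: b + 10
withdraw = lambda b: b - 10

states = range(0, 30)  # small finite state space to brute-force
assert not needs_coordination(deposit, invariant, states)   # always safe
assert needs_coordination(withdraw, invariant, states)      # e.g. at b = 10
```

This is precisely the decision that, in the vision sketched above, the language machinery should derive automatically from declared invariants: deposits may run with eventual consistency, while withdrawals get consensus woven in.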
Concluding Remarks and Outlook
I hope the blog post conveyed ideas about how PL research is helping and can help pave the road to the “universal internet machine”. While questions remain to be answered on the way towards correct-by-design distributed applications, questions like: what are the “proper abstractions” and how to get there, how to achieve efficient automation, how to enable flexibility in terms of correctness notions and of the respective reasoning and enforcement techniques, and many more, I would like to conclude by mentioning another domain that I believe would greatly benefit from PL-based automation: machine learning and the broader AI. Quoting from a paper by the Julia/Flux team: “ML engineering is fundamentally a programming languages problem”. Foundations for differentiable programming are great stepping stones in automating ML. But there are many more stepping stones to jump on towards an AI programming paradigm, and at the crossroads of the latter and the domain of this blog post, on the road to the “intelligent internet machine”. These are exciting times for PL research.
Bio: Mira Mezini is a Full Professor of Computer Science at the Technical University of Darmstadt in Germany where she leads the software technology research group (STG). Her current research interests are language design and program analysis and their applications in distributed computing, security, and artificial intelligence.
Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.