Thursday, August 11, 2016

Building robust distributed systems

Distributed systems are notoriously error-prone.  It's hard enough for programmers to reason about multiple actions occurring on different nodes, but they also need to understand the messages that are sent among these nodes.  (The problem becomes murkier yet in the presence of failures.  A distributed system is designed to be faster and/or more reliable than a single-node system, but a distributed system contains more computers and connections, so it suffers component failures more frequently than the single-node system!)

I have been pursing several approaches to aid in the creation of distributed systems.  Here are two of those projects.

To help programmers understand and debug distributed systems, my colleagues and I have created a visualization tool, ShiViz.  Programmers typically debug distributed systems by examining their logs, but the logs are hard to understand.  ShiViz creates a time-space diagram, a standard representation that programmers already use to represent the hosts and the ordering of messages between them.  See the accompanying pictures of a conceptual time-space diagram and how ShiViz represents an actual execution.  A time-space diagram shows when messages were sent and received, and thus it establishes causality:  if event E1 occurred at node N before any effect of event E2 reached node N, then event E1 did not depend on E2.  ShiViz works with existing logs, though it enhances them with causality information between events.  ShiViz takes information that programmers would otherwise have to grep as text, and presents it in a form that programmers can explore, search, and understand.  You can try ShiViz online, without having to install anything on your computer.  Our article "Debugging Distributed Systems" appeared this month in Communications of the ACM.  The article describes approaches to debugging distributed systems and shows when our visualization approach is a good choice.

To help programmers create distributed systems that are correct by construction, my colleagues and I have created Verdi, a framework for proving correctness of distributed systems.  Verdi is a heavyweight approach built upon the Coq proof assistant:  a programmer writes both source code and a proof of correctness for the source code, and thanks to the Curry-Howard isomorphism the program type-checks only if the proof is valid.  This gives strong guarantees.  While it is challenging to write such proofs for sequential code, it is much worse for distributed systems.  Our system, Verdi, enables a programmer to write and prove a sequential program in Coq.  Then Verdi transforms the program into a distributed system, and it transforms the proof into a proof of the distributed system.  This approach, which we call "verified system transformers", abstracts away the difficulty of writing and proving a distributed system.  Using Verdi, we created the first formally-verified implementation of the widely-used Raft consensus protocol.  Our PLDI 2015 paper describes Verdi, and our CPP 2016 paper describes the proof in more detail.  The CPP 2016 paper also explains the methodology that we developed while writing 50,000 lines of proof --- a methodology that significant eased the burden of using Coq and can be adopted by other researchers.