In Search of an Understandable Consensus Algorithm (Raft)

Paxos Made Simple

ZooKeeper: Wait-free coordination for Internet-scale systems

Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore

Impossibility of Distributed Consensus With One Faulty Process

Consensus in the presence of partial synchrony

Viewstamped Replication Revisited


Don’t be lazy, be consistent: Postgres-R, a new way to implement Database Replication

PacificA: Replication in Log-Based Distributed Storage Systems

Chain Replication for Supporting High Throughput and Availability

Byzantine Chain Replication

A Comprehensive Study of Convergent and Commutative Replicated Data Types

Optimistic Replication


Stronger Semantics for Low-Latency Geo-Replicated Storage (Eiger)

Calvin: Fast Distributed Transactions for Partitioned Database Systems

Sinfonia: a new paradigm for building scalable distributed systems

Understanding the Limitations of Causally and Totally Ordered Communication

A Response to Cheriton and Skeen’s Criticism of Causal and Totally Ordered Communication

MDCC: Multi-Datacenter Consistency

Spanner: Google’s globally distributed database


Transactional Memory: Architectural Support for Lock-Free Data Structures

Software Transactional Memory

Sharing Memory Robustly in Message-Passing Systems

Wait-free Synchronization


ZooKeeper’s atomic broadcast protocol: Theory and practice

Kafka (LinkedIn)

Omega: flexible, scalable schedulers for large compute clusters

Thialfi: A Client Notification Service for Internet-Scale Applications

Large-scale Incremental Processing Using Distributed Transactions and Notifications


Note: We haven’t included anything already covered in 6.824, but you should read those papers too.

Paxos Made Live: An Engineering Perspective

Viewstamped Replication: A new primary copy method to support highly-available distributed systems

Time, Clocks, and the Ordering of Events in a Distributed System

The Part-Time Parliament

Paxos Made Practical

The papers from SOSP 2013