Papers
Consensus
In Search of an Understandable Consensus Algorithm (Raft)
ZooKeeper: Wait-free coordination for Internet-scale systems
Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore
Impossibility of Distributed Consensus With One Faulty Process
Consensus in the presence of partial synchrony
Viewstamped Replication Revisited
Replication
Don’t be lazy, be consistent: Postgres-R, a new way to implement Database Replication
PacificA: Replication in Log-Based Distributed Storage Systems
Chain Replication for Supporting High Throughput and Availability
A Comprehensive Study of Convergent and Commutative Replicated Data Types
Causality/Transactions
Stronger Semantics for Low-Latency Geo-Replicated Storage (Eiger)
Calvin: Fast Distributed Transactions for Partitioned Database Systems
Sinfonia: a new paradigm for building scalable distributed systems
Understanding the Limitations of Causally and Totally Ordered Communication
A Response to Cheriton and Skeen’s Criticism of Causal and Totally Ordered Communication
MDCC: Multi-Datacenter Consistency
Spanner: Google’s globally distributed database
Concurrency
Transactional Memory: Architectural Support for Lock-Free Data Structures
Sharing Memory Robustly in Message-Passing Systems
Industry
ZooKeeper’s atomic broadcast protocol: Theory and practice
Omega: flexible, scalable schedulers for large compute clusters
Thialfi: A Client Notification Service for Internet-Scale Applications
Large-scale Incremental Processing Using Distributed Transactions and Notifications
Other
Note: We haven’t included anything already covered in 6.824, but you should read those papers too.
Paxos Made Live: An Engineering Perspective
Viewstamped Replication: A new primary copy method to support highly-available distributed systems
Time, Clocks, and the Ordering of Events in a Distributed System
The papers from SOSP 2013