Spanner

Spanner is a highly distributed, externally consistent database developed by Google. It provides replication and transactions over a geographically distributed set of servers. Spanner uses time bounds, Paxos, and two-phase commit to ensure external consistency.

Interesting Ideas

Spanner uses clocks with bounded uncertainty to provide synchrony between servers. It also shards an application’s data to provide fine-grained load distribution.

Spanner uses the TrueTime API (TT) to synchronize time between servers. A call to TT.now() gives a range guaranteed to contain the actual time.

Spanner has two kind of reads. The first reads the most recent value of a key or set of keys, called a read-only transaction. The second is a snapshot read which is executed at a specific timestamp in the past. Using a combination of Paxos leases and TrueTime guarantees to agree on a timestamp, Spanner can execute snapshot reads and read-only transactions without locks or two-phase commit.

Spanner performs schema changes atomically without blocking, by picking a future time for the change to occur. Other read and write operations choose a timestamp so that at each replica, the operation performs either before or after the schema change.

Subtleties

Since Spanner commits only when TT.now.after(timestamp) is true, we are guaranteed that from now on TT.now.latest() will always be larger than the committed timestamp on all servers.

Spanner very carefully chooses timestamps for RW transactions to ensure when they are safely visible. They call this the commit-wait rule.

Questions

Why is the write throughput so low? 4K ops/sec for 50 paxos servers of one replica each (so not running Paxos), not waiting for any other commit times, seems very low.
Why is the throughput experiment in 5.1 CPU bound?
What happens if we use logical time(which preserves causality) rather than the true time? Maybe external consistency breaks, but the system is still sequentially consistent.

6.824 notes