There is a write performance trade off when consistently replicating across multiple data centers due to the high latency when sending messages between data centers (often > 100ms). Existing protocols use forms of two phase commit which incur 3 blocking round trips between data centers.

MDCC introduces a new cross data center transaction protocol for transactions over key-value stores (no range scans) that uses many forms of Paxos to achieve one synchronous cross-data center message in the common case.

A similar system, Megastore, uses per-shard Paxos groups to replicate between datacenters. Unfortunately, heavy conflicts in Megastore reduce performance to around four transactions per second.

MDCC Goals

MDCC aims to have the following properties:

• Read-commited isolation: Transactions never read records that have not been fully committed. The paper notes that this is the form that most commercial and open source databases provide, so ensuring this level of consistency is sufficient.
• No lost updates: This is when transaction $T_1$ reads a record $X$, another transaction $T_2$ commits an update to $X$, then $T_1$ commits a write to $X$ which overwrites $T_2$’s commit without having ever read it.
• Non-stale reads: Be able to ensure that reads are always of the most recent value.
• Low latency: Minimize number of synchronous round-trips — in this case to one.
• Costs scale to true transaction overlap: There should not be large, fixed costs incurred for each transaction commit. Rather, the costs should be proportional to the amount of actual contention between transactions.
• Integrity Constraints: Support some forms of integrity Constraints. The paper discusses FIELD >/=/< VALUE type constraints.

Key Ideas

• Paxos per record
• Awesome — could shuffle record ownership depending on access patterns!
• Optimistic concurrency control (keep write sets)
• During Paxos, rather than propose the new value, propose an option to update the new value. If the option succeeds, then a later write can safely set the value asynchronously.
• Move more logic into quorum nodes than in regular Paxos
• Decide write-write conflicts
• Enforce integrity constraints
• Reduce round-trips via
• Multi-Paxos: select master every $N$ commits
• Fast-Paxos: clients directly talk to quorum nodes
• Reduce conflicts
• Generalized Paxos: commutative operations
• “Demarcation protocol” conservatively figures out how many conservative ops are allowed
• Failure
• Send client state to every node on commit (every record’s quorum)
• write set + xact id

Questions

• How does a per-record protocol affect/enable fine-grained tuple placement?
• What to do to make repeatable reads work?
• How does Fast-Paxos work?
• How powerful/efficient is the demarcation protocol?
• How do you run range-queries or queries that need indexes?

Overview of MDCC.

The MDCC paper flows by identifying issues with the best solution proposed so far, and introducing a new technique to resolve the issues. This post will flow in a similar manner.

Per-record Paxos

Megastore is slow because it runs Paxos at a per shard granularity, causing false conflicts. What if we ran Paxos per record?

Paxos on a record runs as follows:

• up to 1 round-trip for client to tell master “I want to commit my transaction”
• 1 round-trip to elect master for record
• 1 round-trip to set the value (learn the value)
• 1 round-trip to make value visible (can be asynchronous)

Multi-record Transactions

The key idea is to add logic into the acceptor nodes to perform write-write detection.

If I want to commit records A and B, I run Paxos for each record. When running Paxos, instead of setting the values in the second round, send an option containing the record’s read and write versions. If there’s no write conflict, the node returns OK, otherwise NO

If I hear OK for every record, I tell each record’s quorums to write “commit” into their option log. Otherwise I write “abort”. This is different than SQL transaction aborts!.

Also, quorum nodes are only allowed to accept options if all previous options have been decided (written “commit” or “abort”). Otherwise write-write conflicts can’t be correctly detected.

Supporting Failures

The following sequence of actions could block all nodes participating in a transaction.

• I send options to acceptors for records A and B.
• They accept and respond to me. The options are undecided, so they can’t accept anything else until I write a “commit” or “abort” in the options.
• I go on vacation before writing anything.

I was the only one that knew what other records were updated in my transaction! So the participating nodes can’t make anymore progress until I’m back from vacation.

The solution is to send my transaction state (transaction id and write set) along with the options. That way if I disappear, each record’s quorum can make a clone of me using the transaction state.

At this point, multi-record transactions using per-record Paxos “works”. Everything after this point is to make Paxos consensus run faster

Multi-Paxos

We still require 3 round-trips to commit (send to master, elect master, pick option). We can use Paxos to elect a master for a block of rounds which amortizes the cost of electing a master to 0 (assuming the masters are stable).

The communication is now

• up to 1 round-trip for client to tell master “I want to commit my transaction”
• 0 round-trips (amortized) to elect master for record
• 1 round-trip to send option to acceptors
• 1 asynchronous round-trip to write “commit”/”abort” on options
• 1 asynchronous round-trip (can be bundled with other messages) to make commit visible

Fast-Paxos

We still need two synchronous round-trips — one to contact the master if it is in another datacenter, and one to send the options. Can we avoid the former message?

Yes! Fast-Paxos lets clients bypass the master to directly send options to the quorum nodes. Recall that the master’s job is to serialize the log entries. If we want to de away with the master, Fast-Paxos needs a larger quorum size (a fast quorum). If a fast quorum responds OK, then the client can consider the option committed. Otherwise it falls back to normal Paxos with some details.

None of us had read the Fast-Paxos paper so we were ill-equipped to work through why this section of the paper works. Read below for how we think Fast-Paxos works based on first principles. Read the Fast-Paxos paper to actually understand how it works.

Generalized-Paxos

If my operations are all commutative, I want to avoid the cost of serialization.

Normal Paxos only allows 1 value update per round. Generalized Paxos allows M as long as those values are commutative. For example, if in the same round David wants to increment record A by 1 and Jane wants to increment record A by 1, then the value of A is 2 regardless of the log order on any of the quorum nodes.

What this means is the acceptors aren’t simply writing values anymore, they also understand “delta operations”, how they commute, and how to apply them.

Integrity constrains

If I want to support integrity constraints (e.g., A >= 0), the acceptors need to understand and enforce them.

The paper restricts the domains of the delta operations (e.g., can only add/subtract 1 or 2) and conservatively computes the maximum number of operations that are allowed before it is at all possible (on a single node and any combinations of transactions in a quorum) that the constraint is violated.

This is useful because many transactions are simply updating counters.

That’s a lot of steps, are there any more tricks?

We usually want reads to go fast, but if I’m reading record A and A’s replica in my datacenter was not in the quorum, then it would give me stale data. One way to get around this is to make sure the replica in my data center is always in the quorum. This works if my data center is the only one reading this record. For example, if each data center is collecting the local weather data and replicating the data, it can be pretty sure that only people living around the data center will want the local data.

Fast-Paxos is really really bad (in terms of cross datacenter round-trips) if a collision happens. For those cases it makes more sense to stick with a master and use Multi-Paxos. Figuring out when to use each is important. The paper provides a heuristic that sticks with Multi-Paxos and periodically tries to “upgrade” to Fast-Paxos.

That’s it! MDCC collects a number of Paxos variants and puts them together into a nice cross data-center commit protocol.

A Note on Fast-Paxos

The following was derived from first principles.

The Fast-Paxos protocol was described (briefly) in the paper as

Any app-server can propose an option directly to the storage (quorum) nodes, which in turn promise only to accept the first proposed option. Simple majority quorums, however, are no longer sufficient to learn a value and ensure safeness… If a proposer receives an acknowledgment from a fast quorum, the value is safe and guaranteed to be committed.

A classical Paxos quorum requires the property that for any two quorums, $Qc_1$ and $Qc_2$,

For Fast-Paxos it has the requirement that for any two fast quorums, $Qf_1$ and $Qf_2$, and regular quorum $Qc$,

One of our questions was why a simple majority, $Qc$, is insufficient and we need a $Qf \supset Qc$. The following is an example of when it fails.

Safety, or correctness, in Paxos means that “At most one value can be learned”. One example is that even with $f$ failures, a committed log position stays committed.

Lets say $Qf = Qc$ is a majority. Consider the following logical proposals and their quorums where each column is the state of an acceptor (1-5), and each row is the state at one Paxos round. C1, C2, C3 are clients (proposers). C1 commits A and manages to communicate with acceptors 1-3, C2 commits B and so on.

    1 2 3 4 5
C1  A A A
C2      B B B
C3  C C   C


Say C1 and C2 happen concurrently, and C3 happens later (as a normal Paxos round, or Fast-Paxos). Then the actual acceptances and what round each node thinks its in may be

 1 2 3 4 5 Round (as understood by local node)
A A A B B 1
B     2


Now if C3 tries to set C, its quorum could be 1,2,4 which don’t know that A won in round 1 and B “won” in round 2, because they have only seen values for round 1, so the nodes happily accept C in (what they think is) round 2. This effectively overwrites B, violating safety.

 1 2 3 4 5 Round (as understood by local node)
A A A B B 1
C C B C   2