EPaxos is a leaderless Paxos variant which tries to reduce latencies for a geo-distributed replica group by enabling the client to use the replica with the lowest round-trip latency as the operation leader, and optimistically skipping a round of replica communication by inter-operation conflict detection
Instead of explaining EPaxos in this post, I will go over an example of how the recovery protocol works. This example confused me for a while since it is not actually addressed in the SOSP’13 paper — the full recovery protocol is in the tech report. Therefore, I recommend you read the recovery protocol in the tech report before you start playing EPaxos with toy examples.
Failure and recovery scenario
Consider a replica group of five nodes. The fast-path quorum size for five replicas is , making the EPaxos fast-path quorum the same size as a simple majority. Let us name the replicas like so:
One situation that could be problematic occurs when a client successfully completes one operation (which has side-effects i.e. writes), then issues another one, but then some failure occurs before the second operation completes and the recovery protocol incorrectly orders the second operation before the first.
Suppose the first operation uses A as leader with C and D making up the other fast-path quorum members while the second operation uses B as leader with D and E as quorum members. Furthermore, let’s say that A fast-path commits operation 1, responding success to the client but crashing right before it sends “commit” messages to C and D. Then, B sends pre-accepts to D and E (without operation 1 in the deps because B doesn’t know about operation 1) and also crashes. The system has suffered two failures but must recover and continue operation since five replicas can tolerate two failures in EPaxos. However, the remaining replicas look like this (with subscripts representing which operation the replica has pre-accepted in its log):
At this point, each replica knows about only one of the two operations but not whether it is committed, and both operations conflict with each other (see the tech report section 6.2 number 6 for exactly what “conflicts” means). But when C, D, or E initiates recovery for operations 1 and 2, how does it know which operation, if any, was fast-path committed and in which order?
Turns out we can figure out this mess if we know the leaders for both operations — this is exactly what EPaxos does in this situation. If the leader of an operation, alpha, is in the fast-path quorum of a different operation, beta, then beta could not have been fast-path committed if the pre-accept messages for alpha do not contain beta in deps.
The recovery would proceed as follows. Suppose C initiates recovery for operation 1. First, it asks the fast-path quorum what their logs contain. C observes that at least replicas (C and D) have pre-accepted an operation with the identical default attributes, and then tries to convince other replicas to accept the operation until at least have accepted it. When C asks E to accept operation 1, E refuses since operation 1 conflicts with operation 2 and E has already pre-accepted operation 2. Therefore, C will defer the recovery of operation 1 (section 6.2, step 7-e) and instead try to recover the operation it conflicts with, operation 2.
For the recovery of operation 2, replicas C and D will not accept operation 2 and will respond with the conflicting operation and the identity of the leader for the conflicting operation. The recovering replica will now observe that the leader for operation 1, replica A, is in the fast-path quorum for operation 2, but that A clearly didn’t know about operation 2; otherwise it would have listed operation 2 as a dependency for operation 1 in the pre-accept messages. Thus operation 2 could not have been fast-path accepted. Operation 2 would then be filled with a no-op and the recovery of operation 1 would be resolved, concluding that operation 1 was fast-path committed (note that it is always safe to conclude that an operation was fast-path committed if there are no conflicts, even if the operation in question wasn’t).