Raft

Raft is a new consensus algorithm that is optimized for “ease of implementation”. Its main purpose is to present a protocol that is more understandable than Paxos, which, for many practitioners, is difficult to implement correctly. Viewstamped Replication is more similar to Raft, however it is far less popular than Paxos, so it is unfortunately not focused on in the paper.

All of these consensus algorithms operate as long as a majority of servers are functioning, so you would require $2f+1$ servers to tolerate $f$ failures. Servers are assumed to fail by stopping, though they might recover with state from stable storage (such as disk). We read this paper because there are very few consensus algorithms, and something that is easier to understand than Paxos sounds great!

Raft pros and cons

The paper designs and ensures invariants around the data structure that Paxos, in practice, is used for – a replicated state machine log. This abstraction is nice, because it is easier to think about operating on a sequential log and ensuring a small number of properties rather than running multiple independent instances of Paxos. Raft assigns leaders on a per-term basis (i.e., epochs), and terms are used as an implicit coordination mechanism. It also supports configuration changes (removing or adding nodes from the system) while still serving requests.

The bulk of the paper is fairly digestible and it was nice that the mechanisms to ensure safety were concentrated in a small handful of subsections. However distributed consensus is hard, so verifying Raft’s safety still took effort. Many members of the reading group have a basic understanding of Paxos, so we didn’t necessarily feel that understanding Raft’s correctness was easier than understanding Paxos.

One of the group members noted that he wrote an implementation of Raft from the paper and some basic tests “just worked”, which was surprising. However, note that a production implementation of Raft would require many of the same things required to properly implement and run Paxos (see Paxos Made Live). Namely, you’d have to consider and handle disk corruptions, safely implement snapshots, and use something like leases so that reads don’t always require running a round of the consensus algorithm. These added features are necessary in a practical system, and would complicate a Raft implementation as much as a Paxos one.

In terms of performance, Raft can add log entries with $f+1$ roundtrips total, whereas Paxos requires $2(f+1)$ . Paxos can be further optimized by using leaders, and both systems can benefit from batching log entries. The space used for “understandability” experiments could have been better used to show more performance numbers or more explanation about correctness.

Main Subtleties

One of the most interesting parts of the algorithm is the commit point for a log entry, or the point at which you can determine that even if $f$ servers fail, the log entry will be present. It turns out that because nodes perform leader election by consulting and sending only their log length and term number, a log entry that is simply on the majority of the nodes can still be overwritten when a new leader is elected. The paper bolds the actual requirement:

A log entry may only be considered committed if the entry is stored on a majority of the servers; in addition, at least one entry from the leader’s current term must also be stored on a majority of servers.

Figure 7 shows how this can happen. Consider this case with five servers, S1, S2, S3, S4, and S5 (we use the notation Value:Term in each slot). S1 is leader in term 2, and replicates an entry X for slot 2, it goes to 2 servers (S1, S2).

slot    1     2
S1     1:1  X:2
S2     1:1  X:2
S3     1:1
S4     1:1
S5     1:1

S1 crashes before it can finish replicating X:2, S5 is elected leader for term 3 (with votes from S3, S4, and S5) and stores Y in slot 2

slot    1     2
S1     1:1  X:2
S2     1:1  X:2
S3     1:1
S4     1:1
S5     1:1  Y:3

S5 crashes before it can replicate Y, and S1 is elected leader again, now for term 4. S1 finishes replicating X:2, and adds a new entry, Z in slot 3, but crashes before it can completely replicate it

slot    1     2     3
S1     1:1  X:2   Z:4
S2     1:1  X:2  
S3     1:1  X:2
S4     1:1
S5     1:1  Y:3

S5 becomes leader for term 5 because it has the most recent term out of the remaining nodes. S5 will force a Y into slot 2 (Note that S5 has Y in slot 2 and S1,S2,S3 had an X in slot 2.)

slot    1     2     3
S1     1:1  X:2   Z:4
S2     1:1  Y:3 
S3     1:1  Y:3
S4     1:1
S5     1:1  Y:3

Note that if S1 had gotten Z into two additional logs (two out of S2, S3, S4), then S5 could never have been elected leader, because its log was not long enough.

We found this surprising.

Conclusions

There are some nice aspects of Raft – the log as a first order primitive, the method of configuration changes, and only two RPCs (though one could argue AppendEntry is heavily overloaded). Unfortunately, the authors’ main point falls flat for two reasons: First, they primarily compare their system to Paxos while their system is much closer to Viewstamped Replication. They claim this is reasonable because Paxos is much more widely used. Second, they claim Raft is much simpler and easier to implement (compared to Paxos). However, in our reading, Raft didn’t appear to be any less subtle (it took effort to convince ourselves of the correctness of the example explained here) and several things that make Paxos difficult to implement are still evidenced in Raft.

We don’t see many new consensus algorithms, so this was still very interesting to read and discuss.

Open Questions

Did they make any comparisons between Raft and Paxos that couldn’t have been made between Paxos and VR?