Raft: Understandable Consensus with an Explicit Leader

What This Concept Is

Raft (Ongaro and Ousterhout, 2014, USENIX ATC) is a consensus algorithm designed explicitly for understandability. It is operationally equivalent to Multi-Paxos but decomposes the problem into three sub-problems that can each be reasoned about independently:

Leader election: at any time, at most one leader per term. Elections run when the leader is suspected dead.
Log replication: the leader appends client commands to its log and replicates them to followers; entries become committed when a majority has stored them.
Safety: invariants prevent a new leader from overwriting committed entries.

Roles and Terms

Every node is in one of three states: Follower (default, passive), Candidate (trying to become leader), Leader (active, handles client requests).

Time is divided into terms, numbered monotonically. Each term has at most one leader; some terms have no leader (a failed election). Every RPC carries the sender's term; any node seeing a higher term immediately reverts to follower.
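The term rule is small enough to sketch directly. This is a minimal illustration, not code from any particular Raft implementation; the `Node` class and `observe_term` method are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"


@dataclass
class Node:
    current_term: int = 0
    role: str = FOLLOWER

    def observe_term(self, rpc_term: int) -> bool:
        """Apply the term rule to the term carried by any incoming RPC.

        Returns False when the sender is stale and the message should be rejected.
        """
        if rpc_term > self.current_term:
            # Our view is out of date: adopt the newer term and step down.
            self.current_term = rpc_term
            self.role = FOLLOWER
            return True
        # Equal terms are fine; a lower term means the sender is behind.
        return rpc_term == self.current_term


old_leader = Node(current_term=3, role=LEADER)
old_leader.observe_term(4)
print(old_leader.role, old_leader.current_term)  # follower 4
```

Because every message passes through this check, a leader that was partitioned away discovers it has been superseded the first time it hears from anyone in a newer term.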

Leader Election

Followers reset a randomized election timeout (typically 150-300ms) on each heartbeat.
If no heartbeat within the timeout, follower becomes Candidate: increments its term, votes for itself, sends RequestVote RPCs to all peers.
Each node grants at most one vote per term. A candidate that collects votes from a majority becomes Leader and starts sending heartbeats.
Ties (split vote) resolve via randomized timeouts in the next term.
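The voting rules above can be sketched as follows. The names (`Voter`, `handle_request_vote`, `wins`, `election_timeout_ms`) are hypothetical, and the sketch deliberately leaves out the log up-to-date check, which belongs to the safety rules rather than to the basic election mechanics.

```python
import random
from dataclasses import dataclass
from typing import Optional


def election_timeout_ms(low: int = 150, high: int = 300) -> float:
    """Pick a fresh randomized timeout; randomization makes repeated split votes unlikely."""
    return random.uniform(low, high)


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[str] = None

    def handle_request_vote(self, term: int, candidate_id: str) -> bool:
        if term < self.current_term:
            return False  # stale candidate
        if term > self.current_term:
            # New term: adopt it and forget the vote cast in the old term.
            self.current_term = term
            self.voted_for = None
        # At most one vote per term; re-granting to the same candidate is harmless.
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def wins(votes: int, cluster_size: int) -> bool:
    """A candidate needs a strict majority of the full cluster, crashed nodes included."""
    return votes > cluster_size // 2


v = Voter(current_term=3)
print(v.handle_request_vote(4, "N2"))  # True: first request seen in term 4
print(v.handle_request_vote(4, "N3"))  # False: already voted in term 4
```

Counting against the full cluster size, not just the reachable nodes, is what prevents two partitions from each electing a leader in the same term.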

Log Replication

Leader appends the client command to its log at index i.
Leader sends AppendEntries(term, prevLogIndex, prevLogTerm, entries[], leaderCommit) to every follower.
Follower accepts only if prevLogIndex/prevLogTerm match its log (this is the consistency check). Mismatch: follower rejects, leader steps back one index and retries.
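When a majority has stored entry i and entry i belongs to the leader's current term, the leader marks i committed and advances commitIndex; entries from earlier terms become committed indirectly once a later current-term entry commits. The next AppendEntries (heartbeat or otherwise) carries the new leaderCommit; followers apply entries up to that index to their state machines.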
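A follower's side of the consistency check, and the leader's majority computation, might look like the sketch below. It assumes 1-based log indices and represents each entry as a `(term, command)` tuple; `append_entries` and `majority_match_index` are hypothetical helper names, and networking, persistence, and `leaderCommit` handling are omitted.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(log: List[Entry], prev_log_index: int, prev_log_term: int,
                   entries: List[Entry]) -> bool:
    """Follower side. Indices are 1-based; prev_log_index == 0 means 'start of log'."""
    # Consistency check: we must hold an entry at prev_log_index with a matching term.
    if prev_log_index > len(log):
        return False
    if prev_log_index > 0 and log[prev_log_index - 1][0] != prev_log_term:
        return False
    for offset, entry in enumerate(entries):
        pos = prev_log_index + offset  # 0-based position where this entry belongs
        if pos < len(log) and log[pos][0] != entry[0]:
            del log[pos:]  # conflicting entry: drop it and everything after it
        if pos >= len(log):
            log.append(entry)
    return True


def majority_match_index(match_index: List[int]) -> int:
    """Highest index stored on a majority; the list includes the leader's own last index."""
    ranked = sorted(match_index, reverse=True)
    return ranked[len(ranked) // 2]


follower = [(1, "a"), (2, "stale")]
print(append_entries(follower, 1, 1, [(3, "new")]))  # True
print(follower)                                      # [(1, 'a'), (3, 'new')]
print(majority_match_index([8, 8, 8, 6, 0]))         # 8
```

In Raft, the leader advances commitIndex to the value returned by the majority computation only when the entry at that index carries its current term.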

Safety Invariants

Election restriction: a node votes for a candidate only if the candidate's log is at least as up-to-date (last entry's term, then last index) as the voter's. Ensures the elected leader has every committed entry.
Leader append-only: a leader never deletes or overwrites entries in its own log.
Log matching: if two logs have an entry with the same index and term, their prefixes are identical.
Leader completeness: if an entry is committed in term T, it is present in every leader's log for terms > T.
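The "at least as up-to-date" comparison in the election restriction reduces to comparing two pairs. A minimal sketch, assuming each log is a list of `(term, command)` tuples with 1-based indices; the function names are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def last_term_and_index(log: List[Entry]) -> Tuple[int, int]:
    """(term of last entry, index of last entry); an empty log is (0, 0)."""
    if not log:
        return (0, 0)
    return (log[-1][0], len(log))


def candidate_is_up_to_date(candidate_log: List[Entry], voter_log: List[Entry]) -> bool:
    # Tuple comparison is lexicographic: last term decides first, then last index.
    return last_term_and_index(candidate_log) >= last_term_and_index(voter_log)


longer_but_older = [(1, "x"), (1, "y"), (1, "z")]
shorter_but_newer = [(1, "x"), (2, "y")]
print(candidate_is_up_to_date(shorter_but_newer, longer_but_older))  # True
print(candidate_is_up_to_date(longer_but_older, shorter_but_newer))  # False
```

Note that a longer log does not win automatically: the term of the last entry takes priority over length.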

Why It Matters Here

Raft is the default you will encounter in production:

etcd (Kubernetes control plane, CoreOS), Consul (HashiCorp), TiKV, CockroachDB, MongoDB replica sets (variant), Redis Raft, HashiCorp's raft library all use Raft.
It is the algorithm you should be able to trace on paper. If you cannot, you cannot debug Kubernetes, etcd, Consul, or any coordination-backed system when it misbehaves.

Concrete Example: A Full Election

5 nodes. Node 1 is leader at term 3. Node 1 crashes.

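Suppose N2's election timeout fires first. N2 increments its term to 4, votes for itself, and sends RequestVote to N3, N4, and N5. With votes from a majority (at least 3 of 5, counting its own), N2 is leader for term 4. Any later AppendEntries from N1 (if N1 recovers) will be rejected because N1's term 3 < current term 4; on seeing term 4 in the reply, N1 steps down to follower.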
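The election in this example can be traced with a short simulation. This is a sketch under stated assumptions: N2 is the candidate, all surviving logs are equally up to date (so the election restriction never blocks a vote), and no message is lost. `run_election` is a hypothetical helper, not part of any library.

```python
from typing import Dict, List, Tuple


def run_election(candidate: str, peers: List[str], terms: Dict[str, int],
                 cluster_size: int) -> Tuple[int, int, bool]:
    """Run one election round among reachable peers; returns (term, votes, won)."""
    terms[candidate] += 1            # candidate starts a new term
    term = terms[candidate]
    votes = 1                        # it votes for itself
    for peer in peers:
        if term > terms[peer]:
            terms[peer] = term       # peer adopts the higher term...
            votes += 1               # ...has not voted in it yet, so it grants the vote
    return term, votes, votes > cluster_size // 2


terms = {"N2": 3, "N3": 3, "N4": 3, "N5": 3}   # N1 (the old leader) has crashed
term, votes, won = run_election("N2", ["N3", "N4", "N5"], terms, cluster_size=5)
print(f"term={term} votes={votes} won={won}")   # term=4 votes=4 won=True

# If N1 recovers and sends AppendEntries with its old term, followers compare terms.
stale_term = 3
print("N1's AppendEntries rejected:", stale_term < terms["N3"])  # True
```

The majority is computed against all 5 members even though only 4 are alive, which is why 3 votes would also have been enough.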

Log Replication Snapshot

Now a client sends SET x = 5 to N2.

N2 log: [..., (6, t2, "...X..."), (7, t3, "..."), (8, t4, "SET x=5")]
N2 sends AppendEntries(term=4, prevLogIndex=7, prevLogTerm=3, entries=[(8, t4, "SET x=5")])
Followers N3, N4 append the entry to their logs (their logs match at index 7, term 3). They do not apply it to their state machines until it is committed.
N5 had crashed earlier and missed entry 7; rejects with "mismatch at prevLogIndex=7".
N2 backs up: sends AppendEntries with prevLogIndex=6, entries=[(7,t3,"..."), (8,t4,"SET x=5")].
N5 accepts.
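N2, N3, and N4 already form a majority (3 of 5), so N2 can commit index 8 as soon as N3 and N4 acknowledge, without waiting for N5 to catch up. The next heartbeat carries the new leaderCommit, and followers then apply the entry.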
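The back-up loop from this snapshot can be replayed in a few lines. The terms for entries 1-5 are placeholders invented for the illustration (the example above elides them); only entries 6-8 follow the text. `accepts` and `replicate` are hypothetical helpers, and the follower side is simplified to truncate and copy, which gives the same result in this synchronous setting.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def accepts(log: List[Entry], prev_index: int, prev_term: int) -> bool:
    """Follower consistency check with 1-based indices."""
    if prev_index == 0:
        return True
    return prev_index <= len(log) and log[prev_index - 1][0] == prev_term


def replicate(leader_log: List[Entry], follower_log: List[Entry]) -> int:
    """Leader retries AppendEntries, stepping back one index per rejection.

    Returns the number of attempts needed.
    """
    next_index = len(leader_log)  # first attempt sends only the newest entry
    attempts = 0
    while True:
        attempts += 1
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if accepts(follower_log, prev_index, prev_term):
            # Simplification: overwrite the follower's suffix with the leader's.
            del follower_log[prev_index:]
            follower_log.extend(leader_log[prev_index:])
            return attempts
        next_index -= 1


# Entries 1-5 are placeholder data; 6-8 match the example (t2, t3, t4).
n2 = [(1, "p1"), (1, "p2"), (1, "p3"), (2, "p4"), (2, "p5"),
      (2, "...X..."), (3, "..."), (4, "SET x=5")]
n3 = n2[:7]   # has everything up to index 7
n5 = n2[:6]   # missed entry 7

print(replicate(n2, n3))  # 1 attempt: prevLogIndex=7 matches
print(replicate(n2, n5))  # 2 attempts: reject at 7, accept at 6
print(n5 == n2)           # True
```

The loop always terminates, because in the worst case it backs up to the start of the log, where the check passes trivially.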

Common Confusion / Misconception

"Raft is leader-based, so the leader is a single point of failure." The leader is a single point of throughput, not of failure. If the leader dies, a new one is elected within an election timeout; committed data is preserved by the election restriction.

"Raft is simpler than Paxos, so it must be weaker." Raft and Multi-Paxos tolerate the same failure model. The simplicity is in the exposition and the operational model (explicit leader, contiguous log, clean membership change protocol), not in the safety properties.

"Raft guarantees exactly-once client semantics." It guarantees linearizable state-machine semantics. Client retry still requires idempotency (next cluster).

How To Use It

When you operate a Raft-backed system:

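Size clusters with 3 or 5 voting members (2f+1 for f=1 or f=2). Never 2: a single failure halts progress. Avoid 4 voters: that tolerates only one failure, the same as 3, so add any extra replica as a non-voting learner instead.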
Tune the election timeout to 10-20x your p99 network RTT plus GC budget.
Place members across failure domains (zones) but within one region for latency.
Monitor leader_changes_total - flapping leaders almost always indicate a tuning problem.
For membership changes (adding/removing a node), use the joint consensus or single-server change protocol - never hot-swap peers.
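The sizing advice follows from simple quorum arithmetic, sketched below. The function names are hypothetical, and `election_timeout_range_ms` encodes just one reading of the timeout guideline above (10x to 20x the p99 RTT, with the GC budget added on top); treat it as a starting point for tuning rather than a rule.

```python
from typing import Tuple


def quorum(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can fail while a quorum is still reachable."""
    return voters - quorum(voters)


def election_timeout_range_ms(p99_rtt_ms: float, gc_budget_ms: float) -> Tuple[float, float]:
    """One interpretation of '10-20x p99 RTT plus GC budget' (an assumption)."""
    return (10 * p99_rtt_ms + gc_budget_ms, 20 * p99_rtt_ms + gc_budget_ms)


for n in (2, 3, 4, 5):
    print(f"{n} voters: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 2 voters: quorum=2, tolerates 0 failure(s)
# 3 voters: quorum=2, tolerates 1 failure(s)
# 4 voters: quorum=3, tolerates 1 failure(s)
# 5 voters: quorum=3, tolerates 2 failure(s)

print(election_timeout_range_ms(p99_rtt_ms=5, gc_budget_ms=50))  # (100, 150)
```

The table makes the cost of even sizes visible: going from 3 to 4 voters raises the quorum without raising the number of failures tolerated.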

Check Yourself

Why does a candidate need a majority of votes, not just more votes than any other candidate?
What is the election restriction, and what would go wrong without it?
Why does the log-matching property let AppendEntries "back up one index" until it matches?
In a 5-node cluster, how many failures can Raft tolerate while still making progress?

Mini Drill or Application

Draw a 5-node Raft cluster. Apply this sequence:

N1 is leader at term 1, replicates entries 1-3 to all.
Network partitions: {N1, N2} vs {N3, N4, N5}.
N3 times out; elects itself leader at term 2; replicates entries 4-5 to N4, N5.
Partition heals.

Which entries are committed, which are overwritten, and who is leader afterwards? Work it step by step.
