Raft: Understandable Consensus with an Explicit Leader
What This Concept Is
Raft (Ongaro and Ousterhout, 2014, USENIX ATC) is a consensus algorithm designed explicitly for understandability. It is equivalent to Multi-Paxos in fault tolerance and performance, but it decomposes the problem into three sub-problems that can be reasoned about largely independently:
- Leader election: at most one leader is elected per term. A new election starts when a follower stops hearing from the leader and suspects it is dead.
- Log replication: the leader appends client commands to its log and replicates them to followers; entries become committed when a majority has stored them.
- Safety: invariants prevent a new leader from overwriting committed entries.
Roles and Terms
Every node is in one of three states: Follower (default, passive), Candidate (trying to become leader), Leader (active, handles client requests).
Time is divided into terms, numbered with monotonically increasing integers. Each term has at most one leader; some terms have no leader (a failed election). Every RPC carries the sender's term; any node that sees a higher term adopts it and immediately reverts to follower, and a request carrying a stale (lower) term is rejected.
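The term rule can be sketched in a few lines of Python. This is a minimal illustration, not code from any Raft library: Node, observe_term, and the field names are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    current_term: int = 0
    role: str = "follower"           # "follower", "candidate", or "leader"
    voted_for: Optional[str] = None  # vote granted in current_term, if any


def observe_term(node: Node, rpc_term: int) -> bool:
    """Apply the term rule to an incoming RPC or reply.

    Returns True if the message is current, False if it is stale.
    """
    if rpc_term > node.current_term:
        # A higher term exists: adopt it and fall back to follower.
        node.current_term = rpc_term
        node.role = "follower"
        node.voted_for = None
    # Messages from an older term are stale and should be rejected.
    return rpc_term >= node.current_term


old_leader = Node(current_term=3, role="leader")
observe_term(old_leader, 4)
print(old_leader.role, old_leader.current_term)  # follower 4
```

Running every incoming message through one such check before any other handling keeps the "higher term wins" rule in a single place.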
Leader Election
- Followers reset a randomized election timeout (typically 150-300ms) on each heartbeat.
- If no heartbeat arrives within the timeout, the follower becomes a Candidate: it increments its term, votes for itself, and sends RequestVote RPCs to all peers.
- Each node grants at most one vote per term. A candidate that collects votes from a majority becomes Leader and starts sending heartbeats.
- Ties (split vote) resolve via randomized timeouts in the next term.
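The voting rules above can be sketched as follows. This is a simplified illustration under stated assumptions: VoteBox, election_timeout_ms, and wins_election are hypothetical names, and the log up-to-date check that a real RequestVote handler also performs is deliberately omitted here.

```python
import random


def election_timeout_ms(low: float = 150, high: float = 300) -> float:
    """Pick a fresh randomized timeout; differing values make split votes unlikely to repeat."""
    return random.uniform(low, high)


class VoteBox:
    """Tracks the single vote a node may grant in each term."""

    def __init__(self) -> None:
        self.term = 0
        self.voted_for = None

    def request_vote(self, term: int, candidate_id: str) -> bool:
        if term < self.term:
            return False              # stale candidate
        if term > self.term:
            self.term = term          # new term: the vote becomes available again
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True               # first request, or a retry from the same candidate
        return False                  # already voted for someone else this term


def wins_election(votes: int, cluster_size: int) -> bool:
    """A candidate needs a strict majority of the whole cluster."""
    return votes > cluster_size // 2


box = VoteBox()
print(box.request_vote(4, "N2"))  # True
print(box.request_vote(4, "N3"))  # False
```

Granting the same candidate again on a retry is intentional: a lost reply should not cost the candidate a vote it was already given.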
Log Replication
- Leader appends the client command to its log at index i.
- Leader sends AppendEntries(term, prevLogIndex, prevLogTerm, entries[], leaderCommit) to every follower.
- Follower accepts only if prevLogIndex/prevLogTerm match its log (this is the consistency check). On a mismatch the follower rejects, and the leader steps back one index and retries.
- When a majority have stored entry i, and i was created in the leader's current term, the leader marks i committed and advances commitIndex (entries from earlier terms are committed indirectly once a current-term entry after them commits). The next heartbeat carries the new leaderCommit; followers apply entries up to that index to their state machines.
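The follower side of the consistency check can be sketched like this. It is a minimal in-memory illustration, assuming the log is a Python list of (term, command) tuples with Raft index i stored at position i-1; append_entries is a hypothetical helper, and term checks, commitIndex handling, and persistence are left out.

```python
def append_entries(log, prev_log_index, prev_log_term, entries):
    """Follower-side handling of the log portion of AppendEntries.

    Returns True if the entries were accepted, False on a consistency mismatch.
    """
    # Consistency check: the follower must already hold the entry that
    # immediately precedes the new ones, with the same term.
    if prev_log_index > 0:
        if prev_log_index > len(log):
            return False                      # follower's log is too short
        if log[prev_log_index - 1][0] != prev_log_term:
            return False                      # same index, different term

    for offset, entry in enumerate(entries):
        pos = prev_log_index + offset         # 0-based position of this entry
        if pos < len(log):
            if log[pos][0] != entry[0]:
                del log[pos:]                 # conflict: drop it and everything after
                log.append(entry)
            # else: entry already present, nothing to do
        else:
            log.append(entry)
    return True


log = [(1, "a"), (1, "b"), (1, "x")]
print(append_entries(log, 2, 1, [(2, "c")]))  # True
print(log)                                    # [(1, 'a'), (1, 'b'), (2, 'c')]
```

Truncating only on an actual term conflict, rather than on every call, means a delayed or duplicated AppendEntries cannot erase entries the follower has already stored correctly.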
Safety Invariants
- Election restriction: a node votes for a candidate only if the candidate's log is at least as up-to-date (last entry's term, then last index) as the voter's. Ensures the elected leader has every committed entry.
- Leader append-only: a leader never deletes or overwrites entries in its own log.
- Log matching: if two logs have an entry with the same index and term, their prefixes are identical.
- Leader completeness: if an entry is committed in term T, it is present in every leader's log for terms > T.
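The "at least as up-to-date" comparison in the election restriction is a lexicographic comparison of (last term, last index). A minimal sketch, with is_up_to_date as a hypothetical helper name:

```python
def is_up_to_date(cand_last_term, cand_last_index, voter_last_term, voter_last_index):
    """True if the candidate's log is at least as up-to-date as the voter's.

    Compare the term of the last entry first; only on a tie compare log length.
    """
    return (cand_last_term, cand_last_index) >= (voter_last_term, voter_last_index)


print(is_up_to_date(4, 1, 3, 100))  # True: a higher last term beats a longer log
print(is_up_to_date(3, 7, 3, 8))    # False: same last term, shorter log
```

A voter would grant its vote only when this returns True and it has not already voted for someone else in the term.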
Why It Matters Here
Raft is the default you will encounter in production:
- etcd (Kubernetes control plane, CoreOS), Consul (HashiCorp), TiKV, CockroachDB, MongoDB replica sets (variant), Redis Raft, and HashiCorp's raft library all use Raft.
- It is the algorithm you should be able to trace on paper. If you cannot, you cannot debug Kubernetes, etcd, Consul, or any coordination-backed system when it misbehaves.
Concrete Example: A Full Election
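5 nodes. Node 1 is leader at term 3. Node 1 crashes.
Suppose N2's election timeout fires first: it increments its term to 4, votes for itself, and collects votes from a majority.
N2 is leader for term 4. Any later AppendEntries from N1 (if N1 recovers) will be rejected because N1's term 3 < current term 4; on seeing term 4 in the reply, N1 steps down to follower.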
Log Replication Snapshot
Now a client sends SET x = 5 to N2.
N2 log: [..., (6, t2, "...X..."), (7, t3, "..."), (8, t4, "SET x=5")]
N2 sends AppendEntries(term=4, prevLogIndex=7, prevLogTerm=3, entries=[(8, t4, "SET x=5")])
Followers N3, N4 append the entry (their logs match at index 7 term 3); they apply it to their state machines only after it commits.
N5 had crashed earlier and missed entry 7; rejects with "mismatch at prevLogIndex=7".
N2 backs up: sends AppendEntries with prevLogIndex=6, entries=[(7,t3,"..."), (8,t4,"SET x=5")].
N5 accepts.
Majority (N2, N3, N4, N5) have entry 8. N2 commits index 8; next heartbeat tells followers.
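The back-up step in this trace can be replayed with a short simulation. It is a sketch under stated assumptions: entries 1-5 are given term 1 purely for illustration (the trace elides them), commands other than those shown stay as "...", follower_accepts and replicate are hypothetical helpers, and the follower truncates unconditionally, which is only safe here because messages are delivered in order.

```python
def follower_accepts(log, prev_index, prev_term, entries):
    """Consistency check plus append; log[i-1] holds Raft index i as (term, command)."""
    if prev_index > 0:
        if prev_index > len(log) or log[prev_index - 1][0] != prev_term:
            return False
    del log[prev_index:]      # simplification: replace the whole suffix
    log.extend(entries)
    return True


def replicate(leader_log, follower_log):
    """Leader retries AppendEntries, stepping back one index per rejection."""
    next_index = max(len(leader_log), 1)   # start by sending only the newest entry
    attempts = []
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        ok = follower_accepts(follower_log, prev_index, prev_term, leader_log[prev_index:])
        attempts.append((prev_index, ok))
        if ok:
            return attempts
        next_index -= 1


# Terms for indices 1-5 are assumed; indices 6-8 follow the trace above.
n2_log = [(1, "...")] * 5 + [(2, "...X..."), (3, "..."), (4, "SET x=5")]
n5_log = list(n2_log[:6])     # N5 missed entry 7 (and therefore 8)

attempts = replicate(n2_log, n5_log)
print(attempts)               # [(7, False), (6, True)]
print(n5_log == n2_log)       # True
```

The loop always terminates because prevLogIndex = 0 (an empty prefix) matches every log.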
Common Confusion / Misconception
"Raft is leader-based, so the leader is a single point of failure." The leader is a single point of throughput, not of failure. If the leader dies, a new one is elected within an election timeout; committed data is preserved by the election restriction.
"Raft is simpler than Paxos, so it must be weaker." Raft and Multi-Paxos tolerate the same failure model. The simplicity is in the exposition and the operational model (explicit leader, contiguous log, clean membership change protocol), not in the safety properties.
"Raft guarantees exactly-once client semantics." It guarantees linearizable state-machine semantics. Client retry still requires idempotency (next cluster).
How To Use It
When you operate a Raft-backed system:
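- Size clusters with 3 or 5 voting members (2f+1 for f=1 or f=2). Avoid 2 or 4 voters: an even count raises the quorum size without tolerating any additional failure. If you need an extra copy of the data, add it as a non-voting learner.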
- Tune the election timeout to 10-20x your p99 network RTT plus GC budget.
- Place members across failure domains (zones) but within one region for latency.
- Monitor leader_changes_total: flapping leaders almost always indicate a tuning problem.
- For membership changes (adding/removing a node), use the joint consensus or single-server change protocol; never hot-swap peers.
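The sizing and timeout rules of thumb above reduce to simple arithmetic. The sketch below uses hypothetical helper names, and election_timeout_range_ms encodes one reading of the timeout guideline (a multiple of p99 RTT, with the GC budget added on top); treat the numbers as a starting point to validate against your own measurements.

```python
def quorum(voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can fail while a majority remains."""
    return (voters - 1) // 2


def election_timeout_range_ms(p99_rtt_ms, gc_budget_ms, low_factor=10, high_factor=20):
    """One reading of the guideline: (10..20) x p99 RTT, plus the GC pause budget."""
    return (low_factor * p99_rtt_ms + gc_budget_ms,
            high_factor * p99_rtt_ms + gc_budget_ms)


for n in (2, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))

print(election_timeout_range_ms(p99_rtt_ms=5, gc_budget_ms=50))  # (100, 150)
```

The loop makes the even-size penalty visible: 4 voters need a quorum of 3 yet tolerate only 1 failure, the same as 3 voters.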
Check Yourself
- Why does a candidate need a majority of votes, not just more votes than any other candidate?
- What is the election restriction, and what would go wrong without it?
- Why does the log-matching property let AppendEntries "back up one index" until it matches?
- In a 5-node cluster, how many failures can Raft tolerate while still making progress?
Mini Drill or Application
Draw a 5-node Raft cluster. Apply this sequence:
- N1 is leader at term 1, replicates entries 1-3 to all.
- Network partitions: {N1, N2} vs {N3, N4, N5}.
- N3 times out, wins the election for term 2 with votes from N4 and N5, and replicates entries 4-5 to them.
- Partition heals.
Which entries are committed, which are overwritten, and who is leader afterwards? Work it step by step.
Read This Only If Stuck
- Database Internals: Raft
- Database Internals: Leader Role in Raft
- Database Internals: Zookeeper Atomic Broadcast (ZAB)
- DDIA: Fault-Tolerant Consensus (Part 2)
- Ongaro & Ousterhout: In Search of an Understandable Consensus Algorithm (Raft paper, extended)
- raft.github.io: visualization and animated walkthrough (official)
- Diego Ongaro's dissertation: Consensus: Bridging Theory and Practice