Build Your Own Consensus (Raft)
"Paxos is hard. Raft is hard too, but it's hard in places where you can debug it." — Diego Ongaro (paraphrased)
Implementing Raft is the most challenging distributed-systems project you can take on at undergraduate level. It is also the one that pays back the most. After implementing Raft, every distributed system you read about — etcd, Consul, CockroachDB, TiDB, Kafka KRaft, MongoDB — uses fundamentally the same vocabulary you now own.
1. Overview & motivation
Raft solves distributed consensus: getting N nodes to agree on a sequence of values (a replicated log), in the presence of network partitions, message loss, and node crashes — as long as a majority of nodes are alive and connected.
Raft has three subproblems:
- Leader election — at any given time, exactly one node is leader. Others are followers.
- Log replication — leader receives client commands, appends to its log, replicates to followers.
- Safety — once a command is committed (replicated to a majority), it is never lost; once a command is applied to the state machine, no different command will be applied at the same log position.
What you can only learn by building one:
- Why majority quorum is the right tradeoff (
f+1of2f+1survivesffailures). - Why log indexing and term numbers together prevent inconsistency.
- Why leader election timeouts must be randomized — without it, livelock.
- Why the corner cases of Raft are surprising — committing entries from previous terms, the
nextIndex/matchIndexdance, configuration changes. - Why every distributed-systems engineer treats Lamport's "Paxos Made Live" and Ongaro's Raft paper as required reading.
2. Where this fits in the degree
- Phase: Architecture
- Semester: 6 (Databases and Distributed Systems)
- Modules deepened: Module 5 (distributed fundamentals) — this is the canonical distributed systems project.
Cross-phase relevance:
- Direct upgrade path from the Kafka-like tutorial — leader election is what Raft does.
- Contrast with the Blockchain tutorial — Nakamoto consensus vs Raft are different consensus families.
3. Prerequisites
- Strong systems language: Go, Rust, or Java.
- Complete the Kafka-like tutorial first — you'll deeply appreciate why leader election matters.
- Comfort with state machines.
- Patience. Raft is not a weekend project.
4. Theory & research
Required reading
- Diego Ongaro & John Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)" (raft.github.io/raft.pdf) — the Raft paper. 18 pages. Print it. Read it three times. ⭐ the canonical source.
- Diego Ongaro, "Consensus: Bridging Theory and Practice" (PhD thesis, 2014) — 250 pages. Deeper than the paper. Section 4 covers all the corner cases the paper glosses.
Strongly recommended
- Raft visualization (raft.github.io) — interactive web demo. Use it constantly.
- The Secret Lives of Data: Raft (thesecretlivesofdata.com/raft/) — animated walkthrough.
- MIT 6.5840 (formerly 6.824) Distributed Systems (pdos.csail.mit.edu/6.5840/) — labs include a graded Raft implementation. The single best teaching path. Free recordings on YouTube.
- Eli Bendersky, "Implementing Raft" series (eli.thegreenplace.net/2020/implementing-raft-part-1-elections/) — 5-part Go tutorial. ⭐ recommended primary.
For context (Paxos and related)
- Leslie Lamport, "The Part-Time Parliament" (1998) — the original Paxos paper. Famous for being inscrutable.
- Lamport, "Paxos Made Simple" (2001) — Lamport's own re-explanation. Still tough.
- Chandra, Griesemer, Redstone, "Paxos Made Live" (2007) — Google's actual experience implementing Paxos. Cured the field of "Paxos is simple" claims.
Production Raft implementations
- etcd's Raft (github.com/etcd-io/raft) — extracted as a library. Used by Kubernetes, CockroachDB, etc.
- Hashicorp Raft (github.com/hashicorp/raft) — used by Consul, Vault, Nomad.
5. Curated tutorial list (from BYO-X)
The BYO-X catalog does not have a dedicated "Consensus" or "Raft" category. Coverage shows up across categories. The recommendations here come from the broader Raft community.
Canonical references
- Eli Bendersky, "Implementing Raft" — 5 parts: elections, command processing, persistence and optimizations, key-value DB, exactly-once delivery. ⭐ recommended primary.
- MIT 6.5840 labs 2A–2D — the graded Raft labs. The test suite is the gold standard for verifying a Raft implementation.
- Jon Gjengset, "Demystifying Consensus" (video lectures + Raft impl in Rust).
- Phil Eaton, "Implementing the Raft distributed consensus protocol in Go" (notes.eatonphil.com/2023-05-25-raft.html).
6. Recommended primary path
Eli Bendersky's "Implementing Raft" series (Go), supplemented by the Raft paper.
Five parts:
- Elections — leader election only.
- Command processing — log replication, simple consensus.
- Persistence and optimizations — restart safely, faster recovery.
- KV DB — apply Raft to build a replicated KV store.
- Exactly-once delivery — handle duplicate client requests after leader failover.
After each part, your implementation should pass the corresponding behaviors.
If you want graded feedback: do MIT 6.5840 labs 2A–2D. Their test suite is brutal and finds bugs in ways your own tests won't.
For Rust: Jon Gjengset's lecture series is well-paced.
7. Implementation milestones
Milestone 1: Leader election (Raft paper §5.2)
Every node is in one of three states: Follower, Candidate, Leader. A node times out → becomes Candidate → requests votes → if majority votes → becomes Leader.
type RaftState int
const (
Follower RaftState = iota
Candidate
Leader
)
type ConsensusModule struct {
id int
peerIds []int
currentTerm int
votedFor int
state RaftState
log []LogEntry
electionResetEvent time.Time
// ...
}
func (cm *ConsensusModule) runElectionTimer() {
timeoutDuration := cm.electionTimeout() // 150-300ms randomized
cm.mu.Lock()
termStarted := cm.currentTerm
cm.mu.Unlock()
ticker := time.NewTicker(10 * time.Millisecond)
defer ticker.Stop()
for {
<-ticker.C
cm.mu.Lock()
if cm.state != Candidate && cm.state != Follower {
cm.mu.Unlock()
return
}
if termStarted != cm.currentTerm {
cm.mu.Unlock()
return
}
if elapsed := time.Since(cm.electionResetEvent); elapsed >= timeoutDuration {
cm.startElection()
cm.mu.Unlock()
return
}
cm.mu.Unlock()
}
}
RequestVote RPC — peer asks for a vote. Grant if voter hasn't voted this term and candidate's log is at least as up-to-date as voter's.
Evidence: Bring up 3 nodes. Within seconds, one is elected leader. Kill the leader; another is elected within seconds.
Milestone 2: Log replication (Raft paper §5.3)
Leader receives client commands, appends to its log, sends AppendEntries to followers with the new entries. Followers append if consistent with their existing log; reject otherwise.
type AppendEntriesArgs struct {
Term int
LeaderId int
PrevLogIndex int
PrevLogTerm int
Entries []LogEntry
LeaderCommit int
}
Crucial consistency check: a follower accepts entries only if its log already has an entry at PrevLogIndex with term PrevLogTerm. If not, reject. Leader will retry with an earlier PrevLogIndex.
Evidence: Submit a command to the leader; all 3 nodes' logs converge to contain it.
Milestone 3: Commit and apply
Once an entry is replicated to a majority of nodes, leader marks it committed. Leaders track commitIndex; followers learn it via AppendEntries's LeaderCommit field. Each node applies committed entries to its local state machine.
Crucial: a leader may only commit entries from its own term (Raft paper §5.4.2). This is the most subtle correctness rule.
Evidence: A command is acked to the client only after a majority has it. Killing the leader does not lose acked commands.
Milestone 4: Persistence
Persist currentTerm, votedFor, and log to disk before responding to any RPC. Without this, a node that crashes and restarts could vote twice in the same term — breaking safety.
// Before responding to any RPC, persist state.
func (cm *ConsensusModule) persistToStorage() {
var termData bytes.Buffer
if err := gob.NewEncoder(&termData).Encode(cm.currentTerm); err != nil { log.Fatal(err) }
cm.storage.Set("currentTerm", termData.Bytes())
// ...same for votedFor and log
}
Evidence: Kill all 3 nodes mid-write. Restart. Log is consistent.
Milestone 5: Apply to a state machine (KV store)
Wire up a key-value store as the replicated state machine. Every committed Raft entry is {op: put|get|delete, key, value} applied to a local hashmap on every node.
Now you have a replicated KV store that survives node failures. This is what etcd and Consul do.
Evidence: From any node, put(foo, bar), then get(foo) from any other node returns bar. Kill a node; reads still work from the others.
Milestone 6: Linearizable client semantics
After a leader fails over, a client's in-flight request may have been committed but never acknowledged. The client retries. Without care, the command is applied twice.
Solution: each client request has a unique ID. The leader records applied IDs. Duplicate requests return the cached result instead of re-applying.
This is what Bendersky's Part 5 covers. Exactly-once semantics.
Evidence: A client retries during a leader failover; the command is applied exactly once.
Milestone 7 (optional, hard): Log compaction / snapshots
Logs grow forever. Eventually you must snapshot the state machine and discard old log entries. Followers that fall too far behind receive a snapshot, not historical entries.
Milestone 8 (optional, hard): Membership changes
Adding or removing a node from the cluster without downtime. Joint consensus (Raft paper §6) is the textbook approach.
8. Tests & evidence
| Test | How |
|---|---|
| Single-node | Trivially handles its own consensus |
| 3-node steady state | Commands replicate; offsets advance |
| Leader failure | Failover within ~2× election timeout; no data loss |
| Follower failure | Cluster continues; failed node recovers state on rejoin |
| Network partition | Minority side can't make progress; majority side proceeds; on reunification, minority's recent log truncated |
| Persistence | Kill all nodes; restart; state recovers |
| Stress | Continuous writes during random kills; final state consistent across all alive nodes |
| MIT 6.5840 tests | Pass labs 2A, 2B, 2C, 2D |
The single strongest evidence: passing MIT 6.5840's Raft test suite. If your code passes 2A through 2D, it's a real Raft implementation. The test suite catches subtle bugs that hand-rolled tests miss.
9. Common pitfalls
- Forgetting persistence before responding. Voting twice in the same term across a restart violates safety.
- Committing entries from previous terms. Raft §5.4.2. The single subtlest correctness rule. Easy to get wrong.
- Election timeouts not randomized. With identical timeouts, candidates split votes forever (livelock).
nextIndexregression. When AppendEntries is rejected, decrementnextIndexand retry. The leader must eventually find the agreement point.- Goroutine leaks. Election timers, replication loops — each node has many concurrent loops. Easy to leak.
- Holding locks during RPC calls. Will deadlock if the peer is also calling you back. Drop the lock before sending RPCs.
- Confusing leader's
commitIndexwith applied state. The leader knows what's committed; the state machine knows what's applied. Separate. - Reading stale data from followers. Reads must go through the leader or use a "read index" protocol.
- Trying to optimize before correctness. Bendersky's Part 3 covers optimizations. Don't add them in Part 1.
10. Extensions
- Read leases for leaders. Avoid an AppendEntries round-trip per read.
- Batching. Coalesce multiple commands into one AppendEntries.
- Pipelining. Send AppendEntries N+1 before N's reply arrives.
- Snapshots. Already noted. Required for production.
- Membership changes. Already noted. Required for production.
- Pre-vote. Prevents a partitioned node from disrupting the cluster on rejoin.
A "production-ready" Raft is much more work than a "passes the tests" Raft. etcd's Raft library has been refined for a decade.
11. Module integration
| Module | What Raft deepens |
|---|---|
| Sem 6 Module 3 — Replication & partitioning | Raft replicates a single log; sharding partitions data across multiple Rafts. |
| Sem 6 Module 4 — Transactions & consistency | Raft provides linearizable consistency. Foundation for distributed transactions. |
| Sem 6 Module 5 — Distributed fundamentals | Raft is the canonical project for this module. |
| Kafka-like tutorial | Leader election in Kafka is now done via Raft (KRaft). |
| Database (KV) tutorial | Apply Raft on top → replicated KV store (etcd's design). |
| Blockchain tutorial | Nakamoto consensus vs Raft — different families, illuminating contrast. |
12. Portfolio framing
What to publish:
- Source organized as
raft/{state,rpc,storage,timer}.go,kv/,client/. - A test report with results from MIT 6.5840 labs or equivalent.
- A README with:
- State diagram (Follower / Candidate / Leader).
- A list of corner cases handled.
- The cluster benchmarks (commit latency, throughput).
Reviewer entry points:
raft/state.go— the state machine.raft/election.go— leader election.raft/replication.go— log replication.tests/lab_2c.log— the persistence test results.- README must include: state diagram, list of papers consulted, honest scope statement.
A working Raft is one of the most distinguishing portfolio pieces in distributed systems. It is also notoriously a place where people say they've implemented Raft but actually haven't gotten the corner cases right. Passing MIT 6.5840's tests is the credibility marker.
Source
This tutorial draws on the BYO-X catalog (distributed-systems entries) and the broader Raft literature. The Ongaro/Ousterhout paper is the canonical primary source; Eli Bendersky's 5-part series and MIT 6.5840 are the canonical primary tutorial paths.