Skip to main content

Build Your Own Consensus (Raft)

"Paxos is hard. Raft is hard too, but it's hard in places where you can debug it." — Diego Ongaro (paraphrased)

Implementing Raft is the most challenging distributed-systems project you can take on at undergraduate level. It is also the one that pays back the most. After implementing Raft, every distributed system you read about — etcd, Consul, CockroachDB, TiDB, Kafka KRaft, MongoDB — uses fundamentally the same vocabulary you now own.


1. Overview & motivation

Raft solves distributed consensus: getting N nodes to agree on a sequence of values (a replicated log), in the presence of network partitions, message loss, and node crashes — as long as a majority of nodes are alive and connected.

Raft has three subproblems:

  1. Leader election — at any given time, exactly one node is leader. Others are followers.
  2. Log replication — leader receives client commands, appends to its log, replicates to followers.
  3. Safety — once a command is committed (replicated to a majority), it is never lost; once a command is applied to the state machine, no different command will be applied at the same log position.

What you can only learn by building one:

  • Why majority quorum is the right tradeoff (f+1 of 2f+1 survives f failures).
  • Why log indexing and term numbers together prevent inconsistency.
  • Why leader election timeouts must be randomized — without it, livelock.
  • Why the corner cases of Raft are surprising — committing entries from previous terms, the nextIndex/matchIndex dance, configuration changes.
  • Why every distributed-systems engineer treats Lamport's "Paxos Made Live" and Ongaro's Raft paper as required reading.

2. Where this fits in the degree

  • Phase: Architecture
  • Semester: 6 (Databases and Distributed Systems)
  • Modules deepened: Module 5 (distributed fundamentals) — this is the canonical distributed systems project.

Cross-phase relevance:

  • Direct upgrade path from the Kafka-like tutorial — leader election is what Raft does.
  • Contrast with the Blockchain tutorial — Nakamoto consensus vs Raft are different consensus families.

3. Prerequisites

  • Strong systems language: Go, Rust, or Java.
  • Complete the Kafka-like tutorial first — you'll deeply appreciate why leader election matters.
  • Comfort with state machines.
  • Patience. Raft is not a weekend project.

4. Theory & research

Required reading

  • Diego Ongaro & John Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)" (raft.github.io/raft.pdf) — the Raft paper. 18 pages. Print it. Read it three times. ⭐ the canonical source.
  • Diego Ongaro, "Consensus: Bridging Theory and Practice" (PhD thesis, 2014) — 250 pages. Deeper than the paper. Section 4 covers all the corner cases the paper glosses.
  • Leslie Lamport, "The Part-Time Parliament" (1998) — the original Paxos paper. Famous for being inscrutable.
  • Lamport, "Paxos Made Simple" (2001) — Lamport's own re-explanation. Still tough.
  • Chandra, Griesemer, Redstone, "Paxos Made Live" (2007) — Google's actual experience implementing Paxos. Cured the field of "Paxos is simple" claims.

Production Raft implementations


5. Curated tutorial list (from BYO-X)

The BYO-X catalog does not have a dedicated "Consensus" or "Raft" category. Coverage shows up across categories. The recommendations here come from the broader Raft community.

Canonical references

  • Eli Bendersky, "Implementing Raft" — 5 parts: elections, command processing, persistence and optimizations, key-value DB, exactly-once delivery. ⭐ recommended primary.
  • MIT 6.5840 labs 2A–2D — the graded Raft labs. The test suite is the gold standard for verifying a Raft implementation.
  • Jon Gjengset, "Demystifying Consensus" (video lectures + Raft impl in Rust).
  • Phil Eaton, "Implementing the Raft distributed consensus protocol in Go" (notes.eatonphil.com/2023-05-25-raft.html).

Eli Bendersky's "Implementing Raft" series (Go), supplemented by the Raft paper.

Five parts:

  1. Elections — leader election only.
  2. Command processing — log replication, simple consensus.
  3. Persistence and optimizations — restart safely, faster recovery.
  4. KV DB — apply Raft to build a replicated KV store.
  5. Exactly-once delivery — handle duplicate client requests after leader failover.

After each part, your implementation should pass the corresponding behaviors.

If you want graded feedback: do MIT 6.5840 labs 2A–2D. Their test suite is brutal and finds bugs in ways your own tests won't.

For Rust: Jon Gjengset's lecture series is well-paced.


7. Implementation milestones

Milestone 1: Leader election (Raft paper §5.2)

Every node is in one of three states: Follower, Candidate, Leader. A node times out → becomes Candidate → requests votes → if majority votes → becomes Leader.

type RaftState int
const (
Follower RaftState = iota
Candidate
Leader
)

type ConsensusModule struct {
id int
peerIds []int
currentTerm int
votedFor int
state RaftState
log []LogEntry

electionResetEvent time.Time
// ...
}

func (cm *ConsensusModule) runElectionTimer() {
timeoutDuration := cm.electionTimeout() // 150-300ms randomized
cm.mu.Lock()
termStarted := cm.currentTerm
cm.mu.Unlock()

ticker := time.NewTicker(10 * time.Millisecond)
defer ticker.Stop()
for {
<-ticker.C
cm.mu.Lock()
if cm.state != Candidate && cm.state != Follower {
cm.mu.Unlock()
return
}
if termStarted != cm.currentTerm {
cm.mu.Unlock()
return
}
if elapsed := time.Since(cm.electionResetEvent); elapsed >= timeoutDuration {
cm.startElection()
cm.mu.Unlock()
return
}
cm.mu.Unlock()
}
}

RequestVote RPC — peer asks for a vote. Grant if voter hasn't voted this term and candidate's log is at least as up-to-date as voter's.

Evidence: Bring up 3 nodes. Within seconds, one is elected leader. Kill the leader; another is elected within seconds.

Milestone 2: Log replication (Raft paper §5.3)

Leader receives client commands, appends to its log, sends AppendEntries to followers with the new entries. Followers append if consistent with their existing log; reject otherwise.

type AppendEntriesArgs struct {
Term int
LeaderId int
PrevLogIndex int
PrevLogTerm int
Entries []LogEntry
LeaderCommit int
}

Crucial consistency check: a follower accepts entries only if its log already has an entry at PrevLogIndex with term PrevLogTerm. If not, reject. Leader will retry with an earlier PrevLogIndex.

Evidence: Submit a command to the leader; all 3 nodes' logs converge to contain it.

Milestone 3: Commit and apply

Once an entry is replicated to a majority of nodes, leader marks it committed. Leaders track commitIndex; followers learn it via AppendEntries's LeaderCommit field. Each node applies committed entries to its local state machine.

Crucial: a leader may only commit entries from its own term (Raft paper §5.4.2). This is the most subtle correctness rule.

Evidence: A command is acked to the client only after a majority has it. Killing the leader does not lose acked commands.

Milestone 4: Persistence

Persist currentTerm, votedFor, and log to disk before responding to any RPC. Without this, a node that crashes and restarts could vote twice in the same term — breaking safety.

// Before responding to any RPC, persist state.
func (cm *ConsensusModule) persistToStorage() {
var termData bytes.Buffer
if err := gob.NewEncoder(&termData).Encode(cm.currentTerm); err != nil { log.Fatal(err) }
cm.storage.Set("currentTerm", termData.Bytes())
// ...same for votedFor and log
}

Evidence: Kill all 3 nodes mid-write. Restart. Log is consistent.

Milestone 5: Apply to a state machine (KV store)

Wire up a key-value store as the replicated state machine. Every committed Raft entry is {op: put|get|delete, key, value} applied to a local hashmap on every node.

Now you have a replicated KV store that survives node failures. This is what etcd and Consul do.

Evidence: From any node, put(foo, bar), then get(foo) from any other node returns bar. Kill a node; reads still work from the others.

Milestone 6: Linearizable client semantics

After a leader fails over, a client's in-flight request may have been committed but never acknowledged. The client retries. Without care, the command is applied twice.

Solution: each client request has a unique ID. The leader records applied IDs. Duplicate requests return the cached result instead of re-applying.

This is what Bendersky's Part 5 covers. Exactly-once semantics.

Evidence: A client retries during a leader failover; the command is applied exactly once.

Milestone 7 (optional, hard): Log compaction / snapshots

Logs grow forever. Eventually you must snapshot the state machine and discard old log entries. Followers that fall too far behind receive a snapshot, not historical entries.

Milestone 8 (optional, hard): Membership changes

Adding or removing a node from the cluster without downtime. Joint consensus (Raft paper §6) is the textbook approach.


8. Tests & evidence

TestHow
Single-nodeTrivially handles its own consensus
3-node steady stateCommands replicate; offsets advance
Leader failureFailover within ~2× election timeout; no data loss
Follower failureCluster continues; failed node recovers state on rejoin
Network partitionMinority side can't make progress; majority side proceeds; on reunification, minority's recent log truncated
PersistenceKill all nodes; restart; state recovers
StressContinuous writes during random kills; final state consistent across all alive nodes
MIT 6.5840 testsPass labs 2A, 2B, 2C, 2D

The single strongest evidence: passing MIT 6.5840's Raft test suite. If your code passes 2A through 2D, it's a real Raft implementation. The test suite catches subtle bugs that hand-rolled tests miss.


9. Common pitfalls

  • Forgetting persistence before responding. Voting twice in the same term across a restart violates safety.
  • Committing entries from previous terms. Raft §5.4.2. The single subtlest correctness rule. Easy to get wrong.
  • Election timeouts not randomized. With identical timeouts, candidates split votes forever (livelock).
  • nextIndex regression. When AppendEntries is rejected, decrement nextIndex and retry. The leader must eventually find the agreement point.
  • Goroutine leaks. Election timers, replication loops — each node has many concurrent loops. Easy to leak.
  • Holding locks during RPC calls. Will deadlock if the peer is also calling you back. Drop the lock before sending RPCs.
  • Confusing leader's commitIndex with applied state. The leader knows what's committed; the state machine knows what's applied. Separate.
  • Reading stale data from followers. Reads must go through the leader or use a "read index" protocol.
  • Trying to optimize before correctness. Bendersky's Part 3 covers optimizations. Don't add them in Part 1.

10. Extensions

  • Read leases for leaders. Avoid an AppendEntries round-trip per read.
  • Batching. Coalesce multiple commands into one AppendEntries.
  • Pipelining. Send AppendEntries N+1 before N's reply arrives.
  • Snapshots. Already noted. Required for production.
  • Membership changes. Already noted. Required for production.
  • Pre-vote. Prevents a partitioned node from disrupting the cluster on rejoin.

A "production-ready" Raft is much more work than a "passes the tests" Raft. etcd's Raft library has been refined for a decade.


11. Module integration

ModuleWhat Raft deepens
Sem 6 Module 3 — Replication & partitioningRaft replicates a single log; sharding partitions data across multiple Rafts.
Sem 6 Module 4 — Transactions & consistencyRaft provides linearizable consistency. Foundation for distributed transactions.
Sem 6 Module 5 — Distributed fundamentalsRaft is the canonical project for this module.
Kafka-like tutorialLeader election in Kafka is now done via Raft (KRaft).
Database (KV) tutorialApply Raft on top → replicated KV store (etcd's design).
Blockchain tutorialNakamoto consensus vs Raft — different families, illuminating contrast.

12. Portfolio framing

What to publish:

  • Source organized as raft/{state,rpc,storage,timer}.go, kv/, client/.
  • A test report with results from MIT 6.5840 labs or equivalent.
  • A README with:
    • State diagram (Follower / Candidate / Leader).
    • A list of corner cases handled.
    • The cluster benchmarks (commit latency, throughput).

Reviewer entry points:

  • raft/state.go — the state machine.
  • raft/election.go — leader election.
  • raft/replication.go — log replication.
  • tests/lab_2c.log — the persistence test results.
  • README must include: state diagram, list of papers consulted, honest scope statement.

A working Raft is one of the most distinguishing portfolio pieces in distributed systems. It is also notoriously a place where people say they've implemented Raft but actually haven't gotten the corner cases right. Passing MIT 6.5840's tests is the credibility marker.


Source

This tutorial draws on the BYO-X catalog (distributed-systems entries) and the broader Raft literature. The Ongaro/Ousterhout paper is the canonical primary source; Eli Bendersky's 5-part series and MIT 6.5840 are the canonical primary tutorial paths.