Build Your Own Consensus (Raft)

"Paxos is hard. Raft is hard too, but it's hard in places where you can debug it." — Diego Ongaro (paraphrased)

Implementing Raft is the most challenging distributed-systems project you can take on at undergraduate level. It is also the one that pays back the most. After implementing Raft, every distributed system you read about — etcd, Consul, CockroachDB, TiDB, Kafka KRaft, MongoDB — uses fundamentally the same vocabulary you now own.

1. Overview & motivation

Raft solves distributed consensus: getting N nodes to agree on a sequence of values (a replicated log), in the presence of network partitions, message loss, and node crashes — as long as a majority of nodes are alive and connected.

Raft has three subproblems:

Leader election — at any given time, exactly one node is leader. Others are followers.
Log replication — leader receives client commands, appends to its log, replicates to followers.
Safety — once a command is committed (replicated to a majority), it is never lost; once a command is applied to the state machine, no different command will be applied at the same log position.

What you can only learn by building one:

Why majority quorum is the right tradeoff (f+1 of 2f+1 survives f failures).
Why log indexing and term numbers together prevent inconsistency.
Why leader election timeouts must be randomized — without it, livelock.
Why the corner cases of Raft are surprising — committing entries from previous terms, the nextIndex/matchIndex dance, configuration changes.
Why every distributed-systems engineer treats Lamport's "Paxos Made Live" and Ongaro's Raft paper as required reading.

2. Where this fits in the degree

Phase: Architecture
Semester: 6 (Databases and Distributed Systems)
Modules deepened: Module 5 (distributed fundamentals) — this is the canonical distributed systems project.

Cross-phase relevance:

Direct upgrade path from the Kafka-like tutorial — leader election is what Raft does.
Contrast with the Blockchain tutorial — Nakamoto consensus vs Raft are different consensus families.

3. Prerequisites

Strong systems language: Go, Rust, or Java.
Complete the Kafka-like tutorial first — you'll deeply appreciate why leader election matters.
Comfort with state machines.
Patience. Raft is not a weekend project.

4. Theory & research

Required reading

Diego Ongaro & John Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)" (raft.github.io/raft.pdf) — the Raft paper. 18 pages. Print it. Read it three times. ⭐ the canonical source.
Diego Ongaro, "Consensus: Bridging Theory and Practice" (PhD thesis, 2014) — 250 pages. Deeper than the paper. Section 4 covers all the corner cases the paper glosses.

Strongly recommended

Raft visualization (raft.github.io) — interactive web demo. Use it constantly.
The Secret Lives of Data: Raft (thesecretlivesofdata.com/raft/) — animated walkthrough.
MIT 6.5840 (formerly 6.824) Distributed Systems (pdos.csail.mit.edu/6.5840/) — labs include a graded Raft implementation. The single best teaching path. Free recordings on YouTube.
Eli Bendersky, "Implementing Raft" series (eli.thegreenplace.net/2020/implementing-raft-part-1-elections/) — 5-part Go tutorial. ⭐ recommended primary.

Leslie Lamport, "The Part-Time Parliament" (1998) — the original Paxos paper. Famous for being inscrutable.
Lamport, "Paxos Made Simple" (2001) — Lamport's own re-explanation. Still tough.
Chandra, Griesemer, Redstone, "Paxos Made Live" (2007) — Google's actual experience implementing Paxos. Cured the field of "Paxos is simple" claims.

Production Raft implementations

etcd's Raft (github.com/etcd-io/raft) — extracted as a library. Used by Kubernetes, CockroachDB, etc.
Hashicorp Raft (github.com/hashicorp/raft) — used by Consul, Vault, Nomad.

5. Curated tutorial list (from BYO-X)

The BYO-X catalog does not have a dedicated "Consensus" or "Raft" category. Coverage shows up across categories. The recommendations here come from the broader Raft community.

Canonical references

Eli Bendersky, "Implementing Raft" — 5 parts: elections, command processing, persistence and optimizations, key-value DB, exactly-once delivery. ⭐ recommended primary.
MIT 6.5840 labs 2A–2D — the graded Raft labs. The test suite is the gold standard for verifying a Raft implementation.
Jon Gjengset, "Demystifying Consensus" (video lectures + Raft impl in Rust).
Phil Eaton, "Implementing the Raft distributed consensus protocol in Go" (notes.eatonphil.com/2023-05-25-raft.html).

6. Recommended primary path

Eli Bendersky's "Implementing Raft" series (Go), supplemented by the Raft paper.

Five parts:

Elections — leader election only.
Command processing — log replication, simple consensus.
Persistence and optimizations — restart safely, faster recovery.
KV DB — apply Raft to build a replicated KV store.
Exactly-once delivery — handle duplicate client requests after leader failover.

After each part, your implementation should pass the corresponding behaviors.

If you want graded feedback: do MIT 6.5840 labs 2A–2D. Their test suite is brutal and finds bugs in ways your own tests won't.

For Rust: Jon Gjengset's lecture series is well-paced.

7. Implementation milestones

Milestone 1: Leader election (Raft paper §5.2)

Every node is in one of three states: Follower, Candidate, Leader. A node times out → becomes Candidate → requests votes → if majority votes → becomes Leader.

type RaftState int
const (
    Follower RaftState = iota
    Candidate
    Leader
)

type ConsensusModule struct {
    id          int
    peerIds     []int
    currentTerm int
    votedFor    int
    state       RaftState
    log         []LogEntry
    
    electionResetEvent time.Time
    // ...
}

func (cm *ConsensusModule) runElectionTimer() {
    timeoutDuration := cm.electionTimeout()  // 150-300ms randomized
    cm.mu.Lock()
    termStarted := cm.currentTerm
    cm.mu.Unlock()
    
    ticker := time.NewTicker(10 * time.Millisecond)
    defer ticker.Stop()
    for {
        <-ticker.C
        cm.mu.Lock()
        if cm.state != Candidate && cm.state != Follower {
            cm.mu.Unlock()
            return
        }
        if termStarted != cm.currentTerm {
            cm.mu.Unlock()
            return
        }
        if elapsed := time.Since(cm.electionResetEvent); elapsed >= timeoutDuration {
            cm.startElection()
            cm.mu.Unlock()
            return
        }
        cm.mu.Unlock()
    }
}

RequestVote RPC — peer asks for a vote. Grant if voter hasn't voted this term and candidate's log is at least as up-to-date as voter's.

Evidence: Bring up 3 nodes. Within seconds, one is elected leader. Kill the leader; another is elected within seconds.

Milestone 2: Log replication (Raft paper §5.3)

Leader receives client commands, appends to its log, sends AppendEntries to followers with the new entries. Followers append if consistent with their existing log; reject otherwise.

type AppendEntriesArgs struct {
    Term         int
    LeaderId     int
    PrevLogIndex int
    PrevLogTerm  int
    Entries      []LogEntry
    LeaderCommit int
}

Crucial consistency check: a follower accepts entries only if its log already has an entry at PrevLogIndex with term PrevLogTerm. If not, reject. Leader will retry with an earlier PrevLogIndex.

Evidence: Submit a command to the leader; all 3 nodes' logs converge to contain it.

Milestone 3: Commit and apply

Once an entry is replicated to a majority of nodes, leader marks it committed. Leaders track commitIndex; followers learn it via AppendEntries's LeaderCommit field. Each node applies committed entries to its local state machine.

Crucial: a leader may only commit entries from its own term (Raft paper §5.4.2). This is the most subtle correctness rule.

Evidence: A command is acked to the client only after a majority has it. Killing the leader does not lose acked commands.

Milestone 4: Persistence

Persist currentTerm, votedFor, and log to disk before responding to any RPC. Without this, a node that crashes and restarts could vote twice in the same term — breaking safety.

// Before responding to any RPC, persist state.
func (cm *ConsensusModule) persistToStorage() {
    var termData bytes.Buffer
    if err := gob.NewEncoder(&termData).Encode(cm.currentTerm); err != nil { log.Fatal(err) }
    cm.storage.Set("currentTerm", termData.Bytes())
    // ...same for votedFor and log
}

Evidence: Kill all 3 nodes mid-write. Restart. Log is consistent.

Milestone 5: Apply to a state machine (KV store)

Wire up a key-value store as the replicated state machine. Every committed Raft entry is {op: put|get|delete, key, value} applied to a local hashmap on every node.

Now you have a replicated KV store that survives node failures. This is what etcd and Consul do.

Evidence: From any node, put(foo, bar), then get(foo) from any other node returns bar. Kill a node; reads still work from the others.

Milestone 6: Linearizable client semantics

After a leader fails over, a client's in-flight request may have been committed but never acknowledged. The client retries. Without care, the command is applied twice.

Solution: each client request has a unique ID. The leader records applied IDs. Duplicate requests return the cached result instead of re-applying.

This is what Bendersky's Part 5 covers. Exactly-once semantics.

Evidence: A client retries during a leader failover; the command is applied exactly once.

Milestone 7 (optional, hard): Log compaction / snapshots

Logs grow forever. Eventually you must snapshot the state machine and discard old log entries. Followers that fall too far behind receive a snapshot, not historical entries.

Milestone 8 (optional, hard): Membership changes

Adding or removing a node from the cluster without downtime. Joint consensus (Raft paper §6) is the textbook approach.

8. Tests & evidence

Test	How
Single-node	Trivially handles its own consensus
3-node steady state	Commands replicate; offsets advance
Leader failure	Failover within ~2× election timeout; no data loss
Follower failure	Cluster continues; failed node recovers state on rejoin
Network partition	Minority side can't make progress; majority side proceeds; on reunification, minority's recent log truncated
Persistence	Kill all nodes; restart; state recovers
Stress	Continuous writes during random kills; final state consistent across all alive nodes
MIT 6.5840 tests	Pass labs 2A, 2B, 2C, 2D

The single strongest evidence: passing MIT 6.5840's Raft test suite. If your code passes 2A through 2D, it's a real Raft implementation. The test suite catches subtle bugs that hand-rolled tests miss.

9. Common pitfalls

Forgetting persistence before responding. Voting twice in the same term across a restart violates safety.
Committing entries from previous terms. Raft §5.4.2. The single subtlest correctness rule. Easy to get wrong.
Election timeouts not randomized. With identical timeouts, candidates split votes forever (livelock).
nextIndex regression. When AppendEntries is rejected, decrement nextIndex and retry. The leader must eventually find the agreement point.
Goroutine leaks. Election timers, replication loops — each node has many concurrent loops. Easy to leak.
Holding locks during RPC calls. Will deadlock if the peer is also calling you back. Drop the lock before sending RPCs.
Confusing leader's commitIndex with applied state. The leader knows what's committed; the state machine knows what's applied. Separate.
Reading stale data from followers. Reads must go through the leader or use a "read index" protocol.
Trying to optimize before correctness. Bendersky's Part 3 covers optimizations. Don't add them in Part 1.

10. Extensions

Read leases for leaders. Avoid an AppendEntries round-trip per read.
Batching. Coalesce multiple commands into one AppendEntries.
Pipelining. Send AppendEntries N+1 before N's reply arrives.
Snapshots. Already noted. Required for production.
Membership changes. Already noted. Required for production.
Pre-vote. Prevents a partitioned node from disrupting the cluster on rejoin.

A "production-ready" Raft is much more work than a "passes the tests" Raft. etcd's Raft library has been refined for a decade.

11. Module integration

Module	What Raft deepens
Sem 6 Module 3 — Replication & partitioning	Raft replicates a single log; sharding partitions data across multiple Rafts.
Sem 6 Module 4 — Transactions & consistency	Raft provides linearizable consistency. Foundation for distributed transactions.
Sem 6 Module 5 — Distributed fundamentals	Raft is the canonical project for this module.
Kafka-like tutorial	Leader election in Kafka is now done via Raft (KRaft).
Database (KV) tutorial	Apply Raft on top → replicated KV store (etcd's design).
Blockchain tutorial	Nakamoto consensus vs Raft — different families, illuminating contrast.

12. Portfolio framing

What to publish:

Source organized as raft/{state,rpc,storage,timer}.go, kv/, client/.
A test report with results from MIT 6.5840 labs or equivalent.
A README with:
- State diagram (Follower / Candidate / Leader).
- A list of corner cases handled.
- The cluster benchmarks (commit latency, throughput).

Reviewer entry points:

raft/state.go — the state machine.
raft/election.go — leader election.
raft/replication.go — log replication.
tests/lab_2c.log — the persistence test results.
README must include: state diagram, list of papers consulted, honest scope statement.

A working Raft is one of the most distinguishing portfolio pieces in distributed systems. It is also notoriously a place where people say they've implemented Raft but actually haven't gotten the corner cases right. Passing MIT 6.5840's tests is the credibility marker.

Source

This tutorial draws on the BYO-X catalog (distributed-systems entries) and the broader Raft literature. The Ongaro/Ousterhout paper is the canonical primary source; Eli Bendersky's 5-part series and MIT 6.5840 are the canonical primary tutorial paths.

1. Overview & motivation​

2. Where this fits in the degree​

3. Prerequisites​

4. Theory & research​

Required reading​

Strongly recommended​

For context (Paxos and related)​

Production Raft implementations​

5. Curated tutorial list (from BYO-X)​

Canonical references​

6. Recommended primary path​

7. Implementation milestones​

Milestone 1: Leader election (Raft paper §5.2)​

Milestone 2: Log replication (Raft paper §5.3)​

Milestone 3: Commit and apply​

Milestone 4: Persistence​

Milestone 5: Apply to a state machine (KV store)​

Milestone 6: Linearizable client semantics​

Milestone 7 (optional, hard): Log compaction / snapshots​

Milestone 8 (optional, hard): Membership changes​

8. Tests & evidence​

9. Common pitfalls​

10. Extensions​

11. Module integration​

12. Portfolio framing​

Source​