Leader Election and Split-Brain Prevention
What This Concept Is
Leader election is the mechanism by which a group of nodes agrees on exactly one leader for some purpose (write primary, scheduler, shard owner). Split-brain is the failure mode where two or more nodes simultaneously believe they are leader and issue conflicting writes.
Three standard techniques, increasingly robust:
- Quorum-based election. The leader is the node that has a majority of votes in a term. Because any two majorities intersect, only one candidate can be elected per term. This is what Raft and Paxos use.
- Leases. The leader holds a time-bounded lease from a coordination service. When the lease expires, another node can take leadership. This bounds the window in which two leaders can coexist (and you must respect this bound).
- Fencing tokens. Every write includes a monotonically-increasing token; storage rejects writes whose token is older than the latest seen. This protects the storage layer from a stale leader even if leader election is briefly confused.
In real systems, you combine these. Raft gives you quorum-based election with terms. Quorum + terms acts as an implicit fencing: storage backends check that the leader's term is current.
Why It Matters Here
Split-brain is the canonical correctness failure in leader-based systems. It is the reason why:
- "I used a lock to pick the leader" is not enough (the lock can be taken by a partitioned process that assumed the old holder died).
- "I used a GCD TTL" is not enough (the process holding the TTL can pause for longer than the TTL).
- "I used Kubernetes for leader election" is only enough if your code correctly propagates the fencing token on every write.
This concept ties the election mechanism (Raft, lease) to the write path (storage that respects fencing).
Concrete Example: The Martin Kleppmann Redlock Scenario
A paused leader can cause harm even with a correct lock service.
The fencing token is the essential piece. Without it, the old leader's write would overwrite the new leader's write, and both leaders' clients would see inconsistent results. With the fencing token, storage is the final arbiter.
Common Confusion / Misconception
"If the leader crashes, the followers detect it and elect a new leader - that's all we need." That describes election. It does not describe what happens when the old leader is not crashed but paused, partitioned, or flapping. Without fencing on writes, both the old and new leader will happily issue writes during the transition window.
A second misconception: "Redlock solves distributed locking." Kleppmann's 2016 critique (and subsequent discussion with Salvatore Sanfilippo) showed that naive Redlock does not prevent split-brain under pauses unless combined with fencing tokens in the storage layer. The lock service alone is not sufficient.
A third: "TTL-based leases are safe if I make the TTL long enough." No. A long TTL reduces the probability of the race but never eliminates it, because an arbitrary-length pause is always possible under asynchrony. Fencing is the structural fix; TTL is a tuning knob.
How To Use It
When you design a leader-based service:
- Pick the election mechanism (Raft via etcd/ZooKeeper/Consul; or a cloud-provider-managed primitive).
- Record the term or epoch or token the leader was elected at. This is the fencing token.
- Propagate the token on every write the leader makes to shared storage.
- Have the storage layer reject any write with a token older than the latest it has seen.
- When the leader steps down, have it stop issuing writes before being observed as dead (stop accepting requests, drain in-flight ones - "graceful leave").
Check Yourself
- What is split-brain, and why do timeouts alone fail to prevent it?
- Why do fencing tokens need to be monotonically increasing?
- Why is "elect a leader with a distributed lock" insufficient without fencing?
- What does Raft's
termnumber act as, with respect to this concept?
Mini Drill or Application
Design leader election for a cron-like service that must run each scheduled job exactly once. Specify: where does the leader's term come from? What token does it attach to a job completion write? How does storage prevent an old leader from marking a job "done" twice?