Failover and Split-Brain
What This Concept Is
Failover is the process of replacing a failed leader with a follower so writes can resume. It sounds simple; it is not. Every step of it can go wrong.
Split-brain is the pathological outcome when failover promotes a new leader while the old leader is still alive and thinks it is the leader. Now two nodes accept writes for the same keys, and the database has two diverging histories of the same rows. Untangling the damage is manual and expensive.
The mechanisms that prevent split-brain:
- Quorum-based election: the new leader requires a majority vote among replicas. If the old leader is in the minority of a partition, it cannot achieve quorum and cannot continue accepting writes.
- Fencing tokens: every leader is given a monotonically increasing token. Storage servers refuse to honor writes from older tokens. The old leader's writes are rejected even if it tries.
- STONITH ("shoot the other node in the head"): the failover controller forcibly powers off or isolates the old leader (via IPMI, cloud API, or network kill) before promoting the new one. Used in hardware-HA clusters (Pacemaker, Corosync).
Failover (happy path): Split-brain (bad path):
L F1 F2 L (still alive, partitioned) F1* (promoted by F2)
L dies L thinks it is leader F1 thinks it is leader
X F1 F2 Clients in L's segment write Clients in F1's segment write
\ /
vote Network heals ->
F1* promoted two divergent histories
manual merge or data loss
Why It Matters Here
Failover is the single most operationally complex moment in a replicated system. It combines every topic in this module:
- The sync mode (Cluster 3.8) determines how much data is at risk.
- The replication lag (Cluster 3.9) determines which follower can be promoted safely.
- The topology (Cluster 2) determines whether quorum can be reached.
- The routing (Cluster 5.13) determines how clients discover the new leader.
A team that cannot walk through a failover end-to-end, naming every place split-brain could enter, will eventually suffer split-brain in production.
Concrete Example -- A PostgreSQL Failover Timeline
T = 0 s Primary "pg-1" is healthy; replicas pg-2, pg-3 are in sync.
Patroni's distributed lock (in etcd) is held by pg-1.
T = 0 s pg-1's machine experiences a kernel panic.
T = 2 s Patroni on pg-1 cannot renew its etcd lease; lease expires.
T = 3 s etcd's lease has expired; lock is released.
T = 4 s Patroni on pg-2 and pg-3 see the lock is free.
T = 5 s They run an election: pg-2 has the higher LSN.
pg-2 acquires the etcd lock.
T = 5 s pg-2 executes pg_promote() -> becomes writable.
T = 6 s Patroni reconfigures HAProxy; new writes route to pg-2.
T = 6 s Clients see connection errors retry, then reconnect to pg-2.
T = 6 s+ Writes resume.
Meanwhile: pg-1 eventually reboots.
T = 120 s pg-1 boots, Patroni starts, sees the lock is held by pg-2.
Patroni reinitializes pg-1 as a follower of pg-2 (pg_rewind or basebackup).
Where split-brain could have entered:
- If Patroni did not hold an etcd lock, pg-1 could have continued accepting writes after its network came back.
- If the failover tool did not use a fencing token against the storage layer, the old primary could have written WAL records the new primary refused to accept, causing a "two heads" scenario.
- If etcd itself lost quorum during the failover, the system could have had multiple concurrent leaders.
Common Confusion / Misconception
"Failover is instant." Typical failover takes 5-30 seconds for detection (timeout vs. real failure) plus election plus reconfiguration. Any application that cannot tolerate seconds of write unavailability needs a stricter design (active-active with conflict resolution, not failover).
"If the old leader is unreachable, it cannot cause split-brain." It can if it is merely unreachable to the new leader but reachable to some clients (the classic "network partition fools some observers"). Fencing at the storage layer is the only robust prevention.
"We use cloud-managed databases, so we don't deal with this." Cloud-managed databases deal with it for you, but they also have incident reports where their failover misbehaved. Knowing what to look for in the postmortem is the point.
How To Use It
A failover-ready design answers all of:
- Who decides the leader is dead? (timeout, health check, ZooKeeper lease)
- How long does detection take, and how long before a new leader is active?
- Which follower is promoted? (highest LSN, operator choice, round-robin)
- What data is lost? (any un-replicated writes on the failed leader)
- How is the old leader prevented from causing split-brain? (fencing, STONITH, quorum)
- How do clients discover the new leader? (DNS, proxy, client reconfiguration)
Write a failover playbook. Run a game day. The first time you see the timeline, you want it to be in staging.
Check Yourself
- Define split-brain. Give a concrete example.
- Name three mechanisms that prevent split-brain and say what each blocks.
- Why does a lease-based lock (etcd/ZooKeeper) prevent an old leader from staying leader?
- What does "promote the follower with the highest LSN" buy you, and when does it fail?
Mini Drill or Application
For each failure, identify whether split-brain is possible and the mechanism that prevents it:
- PostgreSQL primary loses network to replicas and to etcd. Patroni on replicas elects a new primary.
- MongoDB primary loses network to the majority of the replica set.
- Cassandra node becomes unresponsive but comes back in 30 s.
- A three-datacenter replicated cluster is cleanly split: 2 DCs on one side, 1 DC on the other.
- A zookeeper cluster itself loses quorum during a leader-election window.