Chapter 10: Leader Election

This page is a generated reference surface for selective reading. It exists to keep the learner apps guide-first while still preserving source access.

Learning objectives

Explain the main ideas and vocabulary in Leader Election.
Work through the source examples for Leader Election without depending on raw chunk order.
Use Leader Election as selective reference when learner modules point back to Database Internals.

Prerequisites

Earlier prerequisite concepts leading into Chapter 10: Leader Election.

Module targets

module-03-replication-partitioning
module-05-distributed-systems-fundamentals

AI companion modes

Explain simply
Socratic tutor
Quiz me
Challenge my understanding
Diagnose my confusion
Generate extra practice
Revision mode
Connect forward / backward

Source-of-truth note

This unit is anchored to Database Internals and the source chapter "Chapter 10: Leader Election". Use external resources only to clarify, extend, or modernize details without replacing the chapter's conceptual spine.

External enrichment

No chapter-specific enrichment resources are curated yet. Add them in the unit manifest when a source clearly improves learning.

Source provenance

Primary source: Database Internals
Source chapter 10: Chapter 10: Leader Election
Raw source file: 075-chapter-10-leader-election.md
Raw source file: 076-bully-algorithm.md
Raw source file: 077-invitation-algorithm.md

Merged source

Chapter 10 Leader Election

Chapter 10. Leader Election

Synchronization can be quite costly: if each algorithm step involves contacting each other participant, we can end up with a significant communication overhead. This is particularly true in large and geographically distributed networks. To reduce synchronization overhead and the number of message round-trips required to reach a decision, some algorithms rely on the existence of the leader (sometimes called coordinator) process, responsible for executing or coordinating steps of a distributed algorithm.

Generally, processes in distributed systems are uniform, and any process can take over the leadership role. Processes assume leadership for long periods of time, but this is not a permanent role. Usually, the process remains a leader until it crashes. After the crash, any other process can start a new election round, assume leadership, if it gets elected, and continue the failed leader's work. The liveness of the election algorithm guarantees that most of the time there will be a leader, and the election will eventually complete (i.e., the system should not be in the election state indefinitely).

Ideally, we'd like to assume safety, too, and guarantee there may be at most one leader at a time, and completely eliminate the possibility of a split brain situation (when two leaders serving the same purpose are elected but unaware of each other). However, in practice, many leader election algorithms violate this agreement. Leader processes can be used, for example, to achieve a total order of messages in a broadcast. The leader collects and holds the global state, receives messages, and disseminates them among the processes. It can also be used to coordinate system reorganization after the failure, during initialization, or when important state changes happen.

Election is triggered when the system initializes, and the leader is elected for the first time, or when the previous leader crashes or fails to communicate. Election has to be deterministic: exactly one leader has to emerge from the process. This decision needs to be effective for all participants. Even though leader election and distributed locking (i.e., exclusive ownership over a shared resource) might look alike from a theoretical perspective, they are slightly different. If one process holds a lock for executing a critical section, it is unimportant for other processes to know who exactly is holding a lock right now, as long as the liveness property is satisfied (i.e., the lock will be eventually released, allowing others to acquire it). In contrast, the elected process has some special properties and has to be known to all other participants, so the newly elected leader has to notify its peers about its role.

If a distributed locking algorithm has any sort of preference toward some process or group of processes, it will eventually starve nonpreferred processes from the shared resource, which contradicts the liveness property. In contrast, the leader can remain in its role until it stops or crashes, and long-lived leaders are preferred. Having a stable leader in the system helps to avoid state synchronization between remote participants, reduce the number of exchanged messages, and drive execution from a single process instead of requiring peer-to-peer coordination. One of the potential problems in systems with a notion of leadership is that the leader can become a bottleneck. To overcome that, many systems partition data in nonintersecting independent replica sets (see "Database Partitioning" on page 270). Instead of having a single system-wide leader, each replica set has its own leader. One of the systems that uses this approach is Spanner (see "Distributed Transactions with Spanner" on page 268).

Because every leader process will eventually fail, failure has to be detected, reported, and reacted upon: a system has to elect another leader to replace the failed one. Some algorithms, such as ZAB (see "Zookeeper Atomic Broadcast (ZAB)" on page 283), Multi-Paxos (see "Multi-Paxos" on page 291), or Raft (see "Raft" on page 300), use temporary leaders to reduce the number of messages required to reach an agreement between the participants. However, these algorithms use their own algorithmspecific means for leader election, failure detection, and resolving conflicts between the competing leader processes.

| Chapter 10: Leader Election

Bully Algorithm

One of the leader election algorithms, known as the bully algorithm, uses process ranks to identify the new leader. Each process gets a unique rank assigned to it. During the election, the process with the highest rank becomes a leader [MOLINA82]. This algorithm is known for its simplicity. The algorithm is named bully because the highest-ranked node "bullies" other nodes into accepting it. It is also known as monarchial leader election: the highest-ranked sibling becomes a monarch after the previous one ceases to exist.

Election starts if one of the processes notices that there's no leader in the system (it was never initialized) or the previous leader has stopped responding to requests, and proceeds in three steps:1 1. The process sends election messages to processes with higher identifiers. 2. The process waits, allowing higher-ranked processes to respond. If no higherranked process responds, it proceeds with step 3. Otherwise, the process notifies the highest-ranked process it has heard from, and allows it to proceed with step 3.

The process assumes that there are no active processes with a higher rank, and notifies all lower-ranked processes about the new leader.

Figure 10-1 illustrates the bully leader election algorithm:

a) Process 3 notices that the previous leader 6 has crashed and starts a new election by sending Election messages to processes with higher identifiers.
b) 4 and 5 respond with Alive, as they have a higher rank than 3.
c) 3 notifies the highest-ranked process 5 that has responded during this round.
d) 5 is elected as a new leader. It broadcasts Elected messages, notifying lowerranked processes about the election results.

1 These steps describe the modified bully election algorithm [KORDAFSHARI05] as it's more compact and clear.

Next-In-Line Failover

There are many versions of the bully algorithm that improve its various properties. For example, we can use multiple next-in-line alternative processes as a failover to shorten reelections [GHOLIPOUR09].

Each elected leader provides a list of failover nodes. When one of the processes detects a leader failure, it starts a new election round by sending a message to the highest-ranked alternative from the list provided by the failed leader. If one of the proposed alternatives is up, it becomes a new leader without having to go through the complete election round.

If the process that has detected the leader failure is itself the highest ranked process from the list, it can notify the processes about the new leader right away.

Figure 10-2 shows the process with this optimization in place:

a) 6, a leader with designated alternatives {5,4}, crashes. 3 notices this failure and contacts 5, the alternative from the list with the highest rank.

| Chapter 10: Leader Election

Candidate/Ordinary Optimization

Another algorithm attempts to lower requirements on the number of messages by splitting the nodes into two subsets, candidate and ordinary, where only one of the candidate nodes can eventually become a leader [MURSHED12]. The ordinary process initiates election by contacting candidate nodes, collecting responses from them, picking the highest-ranked alive candidate as a new leader, and then notifying the rest of the nodes about the election results. To solve the problem with multiple simultaneous elections, the algorithm proposes to use a tiebreaker variable δ, a process-specific delay, varying significantly between the nodes, that allows one of the nodes to initiate the election before the other ones. The tiebreaker time is generally greater than the message round-trip time. Nodes with higher priorities have a lower δ, and vice versa.

Figure 10-3 shows the steps of the election process:

a) Process 4 from the ordinary set notices the failure of leader process 6. It starts a new election round by contacting all remaining processes from the candidate set.
b) Candidate processes respond to notify 4 that they're still alive.
c) 4 notifies all processes about the new leader: 2.

Invitation Algorithm

An invitation algorithm allows processes to "invite" other processes to join their groups instead of trying to outrank them. This algorithm allows multiple leaders by definition, since each group has its own leader. Each process starts as a leader of a new group, where the only member is the process itself. Group leaders contact peers that do not belong to their groups, inviting them to join. If the peer process is a leader itself, two groups are merged. Otherwise, the contacted process responds with a group leader ID, allowing two group leaders to establish contact and merge groups in fewer steps.

Figure 10-4 shows the execution steps of the invitation algorithm:

a) Four processes start as leaders of groups containing one member each. 1 invites 2 to join its group, and 3 invites 4 to join its group.
b) 2 joins a group with process 1, and 4 joins a group with process 3. 1, the leader of the first group, contacts 3, the leader of the other group. Remaining group members (4, in this case) are notified about the new group leader.
c) Two groups are merged and 1 becomes a leader of an extended group.

Figure 10-4. Invitation algorithm

| Chapter 10: Leader Election

Ring Algorithm

In the ring algorithm [CHANG79], all nodes in the system form a ring and are aware of the ring topology (i.e., their predecessors and successors in the ring). When the process detects the leader failure, it starts the new election. The election message is forwarded across the ring: each process contacts its successor (the next node closest to it in the ring). If this node is unavailable, the process skips the unreachable node and attempts to contact the nodes after it in the ring, until eventually one of them responds.

Nodes contact their siblings, following around the ring and collecting the live node set, adding themselves to the set before passing it over to the next node, similar to the failure-detection algorithm described in "Timeout-Free Failure Detector" on page 197, where nodes append their identifiers to the path before passing it to the next node.

The algorithm proceeds by fully traversing the ring. When the message comes back to the node that started the election, the highest-ranked node from the live set is chosen as a leader. In Figure 10-5, you can see an example of such a traversal:

a) Previous leader 6 has failed and each process has a view of the ring from its perspective.
b) 3 initiates an election round by starting traversal. On each step, there's a set of nodes traversed on the path so far. 5 can't reach 6, so it skips it and goes straight to 1.
c) Since 5 was the node with the highest rank, 3 initiates another round of messages, distributing the information about the new leader.

Learning objectives​

Prerequisites​

Module targets​

AI companion modes​

Source-of-truth note​

External enrichment​

Source provenance​

Merged source​

Chapter 10 Leader Election​

Chapter 10. Leader Election

Bully Algorithm​

Bully Algorithm

Next-In-Line Failover​

Candidate/Ordinary Optimization​

Invitation Algorithm​

Invitation Algorithm

Ring Algorithm​

Learning objectives

Prerequisites

Module targets

AI companion modes

Source-of-truth note

External enrichment

Source provenance

Merged source

Chapter 10 Leader Election

Bully Algorithm

Next-In-Line Failover

Candidate/Ordinary Optimization

Invitation Algorithm

Ring Algorithm