Chapter 9: Failure Detection

This page is a generated reference surface for selective reading. It exists to keep the learner apps guide-first while still preserving source access.

Learning objectives

Explain the main ideas and vocabulary in Failure Detection.
Work through the source examples for Failure Detection without depending on raw chunk order.
Use Failure Detection as selective reference when learner modules point back to Database Internals.

Prerequisites

Earlier prerequisite concepts leading into Chapter 9: Failure Detection.

Module targets

module-03-replication-partitioning
module-05-distributed-systems-fundamentals

AI companion modes

Explain simply
Socratic tutor
Quiz me
Challenge my understanding
Diagnose my confusion
Generate extra practice
Revision mode
Connect forward / backward

Source-of-truth note

This unit is anchored to Database Internals and the source chapter "Chapter 9: Failure Detection". Use external resources only to clarify, extend, or modernize details without replacing the chapter's conceptual spine.

External enrichment

No chapter-specific enrichment resources are curated yet. Add them in the unit manifest when a source clearly improves learning.

Source provenance

Primary source: Database Internals
Source chapter 09: Chapter 9: Failure Detection
Raw source file: 072-chapter-9-failure-detection.md
Raw source file: 073-phi-accrual-failure-detector.md

Merged source

Chapter 9 Failure Detection

Chapter 9. Failure Detection

If a tree falls in a forest and no one is around to hear it, does it make a sound?

--Unknown Author In order for a system to appropriately react to failures, failures should be detected in a timely manner. A faulty process might get contacted even though it won't be able to respond, increasing latencies and reducing overall system availability. Detecting failures in asynchronous distributed systems (i.e., without making any timing assumptions) is extremely difficult as it's impossible to tell whether the process has crashed, or is running slowly and taking an indefinitely long time to respond. We discussed a problem related to this one in "FLP Impossibility" on page 189. Terms such as dead, failed, and crashed are usually used to describe a process that has stopped executing its steps completely. Terms such as unresponsive, faulty, and slow are used to describe suspected processes, which may actually be dead. Failures may occur on the link level (messages between processes are lost or delivered slowly), or on the process level (the process crashes or is running slowly), and slowness may not always be distinguishable from failure. This means there's always a trade-off between wrongly suspecting alive processes as dead (producing falsepositives), and delaying marking an unresponsive process as dead, giving it the benefit of doubt and expecting it to respond eventually (producing false-negatives). A failure detector is a local subsystem responsible for identifying failed or unreachable processes to exclude them from the algorithm and guarantee liveness while preserving safety.

Liveness and safety are the properties that describe an algorithm's ability to solve a specific problem and the correctness of its output. More formally, liveness is a property that guarantees that a specific intended event must occur. For example, if one of

Heartbeats and Pings

We can query the state of remote processes by triggering one of two periodic processes:

We can trigger a ping, which sends messages to remote processes, checking if they are still alive by expecting a response within a specified time period.
We can trigger a heartbeat when the process is actively notifying its peers that it's still running by sending messages to them.

| Chapter 9: Failure Detection

Timeout-Free Failure Detector

Some algorithms avoid relying on timeouts for detecting failures. For example, Heartbeat, a timeout-free failure detector [AGUILERA97], is an algorithm that only counts heartbeats and allows the application to detect process failures based on the data in the heartbeat counter vectors. Since this algorithm is timeout-free, it operates under asynchronous system assumptions.

The algorithm assumes that any two correct processes are connected to each other with a fair path, which contains only fair links (i.e., if a message is sent over this link

Outsourced Heartbeats

An alternative approach, used by the Scalable Weakly Consistent Infection-style Process Group Membership Protocol (SWIM) [GUPTA01] is to use outsourced heartbeats to improve reliability using information about the process liveness from the perspective of its neighbors. This approach does not require processes to be aware of all other processes in the network, only a subset of connected peers. As shown in Figure 9-3, process P1 sends a ping message to process P2. P2 doesn't respond to the message, so P1 proceeds by selecting multiple random members (P3 and P4). These random members try sending heartbeat messages to P2 and, if it responds, forward acknowledgments back to P1.

| Chapter 9: Failure Detection

Phi Accrual Failure Detector

Phi-Accrual Failure Detector

Instead of treating node failure as a binary problem, where the process can be only in two states: up or down, a phi-accrual (φ-accrual) failure detector [HAYASHIBARA04] has a continuous scale, capturing the probability of the monitored process's crash. It works by maintaining a sliding window, collecting arrival times of the most recent heartbeats from the peer processes. This information is used to approximate arrival time of the next heartbeat, compare this approximation with the actual arrival time, and compute the suspicion level φ: how certain the failure detector is about the failure, given the current network conditions.

The algorithm works by collecting and sampling arrival times, creating a view that can be used to make a reliable judgment about node health. It uses these samples to compute the value of φ: if this value reaches a threshold, the node is marked as down. This failure detector dynamically adapts to changing network conditions by adjusting the scale on which the node can be marked as a suspect. From the architecture perspective, a phi-accrual failure detector can be viewed as a combination of three subsystems:

Monitoring Collecting liveness information through pings, heartbeats, or request-response sampling.

Gossip and Failure Detection

Another approach that avoids relying on a single-node view to make a decision is a gossip-style failure detection service [VANRENESSE98], which uses gossip (see "Gossip Dissemination" on page 250) to collect and distribute states of neighboring processes.

Each member maintains a list of other members, their heartbeat counters, and timestamps, specifying when the heartbeat counter was incremented for the last time. Periodically, each member increments its heartbeat counter and distributes its list to a random neighbor. Upon the message receipt, the neighboring node merges the list with its own, updating heartbeat counters for the other neighbors. Nodes also periodically check the list of states and heartbeat counters. If any node did not update its counter for long enough, it is considered failed. This timeout period should be chosen carefully to minimize the probability of false-positives. How often members have to communicate with each other (in other words, worst-case bandwidth) is capped, and can grow at most linearly with a number of processes in the system.

| Chapter 9: Failure Detection

Reversing Failure Detection Problem Statement

Since propagating the information about failures is not always possible, and propagating it by notifying every member might be expensive, one of the approaches, called FUSE (failure notification service) [DUNAGAN04], focuses on reliable and cheap failure propagation that works even in cases of network partitions. To detect process failures, this approach arranges all active processes in groups. If one of the groups becomes unavailable, all participants detect the failure. In other words, every time a single process failure is detected, it is converted and propagated as a group failure. This allows detecting failures in the presence of any pattern of disconnects, partitions, and node failures.

Processes in the group periodically send ping messages to other members, querying whether they're still alive. If one of the members cannot respond to this message because of a crash, network partition, or link failure, the member that has initiated this ping will, in turn, stop responding to ping messages itself.

Learning objectives​

Prerequisites​

Module targets​

AI companion modes​

Source-of-truth note​

External enrichment​

Source provenance​

Merged source​

Chapter 9 Failure Detection​

Chapter 9. Failure Detection

Heartbeats and Pings​

Timeout-Free Failure Detector​

Outsourced Heartbeats​

Phi Accrual Failure Detector​

Phi-Accrual Failure Detector

Gossip and Failure Detection​

Reversing Failure Detection Problem Statement​

Learning objectives

Prerequisites

Module targets

AI companion modes

Source-of-truth note

External enrichment

Source provenance

Merged source

Chapter 9 Failure Detection

Heartbeats and Pings

Timeout-Free Failure Detector

Outsourced Heartbeats

Phi Accrual Failure Detector

Gossip and Failure Detection

Reversing Failure Detection Problem Statement