Chapter 9: Failure Detection
This page is a generated reference surface for selective reading. It exists to keep the learner apps guide-first while still preserving source access.
Learning objectives
- Explain the main ideas and vocabulary in Failure Detection.
- Work through the source examples for Failure Detection without depending on raw chunk order.
- Use Failure Detection as selective reference when learner modules point back to Database Internals.
Prerequisites
- Earlier prerequisite concepts leading into Chapter 9: Failure Detection.
Module targets
module-03-replication-partitioningmodule-05-distributed-systems-fundamentals
AI companion modes
- Explain simply
- Socratic tutor
- Quiz me
- Challenge my understanding
- Diagnose my confusion
- Generate extra practice
- Revision mode
- Connect forward / backward
Source-of-truth note
This unit is anchored to Database Internals and the source chapter "Chapter 9: Failure Detection". Use external resources only to clarify, extend, or modernize details without replacing the chapter's conceptual spine.
External enrichment
No chapter-specific enrichment resources are curated yet. Add them in the unit manifest when a source clearly improves learning.
Source provenance
- Primary source:
Database Internals - Source chapter 09: Chapter 9: Failure Detection
- Raw source file:
072-chapter-9-failure-detection.md - Raw source file:
073-phi-accrual-failure-detector.md
Merged source
Chapter 9 Failure Detection
Chapter 9. Failure Detection
If a tree falls in a forest and no one is around to hear it, does it make a sound?
--Unknown Author In order for a system to appropriately react to failures, failures should be detected in a timely manner. A faulty process might get contacted even though it won't be able to respond, increasing latencies and reducing overall system availability. Detecting failures in asynchronous distributed systems (i.e., without making any timing assumptions) is extremely difficult as it's impossible to tell whether the process has crashed, or is running slowly and taking an indefinitely long time to respond. We discussed a problem related to this one in "FLP Impossibility" on page 189. Terms such as dead, failed, and crashed are usually used to describe a process that has stopped executing its steps completely. Terms such as unresponsive, faulty, and slow are used to describe suspected processes, which may actually be dead. Failures may occur on the link level (messages between processes are lost or delivered slowly), or on the process level (the process crashes or is running slowly), and slowness may not always be distinguishable from failure. This means there's always a trade-off between wrongly suspecting alive processes as dead (producing falsepositives), and delaying marking an unresponsive process as dead, giving it the benefit of doubt and expecting it to respond eventually (producing false-negatives). A failure detector is a local subsystem responsible for identifying failed or unreachable processes to exclude them from the algorithm and guarantee liveness while preserving safety.
Liveness and safety are the properties that describe an algorithm's ability to solve a specific problem and the correctness of its output. More formally, liveness is a property that guarantees that a specific intended event must occur. For example, if one of
Heartbeats and Pings
We can query the state of remote processes by triggering one of two periodic processes:
-
We can trigger a ping, which sends messages to remote processes, checking if they are still alive by expecting a response within a specified time period.
-
We can trigger a heartbeat when the process is actively notifying its peers that it's still running by sending messages to them.
| Chapter 9: Failure Detection
Timeout-Free Failure Detector
Some algorithms avoid relying on timeouts for detecting failures. For example, Heartbeat, a timeout-free failure detector [AGUILERA97], is an algorithm that only counts heartbeats and allows the application to detect process failures based on the data in the heartbeat counter vectors. Since this algorithm is timeout-free, it operates under asynchronous system assumptions.
The algorithm assumes that any two correct processes are connected to each other with a fair path, which contains only fair links (i.e., if a message is sent over this link
Outsourced Heartbeats
An alternative approach, used by the Scalable Weakly Consistent Infection-style Process Group Membership Protocol (SWIM) [GUPTA01] is to use outsourced heartbeats to improve reliability using information about the process liveness from the perspective of its neighbors. This approach does not require processes to be aware of all other processes in the network, only a subset of connected peers. As shown in Figure 9-3, process P1 sends a ping message to process P2. P2 doesn't respond to the message, so P1 proceeds by selecting multiple random members (P3 and P4). These random members try sending heartbeat messages to P2 and, if it responds, forward acknowledgments back to P1.
| Chapter 9: Failure Detection
Phi Accrual Failure Detector
Phi-Accrual Failure Detector
Instead of treating node failure as a binary problem, where the process can be only in two states: up or down, a phi-accrual (φ-accrual) failure detector [HAYASHIBARA04] has a continuous scale, capturing the probability of the monitored process's crash. It works by maintaining a sliding window, collecting arrival times of the most recent heartbeats from the peer processes. This information is used to approximate arrival time of the next heartbeat, compare this approximation with the actual arrival time, and compute the suspicion level φ: how certain the failure detector is about the failure, given the current network conditions.
The algorithm works by collecting and sampling arrival times, creating a view that can be used to make a reliable judgment about node health. It uses these samples to compute the value of φ: if this value reaches a threshold, the node is marked as down. This failure detector dynamically adapts to changing network conditions by adjusting the scale on which the node can be marked as a suspect. From the architecture perspective, a phi-accrual failure detector can be viewed as a combination of three subsystems:
Monitoring Collecting liveness information through pings, heartbeats, or request-response sampling.
Gossip and Failure Detection
Another approach that avoids relying on a single-node view to make a decision is a gossip-style failure detection service [VANRENESSE98], which uses gossip (see "Gossip Dissemination" on page 250) to collect and distribute states of neighboring processes.
Each member maintains a list of other members, their heartbeat counters, and timestamps, specifying when the heartbeat counter was incremented for the last time. Periodically, each member increments its heartbeat counter and distributes its list to a random neighbor. Upon the message receipt, the neighboring node merges the list with its own, updating heartbeat counters for the other neighbors. Nodes also periodically check the list of states and heartbeat counters. If any node did not update its counter for long enough, it is considered failed. This timeout period should be chosen carefully to minimize the probability of false-positives. How often members have to communicate with each other (in other words, worst-case bandwidth) is capped, and can grow at most linearly with a number of processes in the system.
| Chapter 9: Failure Detection
Reversing Failure Detection Problem Statement
Since propagating the information about failures is not always possible, and propagating it by notifying every member might be expensive, one of the approaches, called FUSE (failure notification service) [DUNAGAN04], focuses on reliable and cheap failure propagation that works even in cases of network partitions. To detect process failures, this approach arranges all active processes in groups. If one of the groups becomes unavailable, all participants detect the failure. In other words, every time a single process failure is detected, it is converted and propagated as a group failure. This allows detecting failures in the presence of any pattern of disconnects, partitions, and node failures.
Processes in the group periodically send ping messages to other members, querying whether they're still alive. If one of the members cannot respond to this message because of a crash, network partition, or link failure, the member that has initiated this ping will, in turn, stop responding to ping messages itself.