Skip to main content

Heartbeats, Timeouts, and Phi-Accrual Detectors

What This Concept Is

A failure detector is a mechanism that lets a process suspect that another process has failed. Because slow is indistinguishable from dead in an asynchronous network (Concept 3), all failure detectors are heuristics. Two standard quality metrics:

  • Completeness: every actually-dead process is eventually suspected by every correct process.
  • Accuracy: no correct process suspects a process that is actually alive.

A perfect detector (strong completeness and strong accuracy) would solve consensus, which we know from FLP is impossible in a fully asynchronous system. So all real detectors trade one off against the other.

Three standard styles:

  1. Heartbeat with fixed timeout. Peer sends "I am alive" every T_h ms. If we receive no heartbeat for T_o ms, we suspect it. Simple, coarse, and the one most systems ship. Tuning is hard: T_o too small -> false positives; too large -> slow detection.
  2. Heartbeat with adaptive timeout. Track recent arrival intervals; set the timeout to mean + k·stddev. Better under bursty networks but still a hard threshold.
  3. Phi-accrual detector. Instead of a yes/no decision, produce a continuous suspicion value φ(t) that increases as more time passes without a heartbeat. Let the application choose the threshold for acting. Cassandra and Akka use this.

Why It Matters Here

Failure detection is the input to every higher-level algorithm that needs to know "who is here right now": consensus (Cluster 4) needs to elect a new leader when the old one is suspected; membership (next concept) needs to add and remove peers; partitioning needs to reassign work when a shard owner is gone. A bad detector causes cascading incidents: flapping leaders, thrashing shards, and false split-brain.

Concrete Example

Cassandra uses phi-accrual. The formula (simplified) is:

  • Maintain a sliding window of last N inter-arrival times of heartbeats.
  • Fit a distribution (typically Normal) to that window.
  • When a heartbeat is late by t ms, compute P(next arrival > t | distribution).
  • φ = -log10(P).

Interpretation: φ = 1 means roughly 10% chance the peer is just late; φ = 3 means 0.1%; φ = 8 means 10⁻⁸. The operator picks a threshold, commonly 8. This lets the system adapt to a link that normally carries a heartbeat every 1s but occasionally takes 3s - a fixed 2s timeout would false-positive that peer constantly.

Common Confusion / Misconception

"Heartbeats detect crashes." They detect the absence of heartbeats, which can be caused by: crashes, GC pauses, network partitions, overloaded receivers (the heartbeat arrived but we are too busy to read the socket), and TCP retransmit storms. A healthy peer on the other side of a saturated NIC looks identical to a dead peer on the wire.

A second misconception: "Lower timeout is better." Lower timeout is faster detection, but more false positives. In a Raft cluster, false-positive leader death triggers an election; too many elections cause no progress. You want the detector tuned to the 99.9th percentile of your healthy network RTT plus jitter plus GC pauses, not to the median.

A third: "Phi-accrual is magic." It is not. It is still a threshold decision; the phi value just makes that threshold meaningful across different network conditions.

How To Use It

When tuning any failure detector:

  1. Profile the healthy RTT distribution for at least a few hours.
  2. Measure the 99th and 99.9th percentile, including any GC or scheduler pauses on the receiving side.
  3. For fixed timeouts: set T_o to roughly 3x the 99.9th percentile. For adaptive: set the phi threshold by the false-positive rate you can tolerate (phi=8 for most systems is a reasonable default).
  4. Separate the detector from the decision. Raft uses its own election timeout distinct from any detector; separate concerns so operators can tune one without breaking the other.
  5. Always log the transition from "suspected" to "confirmed dead" and include the evidence. Flapping is a symptom, not a conclusion.

Check Yourself

  1. What is the difference between completeness and accuracy?
  2. Why cannot a failure detector be both strongly complete and strongly accurate in an asynchronous network?
  3. What does φ measure?
  4. Why is a fixed 500ms timeout usually wrong in a cloud environment?

Mini Drill or Application

You are running a Raft cluster on a cloud provider with typical p50 inter-node RTT of 1ms and p99 of 50ms. JVM GCs on the follower occasionally pause for 300ms. Pick an election timeout range that avoids false positives, justify it, and explain how phi-accrual would let you react faster when the link is actually healthy.

Read This Only If Stuck