Skip to main content

Asynchrony: Slow vs Dead Is Undecidable

What This Concept Is

An asynchronous distributed system is one where no upper bound exists on message delivery time, clock drift, or process execution speed. Any node can pause arbitrarily (GC pause, hypervisor steal, disk stall, kernel page fault), and any message can be delayed arbitrarily (congestion, rerouting, retransmit). This is the honest default model for the internet and for cloud networks.

In such a system, when you send a request and do not get a reply after some time, you cannot tell whether:

  • the request was lost,
  • the request was delivered but is queued,
  • the peer is processing it but slowly,
  • the peer has crashed,
  • the reply was lost, or
  • the reply is on the way but you gave up too soon.

This is not a hypothetical. It is a mathematical consequence of the asynchronous model. A timeout is always a guess, not a detection.

Why It Matters Here

The entire rest of the module is built on top of this fact:

  • Consensus in a fully asynchronous network with even one faulty process is impossible (FLP). We will meet this in Cluster 4.
  • Failure detectors exist precisely because we need a heuristic for something that is not decidable.
  • Fencing tokens exist because a "dead" leader may actually just be paused and about to wake up.
  • Every retry-with-timeout design is trading false-positive detection against latency.

If you think a timeout detects a dead process, you will write split-brain bugs. A timeout only suspects.

Concrete Example

A 1-second stop-the-world JVM GC pause on a Raft leader. During the pause:

  1. Followers stop receiving heartbeats.
  2. After the heartbeat timeout (say 150ms), a follower becomes a candidate and starts a new election.
  3. The new leader begins accepting writes at term N+1.
  4. The old leader wakes up, still thinking it is the leader at term N, and happily tries to commit a write.

Without a fencing check on the write path, two leaders have now written. This is not a bug in the consensus algorithm; it is a bug in how the application treats the leader's identity. The old leader was not dead. It was slow. The algorithm treated it as dead. The difference matters.

Common Confusion / Misconception

"I'll just use a longer timeout to be safe." A longer timeout reduces false positives (declaring a slow peer dead) but increases detection latency, and does not change the fundamental undecidability. There is no timeout value at which "slow" and "dead" become distinguishable - only values that trade one error for the other.

A related misconception: "If the OS says the TCP connection is broken, the peer is down." TCP's RST or RESET can fire for many reasons: the connection was torn down, an intermediate box dropped state, a firewall aged out a flow. The peer application may be alive and unaware.

How To Use It

When you design any detector, quorum, or leader mechanism:

  1. Write down what you are suspecting, not what you are detecting.
  2. For every timeout, explicitly state the error you accept when it fires (false positive: healthy peer declared dead).
  3. Require a second mechanism (quorum vote, fencing token, lease expiry) to commit the assumption that a peer is dead.
  4. Never design around "the peer will notice it has been replaced" - the paused peer cannot notice.

Check Yourself

  1. Why is "the peer did not respond in 500ms" evidence of nothing except that no response arrived in 500ms?
  2. Name two ways a healthy process can appear dead to its peers.
  3. Why does the FLP impossibility result follow intuitively from this fact?

Mini Drill or Application

Write a 5-line pseudo-code routine that "detects" whether a peer is alive using a timeout. Then annotate every assumption about asynchrony it makes. You will find at least three.

Read This Only If Stuck