Partial Failure: The Single Defining Property
What This Concept Is
In a single-machine program, failure is binary: the process is running, or it has crashed. In a distributed system, failure is partial. Some components work, others do not, and from any one node's point of view you cannot always tell which is which.
Kleppmann's definition is useful: a distributed system is one where some parts of the system may be broken in some unpredictable way, even though other parts of the system are working fine. The key word is some. You cannot reboot a distributed system the way you reboot a laptop. You cannot "retry from the top" because some side effects have already happened.
Partial failure is the defining property in the sense that it is what makes distributed systems a separate discipline. Take it away and many of the hard problems (consensus, consistency, leader election) collapse.
Why It Matters Here
Every other topic in the module is a response to partial failure:
- Logical clocks exist because partial failure invalidates wall-clock timestamps as a global order.
- Consensus exists because partial failure means nodes may disagree about what was decided.
- Idempotency exists because partial failure means you cannot tell whether a request succeeded.
- Coordination services exist because partial failure makes it unsafe for each node to reason alone.
If you model every failure as "the whole system is down," you will design the wrong retry logic, the wrong timeouts, and the wrong recovery.
Concrete Example
You have three services: web, billing, inventory. A user places an order. web calls billing to charge the card, then calls inventory to reserve stock. Consider what happens when the network between web and inventory partitions while billing is already committed:
- Card is charged.
billingknows this.webknows this. webretriesinventory. Each retry times out because the partition is still there.inventoryis up and healthy from its own point of view - it just cannot hear fromweb.- From
web's point of view,inventoryis indistinguishable from a crashed service.
There is no state where "everything is broken." There is no state where "everything is fine." The system is in a correctly-observed inconsistency, and your code must handle it.
Common Confusion / Misconception
"We'll just treat partial failure as total failure and abort the operation." This is not always available. By the time you detect a partial failure, you may have already side-effected something irrecoverable (money moved, email sent, inventory decremented elsewhere). The honest answer is that your design has to carry the notion of "we don't know yet" through the code, not collapse it to success or failure prematurely.
A second misconception: "partial failure is rare." In a cloud environment with thousands of pods behind multiple layers of load balancers and service meshes, partial failure is the normal steady state. Full outages are rare; partial degradation is the weather.
How To Use It
When you design any cross-service operation, label each possible partial-failure outcome explicitly:
- Did the request arrive? Unknown.
- Was it processed? Unknown.
- Did the response come back? Unknown.
- For each of those three "unknown"s, what is the state of the world, and what is your client's next legal action?
The answers force idempotency, timeouts with bounded retries, and explicit reconciliation or compensation paths.
Check Yourself
- Why is partial failure specifically a distributed-systems property rather than a general fault-tolerance property?
- You send a request, get no response, and retry. Name three states your peer could be in during the retry.
- Why is it wrong to model "no response" as "the peer is down"?
Mini Drill or Application
Take any service call in your own codebase. Write out the three "unknown" states above, and what your current code does in each. You will almost certainly find at least one case where the code assumes a state it cannot actually observe.