Partial Failure: The Single Defining Property

What This Concept Is

In a single-machine program, failure is binary: the process is running, or it has crashed. In a distributed system, failure is partial. Some components work, others do not, and from any one node's point of view you cannot always tell which is which.

Kleppmann's definition is useful: a distributed system is one where some parts of the system may be broken in some unpredictable way, even though other parts of the system are working fine. The key word is some. You cannot reboot a distributed system the way you reboot a laptop. You cannot "retry from the top" because some side effects have already happened.

Partial failure is the defining property in the sense that it is what makes distributed systems a separate discipline. Take it away and many of the hard problems (consensus, consistency, leader election) collapse.

Why It Matters Here

Every other topic in the module is a response to partial failure:

Logical clocks exist because partial failure invalidates wall-clock timestamps as a global order.
Consensus exists because partial failure means nodes may disagree about what was decided.
Idempotency exists because partial failure means you cannot tell whether a request succeeded.
Coordination services exist because partial failure makes it unsafe for each node to reason alone.

If you model every failure as "the whole system is down," you will design the wrong retry logic, the wrong timeouts, and the wrong recovery.

Concrete Example

You have three services: web, billing, inventory. A user places an order. web calls billing to charge the card, then calls inventory to reserve stock. Consider what happens when the network between web and inventory partitions while billing is already committed:

Card is charged. billing knows this. web knows this.
web retries inventory. Each retry times out because the partition is still there.
inventory is up and healthy from its own point of view - it just cannot hear from web.
From web's point of view, inventory is indistinguishable from a crashed service.

There is no state where "everything is broken." There is no state where "everything is fine." The system is in a correctly-observed inconsistency, and your code must handle it.

Common Confusion / Misconception

"We'll just treat partial failure as total failure and abort the operation." This is not always available. By the time you detect a partial failure, you may have already side-effected something irrecoverable (money moved, email sent, inventory decremented elsewhere). The honest answer is that your design has to carry the notion of "we don't know yet" through the code, not collapse it to success or failure prematurely.

A second misconception: "partial failure is rare." In a cloud environment with thousands of pods behind multiple layers of load balancers and service meshes, partial failure is the normal steady state. Full outages are rare; partial degradation is the weather.

How To Use It

When you design any cross-service operation, label each possible partial-failure outcome explicitly:

Did the request arrive? Unknown.
Was it processed? Unknown.
Did the response come back? Unknown.
For each of those three "unknown"s, what is the state of the world, and what is your client's next legal action?

The answers force idempotency, timeouts with bounded retries, and explicit reconciliation or compensation paths.

Check Yourself

Why is partial failure specifically a distributed-systems property rather than a general fault-tolerance property?
You send a request, get no response, and retry. Name three states your peer could be in during the retry.
Why is it wrong to model "no response" as "the peer is down"?

Mini Drill or Application

Take any service call in your own codebase. Write out the three "unknown" states above, and what your current code does in each. You will almost certainly find at least one case where the code assumes a state it cannot actually observe.

What This Concept Is​

Why It Matters Here​

Concrete Example​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​