Byzantine Failures and the Trust Model
What This Concept Is
Failure models classify how a faulty process can misbehave. The common hierarchy, from most to least restrictive:
- Crash-stop (fail-stop): a faulty process simply stops. Once stopped, it emits no messages and never recovers. Easiest to reason about; default in Raft and most practical consensus.
- Crash-recovery: a faulty process may stop and later recover, possibly losing in-memory state. Persistent log required for safety.
- Omission: a faulty process may fail to send or receive some messages but otherwise follows the protocol. This captures many real network and overload faults.
- Byzantine (arbitrary): a faulty process may do anything - send conflicting messages to different peers, forge signatures (if none required), collude with other faulty nodes, or lie about its state. The name comes from Lamport's Byzantine Generals Problem (1982).
Connection to this module (SUPPORTING). The primary consensus algorithms you will study (Paxos, Raft) assume crash failures. They are not safe under Byzantine behavior. Byzantine-fault-tolerant (BFT) algorithms like PBFT (Castro & Liskov, 1999) require a much larger quorum - specifically, tolerating f Byzantine nodes requires n ≥ 3f + 1 total nodes, compared to n ≥ 2f + 1 for crash faults.
The reason: in crash-fault consensus, a majority (f+1 out of 2f+1) agreeing is enough because the only unknown is which nodes have crashed. Under Byzantine fault, you need enough honest nodes to out-vote colluding liars and to detect contradictions - hence the extra factor.
Why It Matters Here
Most engineers never implement a Byzantine-fault-tolerant system. But the trust model question ("what could a faulty node do?") is always part of your design:
- Inside one datacenter under one administrator: crash-stop is usually enough.
- Across a multi-tenant cloud with potential misconfiguration but no malice: crash + omission.
- Across administrative domains without mutual trust (blockchain, some cross-organization systems): Byzantine, and you need BFT or a different architecture.
- With compromised or corrupted nodes as part of the threat model (supply-chain attack, malicious operator): Byzantine.
Choosing a stronger failure model than you need costs you 3× more nodes and much more complexity. Choosing a weaker one than you need gives you no guarantees when the assumption is violated.
Concrete Example
A 3-node Raft cluster. The leader gets silently corrupted (a cosmic-ray bit flip, or worse, a malicious operator) and starts sending different log entries to different followers - entry A to follower 1 and entry B to follower 2, both at the same log index.
Under the Raft model: this cannot happen. A correct leader sends the same entry to all followers. The algorithm assumes honest-but-possibly-crashed nodes.
Under Byzantine model: this can happen, and Raft's majority-commit rule can be exploited. Follower 1 and the leader commit A; follower 2 sees B; a client reading from follower 2 gets an inconsistent answer. No safety argument in Raft's paper prevents this.
PBFT solves this by requiring a 2f+1 match in a 3f+1 cluster for each phase, plus message signing and cross-validation among replicas. The cost: 4 nodes to tolerate 1 Byzantine failure (instead of 3 for 1 crash failure), and more message rounds.
Common Confusion / Misconception
"Byzantine just means 'weird failures we didn't anticipate.'" No. It specifically means a faulty node may behave arbitrarily, including sending inconsistent messages to different peers. Weird-but-consistent failures (e.g., slow, lossy, crash-recovery) are captured by weaker models.
A second misconception: "TLS solves Byzantine." TLS makes in-flight messages tamper-resistant and authenticates the peer, but it does not protect against a fully-authenticated peer that is itself compromised or buggy. BFT is about distrusting the endpoint, not just the wire.
A third: "Blockchain == BFT == solved." Blockchains like Bitcoin use probabilistic, proof-of-work Byzantine-tolerance which is different from PBFT. They assume the honest majority controls more than 50% of computational power and pay a latency cost (minutes) for this. Classical PBFT-family protocols give deterministic finality in rounds but require a fixed membership.
How To Use It
When you design a system:
- Write down the trust boundary: who controls each node? Are they mutually cooperating, mutually distrustful, or somewhere in between?
- Pick the weakest failure model that honestly covers your threat set. Usually crash-stop for intra-cluster, omission for across WAN, Byzantine only if you must.
- If Byzantine is on the table, budget for the 3f+1 cost and for message signing, and strongly prefer an existing BFT library over rolling your own.
- Document the model. When a future engineer widens the deployment to an untrusted environment, the model tells them what assumptions break.
Check Yourself
- What is the defining difference between omission and Byzantine failure?
- Why does Byzantine consensus require 3f+1 instead of 2f+1?
- Give one environment where Byzantine is the honest failure model and one where crash-stop is.
- Why does TLS not solve Byzantine fault tolerance?
Mini Drill or Application
For a system you have worked on, list its deployment environments (dev, staging, prod, edge, third-party SaaS). For each, write down the weakest failure model that is honest. Note which environments quietly assume crash-stop but could tolerate Byzantine if taken seriously.