Identifying Your Capstone's Three Most Likely Failures

What This Concept Is

Capacity to fail is unlimited; time to prepare is not. Failure planning starts with ranking. The question is not "what could possibly fail?" -- that list is infinite. The question is "what is likely and high impact, given the system I actually built?"

A simple 2x2 gets most of the way:

	Low impact	High impact
Unlikely	Ignore	Accept + monitor
Likely	Mitigate cheaply	Prepare: mitigation + runbook

You need to identify three failures in the bottom-right cell. Three is enough to be concrete. Fewer is under-preparation; more is theater.
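The 2x2 above can be expressed as a small lookup, which is handy when triaging a long brainstorm list. This is a minimal sketch: the function name `triage` and the boolean inputs are illustrative assumptions, and the cell labels are copied from the table.

```python
# Cell labels from the 2x2, keyed by (likely, high_impact).
CELLS = {
    (False, False): "Ignore",
    (False, True): "Accept + monitor",
    (True, False): "Mitigate cheaply",
    (True, True): "Prepare: mitigation + runbook",
}


def triage(likely: bool, high_impact: bool) -> str:
    """Return the 2x2 cell for a failure (hypothetical helper)."""
    return CELLS[(likely, high_impact)]


# A likely, high-impact failure lands in the bottom-right cell.
print(triage(likely=True, high_impact=True))
```

Collapsing likelihood and impact to booleans is a deliberate simplification; a finer scale is useful for ordering within a cell, not for choosing the cell.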

The ranking is informed by: your architecture (single-AZ? single DB? one external dependency?), your traffic (thin? spiky?), your dependencies (provider SLAs? free tier limits?), and history (what has already broken in development and staging). Near-misses are the best predictor of what will actually hit production.

A failure-mode taxonomy helps the brainstorm. Most real incidents fall into one of four buckets: cascading (one slow dependency drags down its callers), correlated (many instances hit the "same" failure at once because they share a hidden resource), gray (partial failure where some requests succeed and others do not), and human (bad deploy, wrong flag, bad config). Your brainstormed list, not just the top three, should have at least one entry from each bucket.
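A quick way to check bucket coverage is to tag each brainstormed failure and look for empty buckets. The sketch below assumes a plain dict of failure name to bucket; the helper name `missing_categories` and the bucket assigned to each sample failure are illustrative assumptions, not part of the capstone template.

```python
CATEGORIES = {"cascading", "correlated", "gray", "human"}


def missing_categories(failures: dict[str, str]) -> set[str]:
    """Return the taxonomy buckets that have no brainstormed failure."""
    return CATEGORIES - set(failures.values())


# Hypothetical tagging of three brainstormed failures.
brainstorm = {
    "Notification API slow (not down)": "cascading",
    "AZ outage": "correlated",
    "Bad deploy": "human",
}

print(sorted(missing_categories(brainstorm)))  # ['gray']
```

An empty bucket does not prove the system is immune to that kind of failure; more often it means the brainstorm has not looked there yet.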

Why It Matters Here (In the Capstone)

The next two concepts -- mitigations and backup/recovery -- ask you to invest engineering time. Without a ranking, you will either invest in the wrong failures or spread investment so thin that nothing is actually prepared.

Likewise, the runbooks in Cluster 5 cover exactly the incidents on this list. "Top three" is a forcing function: it concentrates your attention on failures that will actually happen. At PRR time you want to be able to say, "these three are what I worried about most, here is the mitigation for each, here is the runbook for each, here is the budget impact of each in my SLO math."

Concrete Example -- from a real capstone

For the webhook-handler capstone (single-region, small Postgres, one downstream notification API):

Brainstormed failure set (raw):

Postgres instance restart
AZ outage
Region outage
Notification API 5xx storm
Notification API slow (not down)
Secrets Manager throttling
Queue backlog
Provider sends 10x normal webhook volume
Disk fills up
TLS cert expires
Leaked API key
Bad deploy
Schema migration data loss
DDOS / spam traffic
Developer laptop compromised

Rank by likelihood * impact, informed by architecture:

#	Failure	Likelihood	Impact	Evidence / notes
1	Bad deploy	High	High	Happens monthly on a team of 1; one regression can blow budget in minutes
2	Notification API slow	High	Medium-High	We have no timeout; backpressure will propagate to our SLO
3	Schema migration data loss	Medium	Catastrophic	Low frequency but unbounded blast radius; one bad migration = data incident
4	Provider volume spike	Medium	Medium	Webhook bursts are normal; queue smooths but API may 429
5	TLS cert expires	Low	High	Automated, but the automation has failed before
6	AZ outage	Low	Medium	Provider claims 99.99%; accept + monitor
7	Region outage	Very low	Very high	Multi-region is out of scope; accept + publish expectation

Top three for preparation: bad deploy, slow notification API, schema migration data loss.
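The likelihood * impact ranking can be sketched numerically. The ordinal weights below are assumptions chosen for illustration (the table uses labels, not numbers), so treat the scores as a sorting aid rather than a measurement. With these weights the top-three set matches the one above, although the order inside the list differs from the table, which also reflects judgment and evidence.

```python
# Hypothetical ordinal weights for the labels used in the table.
LIKELIHOOD = {"Very low": 1, "Low": 2, "Medium": 3, "High": 4}
IMPACT = {"Medium": 2, "Medium-High": 3, "High": 4, "Very high": 5, "Catastrophic": 5}

# (failure, likelihood, impact) rows copied from the ranking table.
FAILURES = [
    ("Bad deploy", "High", "High"),
    ("Notification API slow", "High", "Medium-High"),
    ("Schema migration data loss", "Medium", "Catastrophic"),
    ("Provider volume spike", "Medium", "Medium"),
    ("TLS cert expires", "Low", "High"),
    ("AZ outage", "Low", "Medium"),
    ("Region outage", "Very low", "Very high"),
]


def score(likelihood: str, impact: str) -> int:
    """Multiply the two ordinal weights into a single ranking score."""
    return LIKELIHOOD[likelihood] * IMPACT[impact]


ranked = sorted(FAILURES, key=lambda row: score(row[1], row[2]), reverse=True)
top_three = [name for name, _, _ in ranked[:3]]
print(top_three)
```

If a small change in the weights changes which failures make the top three, that is a signal to gather more evidence rather than to tune the numbers.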

Notice that AZ outage and region outage are not in the top three. In a capstone with a single-region deployment, mitigating a region failure is expensive, and the failure itself is unlikely during the demo window. Both go in "accept + monitor" until the architecture justifies more. Documenting the decision to accept is as important as the list itself -- it is what prevents someone from "adding multi-region support" as a reaction to the wrong alert later.

For each top-three failure, write a one-paragraph operational summary:

Bad deploy. Symptom: sudden rise in 5xx on the /webhook/receive endpoint immediately following a deploy. How we would know: error-rate SLO burn-rate alert fires within ~2 minutes. First thing to check: compare git log --since="10 minutes ago" against the burn start; confirm via deploy tag correlation panel in the dashboard. Estimated blast radius: 100% of new ingress until rollback.

That one paragraph is the precise input to the mitigation (concept 11) and the runbook (concept 13).
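The "first thing to check" in the bad-deploy summary is a time correlation: did a deploy land shortly before the burn started? A minimal sketch of that check follows, assuming deploys are available as (tag, timestamp) pairs. The function name `suspect_deploys`, the tags, and the timestamps are all hypothetical; the 10-minute window mirrors the git log example above.

```python
from datetime import datetime, timedelta


def suspect_deploys(deploys, burn_start, window=timedelta(minutes=10)):
    """Return tags of deploys that landed within `window` before burn_start."""
    return [
        tag
        for tag, deployed_at in deploys
        if burn_start - window <= deployed_at <= burn_start
    ]


# Hypothetical deploy history and alert time.
deploys = [
    ("v41", datetime(2024, 1, 1, 9, 0)),
    ("v42", datetime(2024, 1, 1, 14, 56)),
]
burn_start = datetime(2024, 1, 1, 15, 0)

print(suspect_deploys(deploys, burn_start))  # ['v42']
```

A match is a reason to consider rollback first and investigate second; an empty result points away from the deploy and toward a dependency or traffic change.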

Common Confusion / Misconceptions

"We should plan for everything." You can plan for everything, or you can plan for three things well. For a capstone you do not have the budget to do the first. Do the second.

"The most exciting failure wins." Region outages, zero-days, state-actor attacks -- all intellectually rich, mostly irrelevant to your capstone. Rank by the evidence in your own git history and staging logs. The boring failures are the high-probability ones.

"We've never had a bad deploy, so it's not on the list." Survivorship bias. If your capstone has been live for six weeks with two developers, you have a small sample. Treat near-misses (rolled back at the last second, caught in staging) as signal for what will hit prod once you speed up.

"High impact alone puts it in the top three." Impact matters, but low-probability + high-impact items belong in "accept + monitor" until probability rises. A specific, named failure that is both likely and high impact beats a theoretical high-impact one every time.

"The top three never changes." It should change as the system changes. When you add a new dependency, the top three gets re-ranked. When a near-miss happens, the list gets re-ranked. If it has been static for six months, either your system is frozen or you are not looking.

How To Use It (In Your Capstone)

Brainstorm failure modes against your real architecture. Aim for 15-25 entries.
For each, rate likelihood and impact honestly as low, medium, or high, citing evidence where possible.
Informed by actual history (deploys, staging incidents, dependency SLAs), promote three to the "prepare" cell.
Write a one-paragraph statement per top-three failure: what breaks, how we would know, first thing to check, estimated blast radius.
Commit to library/raw/top-failures.md. Link from the PRR checklist.
Re-rank after any significant architecture change or any near-miss incident.
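The one-paragraph statement per failure can be kept in a fixed shape so that every entry in library/raw/top-failures.md answers the same four questions. The sketch below is one possible structure, not a required format: the class name `FailureSummary` and its field names are assumptions, and the sample values are taken from the bad-deploy example.

```python
from dataclasses import dataclass


@dataclass
class FailureSummary:
    """One top-three failure, in the four-part shape described above."""

    name: str
    how_we_would_know: str
    first_check: str
    blast_radius: str

    def render(self) -> str:
        """Render the summary as four labeled plain-text lines."""
        return "\n".join([
            f"Failure: {self.name}",
            f"How we would know: {self.how_we_would_know}",
            f"First thing to check: {self.first_check}",
            f"Estimated blast radius: {self.blast_radius}",
        ])


summary = FailureSummary(
    name="Bad deploy",
    how_we_would_know="error-rate SLO burn-rate alert fires within ~2 minutes",
    first_check="compare recent deploys against the burn start",
    blast_radius="100% of new ingress until rollback",
)
print(summary.render())
```

Making every field required means an entry cannot be created while a question is still unanswered, which is the point of the exercise.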

Check Yourself

Why does "AZ outage" often end up in "accept + monitor" for a capstone instead of "prepare"?
What kind of evidence should shift a failure from "unlikely" to "likely"?
Why does the ranking feed both Cluster 4 mitigations and Cluster 5 runbooks, not just one?
Give an example of a gray failure mode for your capstone and explain why it is often missed during brainstorming.
Under what circumstances does the top-three list change, and who is responsible for updating it?

Mini Drill or Application (Capstone-scoped)

Open library/raw/top-failures.md. List 15-25 failure modes for your capstone. At least one per category (cascading, correlated, gray, human).
Score each likelihood and impact, using your staging logs and history as evidence.
Pick the three that land in the bottom-right ("prepare") cell.
For each, write: Failure / How we would know / First thing to check / Estimated blast radius.
Identify the two or three failures that go in the "accept + monitor" cell. Write one sentence each explaining the acceptance.
The top three failures are the inputs to concepts 11, 12, and 13. Do not proceed without them.

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.

Building Secure and Reliable Systems - primary security and reliability backbone.
Software Engineering at Google - operational process and engineering discipline.
Designing Distributed Systems - service and reliability pattern support.
