Identifying Your Capstone's Three Most Likely Failures

What This Concept Is

Capacity to fail is unlimited; time to prepare is not. Failure planning starts with ranking. The question is not "what could possibly fail?" -- that list is infinite. The question is "what is likely and high impact, given the system I actually built?"

A simple 2x2 gets most of the way:

|          | Low impact       | High impact                   |
|----------|------------------|-------------------------------|
| Unlikely | Ignore           | Accept + monitor              |
| Likely   | Mitigate cheaply | Prepare: mitigation + runbook |

You need to identify three failures in the bottom-right cell. Three is enough to be concrete. Fewer is under-preparation; more is theater.
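The 2x2 can be written down as a small lookup, which makes the triage rule explicit. This is a minimal sketch: the function name `triage` and the lowercase labels are illustrative assumptions, not part of the course material.

```python
# The four cells of the 2x2, keyed by (likelihood, impact).
# Labels are lowercase versions of the table headings (an assumption).
ACTIONS = {
    ("unlikely", "low"): "ignore",
    ("unlikely", "high"): "accept + monitor",
    ("likely", "low"): "mitigate cheaply",
    ("likely", "high"): "prepare: mitigation + runbook",
}


def triage(likelihood: str, impact: str) -> str:
    """Return the 2x2 action for one failure mode."""
    return ACTIONS[(likelihood, impact)]


print(triage("likely", "high"))    # prepare: mitigation + runbook
print(triage("unlikely", "high"))  # accept + monitor
```

Only failures that come back as "prepare: mitigation + runbook" are candidates for the top three.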

The ranking is informed by: your architecture (single-AZ? single DB? one external dependency?), your traffic (thin? spiky?), your dependencies (provider SLAs? free tier limits?), and history (what has already broken in development and staging). Near-misses are the best predictor of what will actually hit production.

A failure-mode taxonomy helps the brainstorm. Most real incidents fall into one of four buckets: cascading (one slow dependency drags down its callers), correlated (many instances of the "same" failure at once because they share a hidden resource), gray (partial failure where some requests succeed), and human (bad deploy, wrong flag, bad config). Your brainstormed list should have at least one entry from each bucket; the final top three does not need to cover all four.
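A quick way to check taxonomy coverage is to tag each brainstormed failure with a bucket and look for buckets that are still empty. The sketch below is illustrative only: `missing_categories` is a hypothetical helper, and the bucket assigned to each failure is a judgment call, not something the taxonomy dictates.

```python
CATEGORIES = {"cascading", "correlated", "gray", "human"}


def missing_categories(brainstorm: dict[str, str]) -> set[str]:
    """Return the taxonomy buckets that have no entry in the brainstorm.

    `brainstorm` maps a failure name to its bucket label.
    """
    return CATEGORIES - set(brainstorm.values())


# Hypothetical labels for three entries of a brainstorm.
brainstorm = {
    "Notification API slow": "cascading",
    "Bad deploy": "human",
    "Secrets Manager throttling": "correlated",
}

print(missing_categories(brainstorm))  # {'gray'}
```

An empty result means every bucket has at least one candidate. A non-empty result is a prompt to brainstorm further, not proof that the system has no failures of that kind.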

Why It Matters Here (In the Capstone)

The next two concepts -- mitigations and backup/recovery -- ask you to invest engineering time. Without a ranking, you will either invest in the wrong failures or spread investment so thin that nothing is actually prepared.

Likewise, the runbooks in Cluster 5 cover exactly the incidents on this list. "Top three" is a forcing function: it concentrates your attention on failures that will actually happen. At PRR time you want to be able to say, "these three are what I worried about most, here is the mitigation for each, here is the runbook for each, here is the budget impact of each in my SLO math."

Concrete Example -- from a real capstone

For the webhook-handler capstone (single-region, small Postgres, one downstream notification API):

Brainstormed failure set (raw):

  • Postgres instance restart
  • AZ outage
  • Region outage
  • Notification API 5xx storm
  • Notification API slow (not down)
  • Secrets Manager throttling
  • Queue backlog
  • Provider sends 10x normal webhook volume
  • Disk fills up
  • TLS cert expires
  • Leaked API key
  • Bad deploy
  • Schema migration data loss
  • DDoS / spam traffic
  • Developer laptop compromised

Rank by likelihood * impact, informed by architecture:

| # | Failure                    | Likelihood | Impact       | Evidence / notes |
|---|----------------------------|------------|--------------|------------------|
| 1 | Bad deploy                 | High       | High         | Happens monthly on a team of 1; one regression can blow budget in minutes |
| 2 | Notification API slow      | High       | Medium-High  | We have no timeout; backpressure will propagate to our SLO |
| 3 | Schema migration data loss | Medium     | Catastrophic | Low frequency but unbounded blast radius; one bad migration = data incident |
| 4 | Provider volume spike      | Medium     | Medium       | Webhook bursts are normal; queue smooths but API may 429 |
| 5 | TLS cert expires           | Low        | High         | Automated, but the automation has failed before |
| 6 | AZ outage                  | Low        | Medium       | Provider claims 99.99%; accept + monitor |
| 7 | Region outage              | Very low   | Very high    | Multi-region is out of scope; accept + publish expectation |

Top three for preparation: bad deploy, slow notification API, schema migration data loss.
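The "likelihood * impact" ranking can be roughed out numerically. The sketch below assumes ordinal weights for each label; those numbers are invented for illustration and are not from the capstone. With these weights the product yields the same top three, but not the same order as the table, whose final ranking also reflects architectural judgment.

```python
# Assumed ordinal weights; pick your own and keep them consistent.
LIKELIHOOD = {"very low": 1, "low": 2, "medium": 3, "high": 4}
IMPACT = {"medium": 2, "medium-high": 3, "high": 4,
          "very high": 5, "catastrophic": 6}

# (failure, likelihood, impact) copied from the ranking table.
FAILURES = [
    ("Bad deploy", "high", "high"),
    ("Notification API slow", "high", "medium-high"),
    ("Schema migration data loss", "medium", "catastrophic"),
    ("Provider volume spike", "medium", "medium"),
    ("TLS cert expires", "low", "high"),
    ("AZ outage", "low", "medium"),
    ("Region outage", "very low", "very high"),
]


def score(likelihood: str, impact: str) -> int:
    """Multiply the two ordinal weights into a single ranking score."""
    return LIKELIHOOD[likelihood] * IMPACT[impact]


ranked = sorted(FAILURES, key=lambda f: score(f[1], f[2]), reverse=True)
top_three = [name for name, _, _ in ranked[:3]]

for name, likelihood, impact in ranked:
    print(f"{score(likelihood, impact):>2}  {name}")
```

Treat the score as a first pass that surfaces candidates. The evidence column is what justifies promoting or demoting an entry afterwards.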

Notice that AZ outage and region outage are not in the top three. In a capstone with single-region deployment, mitigating region failure is expensive, and the failure itself is unlikely during the demo window. It goes in "accept + monitor" until the architecture justifies it. Documenting the decision to accept is as important as the list itself -- it is what prevents someone from "adding multi-region support" as a reaction to the wrong alert later.

For each top-three failure, write a one-paragraph operational summary:

Bad deploy. Symptom: sudden rise in 5xx on the /webhook/receive endpoint immediately following a deploy. How we would know: error-rate SLO burn-rate alert fires within ~2 minutes. First thing to check: compare git log --since="10 minutes ago" against the burn start; confirm via deploy tag correlation panel in the dashboard. Estimated blast radius: 100% of new ingress until rollback.

That one paragraph is the precise input to the mitigation (concept 11) and the runbook (concept 13).
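The "first thing to check" for a bad deploy is a time correlation between deploys and the start of the burn. A minimal sketch of that check follows; `suspect_deploys`, the 10-minute window default, and the timestamps are all hypothetical, and in practice the deploy times would come from your deploy tags or git history.

```python
from datetime import datetime, timedelta


def suspect_deploys(deploy_times, burn_start, window=timedelta(minutes=10)):
    """Return deploys that landed within `window` before the burn started."""
    return [
        t for t in deploy_times
        if timedelta(0) <= burn_start - t <= window
    ]


# Hypothetical timestamps for illustration.
deploys = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 14, 58)]
burn_start = datetime(2024, 1, 1, 15, 0)

print(suspect_deploys(deploys, burn_start))
# [datetime.datetime(2024, 1, 1, 14, 58)]
```

An empty result does not clear the deploy pipeline (a flag flip or config change may not show up as a deploy), but a hit is a strong reason to start with rollback.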

Common Confusion / Misconceptions

"We should plan for everything." You can plan for everything, or you can plan for three things well. For a capstone you do not have the budget to do the first. Do the second.

"The most exciting failure wins." Region outages, zero-days, state-actor attacks -- all intellectually rich, mostly irrelevant to your capstone. Rank by the evidence in your own git history and staging logs. The boring failures are the high-probability ones.

"We've never had a bad deploy, so it's not on the list." Survivorship bias. If your capstone has been live for six weeks with two developers, you have a small sample. Look at near misses (rolled back at the last second, caught in staging) as signal for what will hit prod once you speed up.

"High impact alone puts it in the top three." Impact matters, but low-probability + high-impact items belong in "accept + monitor" until probability rises. A specific, named failure that is both likely and high impact beats a theoretical high-impact one every time.

"The top three never changes." It should change as the system changes. When you add a new dependency, the top three gets re-ranked. When a near-miss happens, the list gets re-ranked. If it has been static for six months, either your system is frozen or you are not looking.

How To Use It (In Your Capstone)

  1. Brainstorm failure modes against your real architecture. Aim for 15-25 entries.
  2. For each, rate likelihood and impact honestly as low, medium, or high, citing evidence where possible.
  3. Informed by actual history (deploys, staging incidents, dependency SLAs), promote three to the "prepare" cell.
  4. Write a one-paragraph statement per top-three failure: what breaks, how we would know, first thing to check, estimated blast radius.
  5. Commit to library/raw/top-failures.md. Link from the PRR checklist.
  6. Re-rank after any significant architecture change or any near-miss incident.
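Step 4 asks for the same four fields for every top-three failure, so a small record type can keep the entries uniform. This is a sketch under assumptions: the `FailureSummary` class and its plain-text layout are illustrative, and the course does not prescribe a format for top-failures.md.

```python
from dataclasses import dataclass


@dataclass
class FailureSummary:
    name: str
    what_breaks: str
    how_we_know: str
    first_check: str
    blast_radius: str

    def render(self) -> str:
        """Render one entry as indented plain text (layout is an assumption)."""
        return "\n".join([
            self.name,
            f"  What breaks: {self.what_breaks}",
            f"  How we would know: {self.how_we_know}",
            f"  First thing to check: {self.first_check}",
            f"  Estimated blast radius: {self.blast_radius}",
        ])


# Values paraphrased from the bad-deploy summary in the example.
entry = FailureSummary(
    name="Bad deploy",
    what_breaks="sudden rise in 5xx on /webhook/receive right after a deploy",
    how_we_know="error-rate SLO burn-rate alert fires within ~2 minutes",
    first_check="compare recent git log against the burn start",
    blast_radius="100% of new ingress until rollback",
)

print(entry.render())
```

Because every field is required, a missing "how we would know" fails at construction time instead of surfacing during an incident.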

Check Yourself

  1. Why does "AZ outage" often end up in "accept + monitor" for a capstone instead of "prepare"?
  2. What kind of evidence should shift a failure from "unlikely" to "likely"?
  3. Why does the ranking feed both Cluster 4 mitigations and Cluster 5 runbooks, not just one?
  4. Give an example of a gray failure mode for your capstone and explain why it is often missed during brainstorming.
  5. Under what circumstances does the top-three list change, and who is responsible for updating it?

Mini Drill or Application (Capstone-scoped)

  1. Open library/raw/top-failures.md. List 15-25 failure modes for your capstone. At least one per category (cascading, correlated, gray, human).
  2. Score each likelihood and impact, using your staging logs and history as evidence.
  3. Pick the three that land in the bottom-right ("prepare") cell: likely and high impact.
  4. For each, write: Failure / How we would know / First thing to check / Estimated blast radius.
  5. Identify the two or three failures that go in the "accept + monitor" cell. Write one sentence each explaining the acceptance.
  6. The top three from step 3 are the inputs to concepts 11, 12, and 13. Do not proceed without them.

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.