Operational Katas

Focused, repeatable exercises. Complete each kata multiple times until the setup feels automatic. The goal is not novelty -- it is the ability to do the right thing fast, on your real capstone, under stress.

Kata 1: Write One Real SLI + SLO + Alert

Time limit: 20 minutes
Goal: Go from blank page to committed SLO document and wired burn-rate alert.
Setup: Your capstone repo, a blank library/raw/slo.md, and a metric source you can query.

  1. Pick one user journey that defines "working."
  2. Write the SLI as a ratio of events: good / total.
  3. Pick a target and window (e.g., 99.5% / 30d). Defend the numbers with current traffic.
  4. Compute the error budget in percent and in absolute events.
  5. Write a one-sentence consequence: what the team does when the error budget is exhausted.
  6. Encode the fast-burn alert in your monitoring tool with the multi-window pattern from Cluster 1 Concept 3.
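
Steps 2, 4, and 6 can be sketched in a few lines of Python. This is a minimal sketch, not the Cluster 1 Concept 3 definition: the names `Slo`, `burn_rate`, and `fast_burn_fires` are hypothetical, and the 14.4 threshold is an assumed placeholder (a commonly cited fast-burn value for a 30-day window). Substitute the thresholds and windows from your own material.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    target: float      # e.g. 0.995 for 99.5%
    window_days: int   # e.g. 30

    @property
    def budget_fraction(self) -> float:
        # Error budget as a fraction of all events in the window.
        return 1.0 - self.target

    def budget_events(self, total_events_in_window: int) -> float:
        # Error budget in absolute events for the expected traffic.
        return self.budget_fraction * total_events_in_window


def sli(good: int, total: int) -> float:
    # SLI as a ratio of events: good / total. No traffic counts as healthy.
    return good / total if total else 1.0


def burn_rate(good: int, total: int, slo: Slo) -> float:
    # 1.0 means errors arrive exactly fast enough to spend the budget
    # by the end of the window; higher means faster.
    return (1.0 - sli(good, total)) / slo.budget_fraction


def budget_spent(rate: float, hours: float, slo: Slo) -> float:
    # Fraction of the whole budget consumed by `hours` at this burn rate.
    return rate * hours / (slo.window_days * 24)


def fast_burn_fires(long_window, short_window, slo, threshold=14.4):
    # Multi-window pattern: the long window shows the burn is significant,
    # the short window shows it is still happening right now.
    # threshold=14.4 is an assumption, not a value from this document.
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)
```

For example, with `Slo(target=0.995, window_days=30)` and roughly 2,000,000 events per window, `budget_events` comes to about 10,000 bad events. Passing `(good, total)` counts for a long and a short window to `fast_burn_fires` shows whether both windows exceed the threshold, which is the condition you would encode in your monitoring tool.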

Repeat until: You can do this start-to-finish in ≤ 20 minutes on any realistic service, including explaining every threshold without lookup.

Kata 2: Instrument a Trace for One Critical Path

Time limit: 30 minutes
Goal: One real user request produces a complete distributed trace across every significant hop.
Setup: Your capstone with at least one external call and one async boundary.

  1. Identify the entry point and every downstream call on the critical path.
  2. Add OpenTelemetry (or equivalent) instrumentation at the entry point.
  3. Propagate the W3C Trace Context traceparent header into every HTTP client call and into queue message attributes.
  4. On the consumer side, extract traceparent and open a child span.
  5. Fire one real request. Open the trace. Verify every hop appears.
  6. If any hop is missing, fix it and repeat.
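
The mechanics behind steps 3 and 4 can be sketched with the standard library alone. This is an illustration of what the traceparent value carries, not the OpenTelemetry API; in the kata itself your instrumentation library does this for you. The function names and the message layout are hypothetical, and the parser is simplified to accept only the version-00 layout.

```python
import re
import secrets

# version "00", 32-hex trace id, 16-hex parent (span) id, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    # Start a new trace at the entry point.
    trace_id = secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)     # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    # Return (trace_id, parent_id, flags), or None if the header is unusable.
    match = TRACEPARENT_RE.match(header.strip().lower())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid in the W3C Trace Context format.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, flags


def start_child_span(incoming=None) -> str:
    # Keep the trace id, mint a new span id. If the incoming header is
    # missing or invalid, a new trace starts here, which is exactly the
    # "missing hop" symptom you look for in step 5.
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is None:
        return new_traceparent()
    trace_id, _parent_id, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Producer side: copy the header into (hypothetical) message attributes.
entry = new_traceparent()
message = {"body": "example", "attributes": {"traceparent": start_child_span(entry)}}

# Consumer side: extract the header and open a child span in the same trace.
consumer_span = start_child_span(message["attributes"].get("traceparent"))
```

If the consumer's trace id differs from the entry point's, the header was dropped or mangled somewhere between the two, and that boundary is the hop to fix.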

Repeat until: The full critical-path trace comes back correct on a fresh request in under 30 minutes.

Kata 3: STRIDE on the Capstone

Time limit: 45 minutes
Goal: One STRIDE pass over the highest-value trust-boundary flow with at least one gap walked to a deployed mitigation.
Setup: Your DFD and a blank six-row STRIDE table for one flow.

  1. Draw or refresh the DFD; mark trust boundaries.
  2. Pick the flow that crosses the highest-value boundary.
  3. Fill a six-row STRIDE table. Each cell: mitigated / gap / accepted, with evidence.
  4. Pick the most valuable gap.
  5. Walk it end-to-end: threat -> evidence -> mitigation -> detection -> residual -> deployed.
  6. Commit the worksheet.
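
The six-row table from step 3 can be sketched as a small data structure with a completeness check. This is one possible shape, assuming one row per STRIDE category; the names `Row`, `blank_table`, `problems`, and `gaps` are hypothetical, and your worksheet format may differ.

```python
from dataclasses import dataclass
from typing import List

# The six STRIDE categories, one table row each.
STRIDE = (
    "Spoofing",
    "Tampering",
    "Repudiation",
    "Information disclosure",
    "Denial of service",
    "Elevation of privilege",
)
STATUSES = {"mitigated", "gap", "accepted"}


@dataclass
class Row:
    threat: str
    status: str = ""     # mitigated / gap / accepted
    evidence: str = ""   # link, config path, test name, or decision record


def blank_table() -> List[Row]:
    # A blank six-row table for one flow.
    return [Row(threat) for threat in STRIDE]


def problems(table: List[Row]) -> List[str]:
    # Every cell needs a valid status and some evidence.
    issues = []
    for row in table:
        if row.status not in STATUSES:
            issues.append(f"{row.threat}: status must be one of {sorted(STATUSES)}")
        elif not row.evidence.strip():
            issues.append(f"{row.threat}: missing evidence")
    return issues


def gaps(table: List[Row]) -> List[Row]:
    # Candidates for step 4: pick the most valuable one and walk it.
    return [row for row in table if row.status == "gap"]
```

A table is ready to commit when `problems(table)` returns an empty list; `gaps(table)` then gives the shortlist for the end-to-end walk in step 5.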

Repeat until: You can produce a credible STRIDE pass on any flow in under 45 minutes, with at least one walk completed.

Kata 4: Write a 1-Page Runbook for a Known Incident

Time limit: 25 minutes
Goal: A runbook for a top-three failure that a peer says they could follow alone.
Setup: One of your top-three failures from library/raw/top-failures.md.

  1. Open the runbook template (Cluster 5 Concept 13).
  2. Fill Symptoms -- specific alert names, dashboard signals, user-report phrasing.
  3. Fill Impact -- in one sentence, tied to the SLO.
  4. Fill Checks -- 3-6 ordered diagnostic steps.
  5. Fill Mitigations -- each with an expected effect and a rollback.
  6. Fill Escalation -- when to post, when to declare, who is the fallback.
  7. Commit to library/raw/runbooks/<slug>.md. Ask a peer to read only the runbook and say whether they could follow it alone.

Repeat until: A peer can say "yes, I could follow this cold" on a first read.

Completion Standard

  • Each kata completed within its time limit at least twice
  • Each kata produced a concrete artifact in the capstone repo
  • You no longer need to look up thresholds, templates, or step numbers for any kata
  • A trusted peer has validated the runbook and the SLO document
  • All four katas feed directly into PRR items in library/raw/prr.md