Operational Katas
Focused, repeatable exercises. Complete each kata multiple times until the setup feels automatic. The goal is not novelty -- it is the ability to do the right thing fast, on your real capstone, under stress.
Kata 1: Write One Real SLI + SLO + Alert
Time limit: 20 minutes
Goal: Go from blank page to committed SLO document and wired burn-rate alert.
Setup: Your capstone repo, a blank library/raw/slo.md, and a metric source you can query.
- Pick one user journey that defines "working."
- Write the SLI as a ratio of events: good / total.
- Pick a target and window (e.g., 99.5% / 30d). Defend the numbers with current traffic.
- Compute the error budget in percent and in absolute events.
- Write the one-sentence consequence: what happens when the error budget is exhausted.
- Encode the fast-burn alert in your monitoring tool with the multi-window pattern from Cluster 1 Concept 3.
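The budget arithmetic and the fast-burn check from the steps above can be sketched in a few lines. This is a minimal illustration, not the pattern from Cluster 1 Concept 3 itself: the names (Slo, error_budget, burn_rate_threshold, fast_burn_fires) are hypothetical, and the 2%-of-budget-in-one-hour figure is an assumption taken from commonly cited multi-window guidance. Substitute whatever thresholds your course material and monitoring tool prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    target: float      # e.g. 0.995 for 99.5%
    window_days: int   # e.g. 30


def error_budget(slo: Slo, expected_total_events: int) -> tuple[float, int]:
    """Return the budget as (fraction of events, absolute bad events allowed)."""
    fraction = 1.0 - slo.target
    return fraction, round(fraction * expected_total_events)


def burn_rate(bad: int, total: int, slo: Slo) -> float:
    """Observed error ratio divided by the budgeted error ratio."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo.target)


def burn_rate_threshold(budget_fraction_spent: float,
                        alert_window_hours: float, slo: Slo) -> float:
    """Burn rate that spends the given share of the budget within the alert window."""
    return budget_fraction_spent * (slo.window_days * 24) / alert_window_hours


def fast_burn_fires(long_window: tuple[int, int], short_window: tuple[int, int],
                    slo: Slo, threshold: float) -> bool:
    """Multi-window check: each window is (bad, total); both must exceed the threshold."""
    return (burn_rate(*long_window, slo) > threshold
            and burn_rate(*short_window, slo) > threshold)


if __name__ == "__main__":
    slo = Slo(target=0.995, window_days=30)
    # Assumed traffic: 2,000,000 events per 30 days (replace with your real numbers).
    print(error_budget(slo, 2_000_000))
    # Assumed policy: alert when 2% of the budget would be spent in 1 hour.
    threshold = burn_rate_threshold(0.02, 1, slo)
    # Long window (e.g. 1h) and short window (e.g. 5m), each as (bad, total).
    print(fast_burn_fires((800, 10_000), (80, 1_000), slo, threshold))
```

Requiring both windows keeps the alert from firing on a spike that has already ended: the long window shows the burn is significant, and the short window shows it is still happening.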
Repeat until: You can do this start-to-finish in ≤ 20 minutes on any realistic service, including explaining every threshold without lookup.
Kata 2: Instrument a Trace for One Critical Path
Time limit: 30 minutes
Goal: One real user request produces a complete distributed trace across every significant hop.
Setup: Your capstone with at least one external call and one async boundary.
- Identify the entry point and every downstream call on the critical path.
- Add OpenTelemetry (or equivalent) instrumentation at the entry point.
- Propagate traceparent (W3C headers) into every HTTP client call and into queue message attributes.
- On the consumer side, extract traceparent and open a child span.
- Fire one real request. Open the trace. Verify every hop appears.
- If any hop is missing, fix it and repeat.
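To make the propagation steps above concrete, here is a hand-rolled sketch of what a traceparent value carries and what "open a child span" means at the header level. It uses only the standard library and is not the OpenTelemetry API; in a real service the SDK's propagators would do this for you. The function names are hypothetical, and only version 00 of the header is handled.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the entry point."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(incoming: str) -> str:
    """Keep the trace id and flags, mint a new span id for this hop."""
    match = _TRACEPARENT.match(incoming)
    if match is None:
        raise ValueError(f"not a version-00 traceparent: {incoming!r}")
    trace_id, parent_span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        raise ValueError("all-zero trace id or span id is invalid")
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    entry = new_traceparent()
    # Producer side: the same value goes into HTTP headers and message attributes.
    http_headers = {"traceparent": entry}
    message_attributes = {"traceparent": entry}
    # Consumer side: extract, then continue the same trace under a new span id.
    consumer = child_traceparent(message_attributes["traceparent"])
    print(entry)
    print(consumer)
```

If a hop is missing from the trace, the usual cause at this level is that the header or attribute was never written on the producer side, or that the consumer started a fresh trace id instead of reusing the incoming one.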
Repeat until: The full critical-path trace comes back correct on a fresh request in under 30 minutes.
Kata 3: STRIDE on the Capstone
Time limit: 45 minutes
Goal: One STRIDE pass over the highest-value trust-boundary flow with at least one gap walked to a deployed mitigation.
Setup: Your DFD and a blank six-row STRIDE table for one flow.
- Draw or refresh the DFD; mark trust boundaries.
- Pick the flow that crosses the highest-value boundary.
- Fill a six-row STRIDE table. Each cell: mitigated / gap / accepted, with evidence.
- Pick the most valuable gap.
- Walk it end-to-end: threat -> evidence -> mitigation -> detection -> residual -> deployed.
- Commit the worksheet.
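One way to keep the six-row table honest before committing it is a small completeness check. The sketch below assumes a worksheet is a list of rows with a category, a status, and an evidence string; that shape and the names Row and worksheet_problems are illustrative assumptions, not part of the course template.

```python
from dataclasses import dataclass

STRIDE = (
    "Spoofing",
    "Tampering",
    "Repudiation",
    "Information disclosure",
    "Denial of service",
    "Elevation of privilege",
)
STATUSES = {"mitigated", "gap", "accepted"}


@dataclass(frozen=True)
class Row:
    category: str
    status: str
    evidence: str


def worksheet_problems(rows: list[Row]) -> list[str]:
    """Return human-readable problems; an empty list means the table is complete."""
    problems = []
    seen = [row.category for row in rows]
    for category in STRIDE:
        count = seen.count(category)
        if count != 1:
            problems.append(f"{category}: expected exactly one row, found {count}")
    for row in rows:
        if row.category not in STRIDE:
            problems.append(f"{row.category}: not a STRIDE category")
        if row.status not in STATUSES:
            problems.append(f"{row.category}: unknown status {row.status!r}")
        if not row.evidence.strip():
            problems.append(f"{row.category}: status {row.status!r} has no evidence")
    return problems
```

A check like this only confirms that every cell is filled in; whether the evidence is credible still needs a human reviewer.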
Repeat until: You can produce a credible STRIDE pass on any flow in under 45 minutes, with at least one walk completed.
Kata 4: Write a 1-Page Runbook for a Known Incident
Time limit: 25 minutes
Goal: A runbook for a top-three failure that a peer says they could follow alone.
Setup: One of your top-three failures from library/raw/top-failures.md.
- Open the runbook template (Cluster 5 Concept 13).
- Fill Symptoms -- specific alert names, dashboard signals, user-report phrasing.
- Fill Impact -- in one sentence, tied to the SLO.
- Fill Checks -- 3-6 ordered diagnostic steps.
- Fill Mitigations -- each with an expected effect and a rollback.
- Fill Escalation -- when to post, when to declare, who is the fallback.
- Commit to library/raw/runbooks/<slug>.md. Ask a peer to read only the runbook and say whether they could follow it alone.
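Before handing the runbook to a peer, a quick structural lint can catch an empty section or a Checks list outside the 3-6 step range. The sketch below assumes the runbook is markdown with one heading per section and numbered steps ("1. ...") under Checks; the actual template from Cluster 5 Concept 13 may be laid out differently, and the function names are hypothetical.

```python
import re

REQUIRED_SECTIONS = ("Symptoms", "Impact", "Checks", "Mitigations", "Escalation")


def split_sections(text: str) -> dict[str, list[str]]:
    """Map each markdown heading to the non-blank lines that follow it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        heading = re.match(r"^#+\s+(.*\S)\s*$", line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


def runbook_problems(text: str) -> list[str]:
    """Return structural problems; an empty list means the skeleton is complete."""
    sections = split_sections(text)
    problems = [
        f"missing or empty section: {name}"
        for name in REQUIRED_SECTIONS
        if not sections.get(name)
    ]
    steps = [l for l in sections.get("Checks", []) if re.match(r"^\d+\.\s", l)]
    if not 3 <= len(steps) <= 6:
        problems.append(f"Checks should have 3-6 numbered steps, found {len(steps)}")
    return problems
```

The lint says nothing about whether the steps are followable; the peer read remains the real test.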
Repeat until: A peer can say "yes, I could follow this cold" on a first read.
Completion Standard
- Each kata completed within its time limit at least twice
- Each kata produced a concrete artifact in the capstone repo
- You no longer need to look up thresholds, templates, or step numbers for any kata
- A trusted peer has validated the runbook and the SLO document
- All four katas feed directly into PRR items in library/raw/prr.md