Security / Observability Code Katas

Focused, repeatable exercises designed to build fluency across the module's two halves. Complete each kata several times until the setup feels automatic.

Kata 1: STRIDE on a 3-Service System

Time limit: 25 minutes
Goal: Produce a one-page threat model without reaching for notes
Setup: Start from this shape: client -> API -> worker -> database -> object store. Model it locally first, as a Docker Compose or kind setup with mocks; map it to a single cloud account only when provider-specific controls are the point.

Steps:

Draw the system (boxes, arrows, trust boundaries) in under 5 minutes.
List 3 assets.
Write one finding per STRIDE letter with a concrete attacker move.
Write one mitigation per finding, named and owned.
Declare which finding you would address first in a one-week budget and why.
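The steps above can be checked mechanically once the findings are written down. The following is a minimal sketch, not a prescribed format: the `Finding` record, the helper names, and the two sample findings are hypothetical and exist only to illustrate the "one finding per STRIDE letter, each with a named and owned mitigation" rule.

```python
from dataclasses import dataclass

# The six STRIDE categories, in their conventional order.
STRIDE = {
    "S": "Spoofing",
    "T": "Tampering",
    "R": "Repudiation",
    "I": "Information disclosure",
    "D": "Denial of service",
    "E": "Elevation of privilege",
}


@dataclass(frozen=True)
class Finding:
    """One threat-model finding (hypothetical record shape)."""
    letter: str         # one of the STRIDE keys
    attacker_move: str  # the concrete thing the attacker does
    mitigation: str     # the named control
    owner: str          # the team or person who owns the mitigation


def missing_letters(findings):
    """Return the STRIDE letters that have no finding yet, in STRIDE order."""
    covered = {f.letter for f in findings}
    return [letter for letter in STRIDE if letter not in covered]


def unowned(findings):
    """Return findings whose mitigation or owner is blank."""
    return [f for f in findings if not f.mitigation.strip() or not f.owner.strip()]


# Illustrative, made-up findings for the client -> API -> worker shape.
findings = [
    Finding("S", "Replays a stolen API token against the API",
            "Short-lived tokens with audience checks", "api-team"),
    Finding("T", "Edits a queued job payload between API and worker",
            "Signed job payloads", "worker-team"),
]

print("Still missing:", [STRIDE[l] for l in missing_letters(findings)])
print("Unowned findings:", unowned(findings))
```

A draft is complete for this kata when `missing_letters` and `unowned` both return empty lists.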

Repeat until: You can finish within the time limit and explain each mitigation in one sentence, from memory.

Kata 2: Rotation Workflow for a Database Credential

Time limit: 20 minutes
Goal: Produce an end-to-end rotation runbook for a Postgres credential used by one app
Setup: Assume a vault is available. You may choose static-with-rotation or dynamic secrets; justify the choice.

Steps:

Name the secret and its consumers.
Write the rotation steps: trigger, new value, distribution, verification, revocation, rollback, audit.
For each step, write the exact command or API call (pseudo-code is fine) that a peer would run.
Mark the reversible vs irreversible steps.
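The ordering of the rotation steps, and which of them are reversible, can be sketched with in-memory stand-ins. Everything here is an assumption for illustration: `vault` is a plain dict, `accepted` is a set standing in for the credentials the database currently accepts, and `verify` is a hypothetical callback that simulates the app logging in. A real Postgres role holds a single password, so the overlap window shown here is usually achieved with two alternating roles or with dynamic secrets.

```python
import secrets


def rotate(vault, accepted, secret_name, verify, audit):
    """Rotate one credential; return True on success, False after rollback.

    Secret values are never written to the audit trail, only step names.
    """
    old = vault[secret_name]
    audit.append(("trigger", secret_name))

    # New value: generated locally here; a real vault would issue it.
    new = secrets.token_urlsafe(32)

    # Distribution (reversible): old and new are both accepted for now.
    accepted.add(new)
    vault[secret_name] = new
    audit.append(("distributed", secret_name))

    # Verification: confirm a consumer can authenticate with the new value.
    if not verify(vault[secret_name]):
        # Rollback: restore the old value and withdraw the new one.
        vault[secret_name] = old
        accepted.discard(new)
        audit.append(("rolled_back", secret_name))
        return False

    # Revocation (irreversible): after this the old value cannot be restored.
    accepted.discard(old)
    audit.append(("revoked_old", secret_name))
    return True


# Hypothetical starting state for a secret named "db/app".
vault = {"db/app": "old-password"}
accepted = {"old-password"}
audit = []

ok = rotate(vault, accepted, "db/app",
            verify=lambda cred: cred in accepted, audit=audit)
print("rotated:", ok)
print("audit steps:", [step for step, _ in audit])
```

Verification deliberately sits between distribution and revocation, so every step before the final revocation can be undone.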

Repeat until: A peer can execute the runbook without asking you a question.

Kata 3: Instrument a Service Spec with OpenTelemetry

Time limit: 25 minutes
Goal: Write a minimal but production-credible instrumentation spec for one endpoint
Setup: Pick one HTTP endpoint (for example POST /checkout or GET /accounts/:id). Choose any language.

Steps:

Sketch the span tree for a typical request (root span + 2-3 child spans).
List attributes per span, using OpenTelemetry semantic conventions where they exist (http.request.method, http.response.status_code, db.system, etc.); use a service-prefixed name for anything custom.
Write 5-10 lines of code (real or pseudo) showing the root span creation, status handling on error, and one child span for an outbound call.
Declare the sampling strategy in one sentence.
Declare the correlation fields (trace_id, span_id) present in logs and metrics for the same request.
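The correlation step can be illustrated without any tracing library. This sketch uses only the Python standard library and does not call the OpenTelemetry SDK: it parses a W3C `traceparent` header in its version-00 layout, starts a new span ID for the current service, and stamps both IDs onto every log line. The helper names, the logger name, and the sample header value are assumptions for illustration; in a real service the OpenTelemetry SDK would manage this context for you.

```python
import io
import logging
import re
import secrets

# W3C Trace Context "traceparent", version-00 layout:
# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return trace context from a traceparent header, or None if invalid."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the spec.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


class CorrelationFilter(logging.Filter):
    """Attach trace_id and span_id to every record passing through the logger."""

    def __init__(self, ctx):
        super().__init__()
        self.ctx = ctx

    def filter(self, record):
        record.trace_id = self.ctx["trace_id"]
        record.span_id = self.ctx["span_id"]
        return True


# Sample incoming header (illustrative value).
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)

# This service's own span: same trace, fresh 8-byte span ID.
ctx = {"trace_id": parsed["trace_id"], "span_id": secrets.token_hex(8)}

stream = io.StringIO()  # captured here; a real service would log to stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))

logger = logging.getLogger("checkout")  # hypothetical service logger
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.addFilter(CorrelationFilter(ctx))

logger.info("payment authorized")
print(stream.getvalue(), end="")
```

With the same `trace_id` on the span and on the log line, a log search can jump straight to the trace and back. On metrics, trace IDs are normally attached as exemplars rather than labels, to avoid unbounded label cardinality.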

Repeat until: You can produce the snippet in under 25 minutes without looking up attribute names for the common HTTP path.

Kata 4: SLO + Alert Spec

Time limit: 20 minutes
Goal: Produce a defensible SLO and one paging alert backed by it
Setup: Pick one user-visible operation (checkout, signup, search).

Steps:

Define the SLI (Service Level Indicator): exact numerator and denominator, e.g. count(requests where status_class == "2xx") / count(requests).
Define the SLO: target (e.g. 99.5%) and window (e.g. rolling 28 days).
Define the paging alert: condition, time window, severity, runbook filename. Confirm it fires on a symptom, not a cause.
Define one diagnostic (non-paging) alert for the same area, capturing a useful cause signal.
Name the biggest class of false positives and how you will suppress it.
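One way to make the paging condition concrete is a burn-rate check over two windows. This is a sketch under stated assumptions, not the only valid alert design: the 14.4 threshold is the value commonly cited from the Google SRE Workbook for a fast-burn page against a 30-day window, and the `min_requests` guard is a hypothetical way to suppress low-traffic false positives. Tune both for your own SLO window and traffic.

```python
def sli(good, total):
    """Availability SLI: good requests / all requests (1.0 when idle)."""
    return 1.0 if total == 0 else good / total


def burn_rate(good, total, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target
    return (1.0 - sli(good, total)) / error_budget


def should_page(long_window, short_window, slo_target,
                threshold=14.4, min_requests=100):
    """Page only if both windows burn fast and the long window has enough traffic.

    Each window is a (good, total) pair. The short window confirms the
    problem is still happening; the traffic floor suppresses pages caused
    by a handful of failures during quiet periods.
    """
    _, long_total = long_window
    if long_total < min_requests:
        return False
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


SLO_TARGET = 0.995  # 99.5%, matching the example target above

# Illustrative counts: (good, total) over the last hour and last 5 minutes.
incident = should_page((9200, 10000), (730, 800), SLO_TARGET)
healthy = should_page((9990, 10000), (799, 800), SLO_TARGET)
quiet_night = should_page((2, 5), (0, 1), SLO_TARGET)

print("incident pages:", incident)
print("healthy pages:", healthy)
print("quiet night pages:", quiet_night)
```

The alert fires on a symptom, the user-visible error ratio, and the inputs are exactly the SLI numerator and denominator.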

Repeat until: You can produce the full spec in under 20 minutes and justify each choice in one sentence.

Kata 5: Runbook for a Real Failure Mode

Time limit: 25 minutes
Goal: Write a one-page runbook that a tired teammate can execute
Setup: Choose one of these failure modes:

payment provider returns 5xx on 10% of calls for 15 minutes
dependency cache starts missing 80% of the time after a deploy
a scheduled batch job fails silently and does not produce output
a new container image fails admission due to a missing signature
a CWPP alert fires for a shell spawned inside a production container

Steps (use the 5-part template from Concept 15):

Trigger
Immediate verification (under 60 seconds)
Impact
Diagnostic steps (decision tree)
Mitigations (reversible first) and rollback

Then peer-test: hand it to someone, have them read it in two minutes, and have them tell you what they would do first.

Repeat until: A peer can state the first move in one sentence after a two-minute read.

Completion Standard

Can complete each kata within its time limit on a cold start
Can explain every step out loud without reading the page
Can point to one external citation (OWASP, NIST, OpenTelemetry, SRE book, or cloud provider) that backs each kata's core claim, and can show local evidence unless paid cloud integration was justified
Can combine the five katas into one story: "here is the system, here is the threat model, here is the rotation, here is the instrumentation, here is the SLO and alert, here is the runbook"
