Chaos Engineering and Game Days

What This Concept Is

Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. The Principles of Chaos site states five principles:

  1. Build a hypothesis around steady-state behavior.
  2. Vary real-world events (server kills, network latency, dependency failures).
  3. Run experiments in production (or a carefully chosen subset).
  4. Automate experiments to run continuously.
  5. Minimize blast radius - always have a kill switch.

A game day is the organized, scheduled, human-in-the-loop version: a team plans a failure scenario, announces it, injects it, observes response, and debriefs. The deliverable is not a fix; it is evidence about whether the team, the tooling, and the system actually behave as designed when something goes wrong.

"We tested it in staging" is not an answer. Staging lacks the traffic, the data skew, the dependency diversity, and the emotional conditions of a real incident.

Why It Matters Here

You do not discover cascading failures by reading the code. You do not discover gray failures by looking at the dashboard. The only way to know a system withstands a class of failure is to cause that failure and observe the result - deliberately, with guardrails.

This is also the only honest way to validate an SLO. If you claim 99.99% availability because you have three AZs and a circuit breaker, chaos is the experiment that tells you whether the circuit breaker actually engages on the failure modes that happen. Without experiments, the SLO is an aspiration.
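To see how little room a 99.99% claim leaves, it helps to work out the downtime it permits. The sketch below is only the arithmetic, assuming a 30-day window and treating every minute as fully up or fully down; the function name and the window length are illustrative assumptions, not part of any SLO tool.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability an availability SLO permits per window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


if __name__ == "__main__":
    # Compare a three-nines and a four-nines target over the same window.
    for slo in (0.999, 0.9999):
        budget = error_budget_minutes(slo)
        print(f"{slo:.2%} over 30 days allows about {budget:.2f} minutes down")
```

Under these assumptions, 99.99% over 30 days allows roughly 4.3 minutes of downtime, so a mitigation that takes even a few minutes to engage can consume the whole budget in one incident. That is the kind of claim an experiment can confirm or refute.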

Netflix's Chaos Monkey, AWS's Fault Injection Simulator, and Gremlin productized these practices. The core discipline is older and does not require any tool; you can run a tabletop game day on a whiteboard in an afternoon.

Concrete Example

A team claims their payment service tolerates a downstream fraud-check outage by failing open (approving the payment if fraud check is unavailable).

Hypothesis (steady state): With fraud-check unreachable, request success rate stays >= 99.9% and p99 latency stays <= 500ms.

Experiment: In production, block 100% of egress traffic to the fraud-check service for 10 minutes.

Blast radius controls:

  • Run during a low-traffic window with the full team available (Tuesday 10am local, not Friday 6pm).
  • Confined to one region; other regions remain normal.
  • Kill switch: a single command restores egress.
  • Abort conditions: success rate < 99% for 1 minute, or p99 > 2000ms for 1 minute.
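The abort conditions above are easier to trust when they are evaluated mechanically rather than by someone watching a dashboard. The following is a minimal sketch of that check, assuming metrics arrive as periodic samples; the `Sample` type, the function name, and the sampling interval are hypothetical, and a real setup would read from whatever monitoring system the team already uses.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Sample:
    t: float             # seconds since injection started
    success_rate: float  # fraction of requests that succeeded, 0.0 to 1.0
    p99_ms: float        # p99 latency in milliseconds


def first_abort_time(
    samples: Iterable[Sample],
    min_success: float = 0.99,
    max_p99_ms: float = 2000.0,
    sustain_s: float = 60.0,
) -> Optional[float]:
    """Return the sample time at which a breach has lasted sustain_s, else None."""
    bad_success_since: Optional[float] = None
    bad_latency_since: Optional[float] = None
    for s in samples:
        # Remember when each continuous breach began; reset on recovery.
        if s.success_rate < min_success:
            if bad_success_since is None:
                bad_success_since = s.t
        else:
            bad_success_since = None
        if s.p99_ms > max_p99_ms:
            if bad_latency_since is None:
                bad_latency_since = s.t
        else:
            bad_latency_since = None
        # Abort as soon as either breach has been sustained long enough.
        for since in (bad_success_since, bad_latency_since):
            if since is not None and s.t - since >= sustain_s:
                return s.t
    return None


if __name__ == "__main__":
    # Success rate drops to 98% thirty seconds into the injection and stays there.
    run = [Sample(float(t), 0.98 if t >= 30 else 0.999, 300.0) for t in range(0, 200, 10)]
    print("abort at t =", first_abort_time(run))
```

In a real run the returned time would be the moment to trigger the kill switch. Requiring the breach to be sustained keeps a single noisy sample from ending the experiment early.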

Observations to collect: request success rate during injection, p99 latency, fraud-allow-rate change, dashboard lag time, whether any alert fired.

Possible outcomes:

  1. Success. Circuit breaker opened within 30 seconds, fraud-check calls returned default "allow," SLO held. Team confirms the claim.
  2. Failure. Calls piled up waiting for fraud-check timeout (which was 5s, not 500ms), thread pool exhausted, service went down. Team learns: timeout was wrong, circuit breaker never engaged. Fix, re-run.

Either way, the team now knows. Before the experiment, they were guessing.
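For readers who have not seen the mechanism under test, here is a minimal sketch of a fail-open circuit breaker. It is an illustration under stated assumptions, not the team's implementation: the class name, thresholds, and injectable clock are all hypothetical, and the `check` callable is assumed to enforce its own timeout, which is exactly the setting that was wrong in the failure outcome.

```python
import time
from typing import Callable, Optional


class FailOpenBreaker:
    """Tiny circuit breaker that returns a default decision when the dependency is unhealthy."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so tests can control time
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def call(self, check: Callable[[], bool], default: bool = True) -> bool:
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after_s:
                return default  # open: skip the dependency entirely
            # Half-open: allow one trial call; a single failure reopens the breaker.
            self.opened_at = None
            self.consecutive_failures = self.failure_threshold - 1
        try:
            result = check()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = now
            return default  # fail open: approve when the check is unavailable
        self.consecutive_failures = 0
        return result


if __name__ == "__main__":
    def fraud_check_down() -> bool:
        raise TimeoutError("fraud-check unreachable")

    breaker = FailOpenBreaker(failure_threshold=3)
    print([breaker.call(fraud_check_down) for _ in range(5)])
```

Note what the sketch does not cover: if `check` blocks for 5 seconds before raising, callers still pile up during the first few failures. The breaker only helps once the timeout is short enough for failures to be counted quickly, which is the interaction the experiment exposes.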

Common Confusion / Misconception

"Chaos engineering means randomly breaking things." It does not. The Principles of Chaos are explicit: hypothesis-driven, bounded, and with a kill switch. Breaking things randomly without a hypothesis is vandalism, not engineering.

"We can't run chaos in production - it's too risky." You already run chaos in production; users just call it an outage. Chaos engineering runs a smaller version of that outage on a schedule so the team can prepare, control the blast radius, and learn. The question is not whether production will fail - it will - but whether you want the first experience to be an emergency or a rehearsal.

"Game days are too expensive." A 2-hour game day that exposes one cascading-failure bug can save many hours of real outage. The ROI is easier to defend than most reliability investments.

How To Use It

Start small, then expand.

  1. Tabletop first. Before touching production, run a whiteboard game day. The team walks through "dependency X is down; what happens, who pages, who mitigates?" You will find gaps without injecting anything.
  2. Write the experiment down. Hypothesis (what steady state looks like), injection (what you change), blast-radius controls (where it applies), abort conditions (when to stop), evidence (what you will measure).
  3. Get explicit approval from product, leadership, on-call, and the owners of affected dependencies. Game days fail when someone is surprised.
  4. Run it, observe, debrief. Debrief is not optional - the point is learning.
  5. Automate the experiments that have earned trust. A failed deploy, a killed pod, a degraded dependency - these can run continuously via Chaos Monkey-style tooling.
  6. Protect the user. Always have a kill switch. Always bound blast radius. Never run experiments that risk data loss without a clear recovery plan.
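The written experiment from step 2 can be kept as a small structured record so that a missing piece is obvious before anyone injects anything. The sketch below is one possible shape, not a standard format: the `ChaosExperiment` class and its field names are assumptions that mirror the items listed above, with the kill switch and approvals added from steps 3 and 6.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ChaosExperiment:
    hypothesis: str                   # what steady state looks like
    injection: str                    # what you change
    blast_radius_controls: List[str]  # where it applies
    abort_conditions: List[str]       # when to stop
    evidence: List[str]               # what you will measure
    kill_switch: str                  # how to undo the injection quickly
    approvals: List[str]              # who has explicitly signed off

    def missing_fields(self) -> List[str]:
        """Names of fields left empty; the experiment should not run until this is []."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    # The payment example from earlier, with approvals not yet collected.
    exp = ChaosExperiment(
        hypothesis="Success rate >= 99.9% and p99 <= 500ms with fraud-check unreachable",
        injection="Block 100% of egress to fraud-check for 10 minutes",
        blast_radius_controls=["one region", "low-traffic window"],
        abort_conditions=["success rate < 99% for 1 minute", "p99 > 2000ms for 1 minute"],
        evidence=["success rate", "p99 latency", "fraud-allow-rate", "alerts fired"],
        kill_switch="single command restores egress",
        approvals=[],
    )
    print("not ready, missing:", exp.missing_fields())
```

A check like this catches only omissions, not a weak hypothesis or an unrealistic abort threshold; those still need a human review before the run.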

Check Yourself

  1. Why is "we tested it in staging" insufficient evidence that a system survives a production failure?
  2. Name three things a chaos experiment must specify before it is run.
  3. What is a kill switch and why is a chaos experiment without one irresponsible?

Mini Drill or Application

Pick a service you rely on. Write a 1-page chaos experiment: hypothesis, injection, blast-radius controls, abort conditions, expected evidence. Then imagine running it: what do you think will happen, and what would surprise you? The surprise is usually where the real learning is.

Transfer / Where This Shows Up Later

Chaos is the validation layer for every reliability claim you make. Without it, the claims are theory.

  • This module, concept 07 (SLOs): your SLO is an aspiration until a chaos experiment has tested whether your mitigations actually hold under the failure modes you claim to survive.
  • This module, concept 08 (failure modes): chaos is the empirical tool that finds cascading/correlated/gray failures your dependency diagram missed.
  • This module, concept 14 (incident lifecycle): game days exercise the people parts of the lifecycle - paging chain, IC handoff, runbook clarity - that code-level tests cannot.
  • This module, concept 15 (postmortems): chaos failures get postmortems too, and they are among the cheapest reviews to run because the team was already set up to observe.
  • S8 M5 (leadership): chaos is an investment a VP must approve; framing the ROI (outage hours avoided vs game-day hours invested) is a leadership skill.
  • S9 M3 (Kubernetes): Litmus, Chaos Mesh, and Gremlin run container-native chaos; knowing which primitive (pod-kill, network-partition, resource-starvation) matches which experiment is part of the operator skillset.
  • S10 M4 (operational readiness): the readiness review explicitly asks for the last three chaos experiments and their outcomes. "We haven't run any" is a reject.
