Module 4: Scale, Reliability & Performance: Mistake Clinic
This clinic turns wrong moves into reusable judgment. Use it after each practice page and again before the quiz or checkpoint.
Module-Specific Mistake Radar
Start with these traps. Replace or extend them with real mistakes from your own work.
| Mistake to look for | Where it shows up | Symptom | Repair evidence |
|---|---|---|---|
| Finishing Performance Profiling Lab with only a final answer | Performance Profiling Lab | The work has no failed case, trace, test, proof gap, or design stress point. | Add the smallest broken example and show the repair that changes the result. |
| Finishing Scaling Design Workshop with only a final answer | Scaling Design Workshop | The work has no failed case, trace, test, proof gap, or design stress point. | Add the smallest broken example and show the repair that changes the result. |
| Finishing Reliability and SLO Clinic with only a final answer | Reliability and SLO Clinic | The work has no failed case, trace, test, proof gap, or design stress point. | Add the smallest broken example and show the repair that changes the result. |
| Finishing Scale, Reliability, and Performance Katas with only a final answer | Scale, Reliability, and Performance Katas | The work has no failed case, trace, test, proof gap, or design stress point. | Add the smallest broken example and show the repair that changes the result. |
| Treating Latency, Throughput, Utilization, and the USE / RED / Four Golden Signals as vocabulary instead of a tool | Latency, Throughput, Utilization, and the USE / RED / Four Golden Signals | The explanation names the concept but cannot decide between two cases. | Write one example, one non-example, and the rule that separates them. |
| Treating Percentile Latency and Why Averages Lie as vocabulary instead of a tool | Percentile Latency and Why Averages Lie | The explanation names the concept but cannot decide between two cases. | Write one example, one non-example, and the rule that separates them. |
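
One way to turn the last radar row into an example and a non-example is a tiny numeric case. The sketch below uses made-up latency samples and a nearest-rank percentile helper; the name `percentile` and the sample values are assumptions for illustration, not data from the course labs.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [1500] * 2

mean_ms = statistics.mean(latencies_ms)   # pulled toward 50 ms by two outliers
p50_ms = percentile(latencies_ms, 50)     # what a typical request sees
p99_ms = percentile(latencies_ms, 99)     # what the slowest 1-2% see

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

With these samples the mean is close to 50 ms, yet no request actually took 50 ms: most took 20 ms and a few took 1500 ms. A rule that separates the two cases could read "report a percentile whenever the question is about what individual users experience."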
Practice Mistake Checks
Pull any miss from these checks into your mistake log.
Performance Profiling Lab
Source: practice/01-performance-profiling-lab.md
For each statement, identify the error:
- "Our average response time is 50ms, so users are happy."
- "We added 10 more CPUs and throughput only went up 2x - the load balancer must be broken."
- "CPU is at 60% so we have 40% headroom."
- "p95 of p99 across our ten servers was 200ms."
- "At 95% CPU utilization we're making maximum use of the machine."
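
Two textbook models may help when probing the CPU and utilization statements above. This is a minimal sketch, assuming Amdahl's law for the speedup question and an M/M/1 queue for the headroom question; real systems rarely match either model exactly, and the function names and input values are illustrative.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup when only part of the work can use extra workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def mm1_response_time(service_time_ms, utilization):
    """Mean response time of an M/M/1 queue (an assumed model, not a measurement)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


# Hypothetical workload: half of each request is serial work.
print(amdahl_speedup(parallel_fraction=0.5, workers=10))

# Hypothetical 10 ms service time at two utilization levels.
print(mm1_response_time(10, 0.60))
print(mm1_response_time(10, 0.95))
```

Under these assumptions, ten workers give a speedup below 2x because the serial half never shrinks, and mean response time grows from about 25 ms at 60% utilization to about 200 ms at 95%. Utilization that looks like spare capacity can already carry a latency cost.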
Scaling Design Workshop
Source: practice/02-scaling-design-workshop.md
For each statement, identify the error:

- "We made the service horizontally scalable by adding a load balancer."
- "Sticky sessions are fine as long as the load balancer is smart."
- "Write-behind is safe because we eventually write to the DB."
- "We have a CDN, so we don't need any other caching."
- "The cache was slow so we doubled its memory."
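
For the cache statements, it can help to separate hit ratio from hit latency. The sketch below uses the standard weighted-average access time formula; the function name and all numbers are hypothetical.

```python
def effective_latency_ms(hit_ratio, hit_ms, miss_ms):
    """Average access time: hits served by the cache, misses by the backing store."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


# Case 1: fast cache (1 ms hits). More memory raises the hit ratio and helps a lot.
fast_before = effective_latency_ms(0.80, hit_ms=1, miss_ms=50)
fast_after = effective_latency_ms(0.95, hit_ms=1, miss_ms=50)

# Case 2: the cache itself is slow (30 ms hits). The same hit-ratio gain barely helps.
slow_before = effective_latency_ms(0.80, hit_ms=30, miss_ms=50)
slow_after = effective_latency_ms(0.95, hit_ms=30, miss_ms=50)

print(fast_before, fast_after)
print(slow_before, slow_after)
```

In this sketch, adding memory changes only `hit_ratio`. If the symptom is slow hits (network hops, serialization, an overloaded cache node), the second case suggests that extra memory leaves most of the latency in place.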
Reliability and SLO Clinic
Source: practice/03-reliability-and-slo-clinic.md
For each statement, identify the error:
- "Our SLO is 99.999% because that's what AWS offers."
- "Availability was 99.93% this month, so we're within the 99.9% SLO."
- "We have redundancy across three servers, so correlated failure is impossible."
- "Chaos engineering is just deliberately breaking things in production."
- "The dashboard is green so nothing is wrong."
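
Several of the statements above become easier to question once the percentages are converted into minutes and budgets. This is a minimal sketch of that arithmetic; the function names, the 30-day window, and the 99% per-server availability are assumptions for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(slo, measured_availability):
    """Fraction of the error budget used (1.0 means the budget is exhausted)."""
    return (1.0 - measured_availability) / (1.0 - slo)


def independent_availability(per_server, replicas):
    """Availability of N replicas if failures were fully independent (a strong assumption)."""
    return 1.0 - (1.0 - per_server) ** replicas


print(error_budget_minutes(0.999))                    # 30-day budget for 99.9%
print(budget_consumed(0.999, 0.9993))                 # share of budget already spent
print(error_budget_minutes(0.99999, window_days=365)) # yearly budget for 99.999%
print(independent_availability(0.99, 3))              # only valid without correlation
```

Under these assumptions, a 99.9% SLO allows about 43.2 minutes of downtime in 30 days, 99.93% measured availability has already used about 70% of that budget, and 99.999% leaves roughly 5.3 minutes per year. The three-replica figure holds only while failures are independent; a shared rack, deploy, or dependency breaks that assumption.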
Repair Protocol
For each real mistake:
- Reproduce the failure on the smallest example, trace, proof, query, command, or design sketch.
- Name the hidden assumption.
- Repair the artifact.
- Save evidence of what changed: a failing-then-passing test, a corrected proof step, a revised diagram, a safer command, a benchmark, or a review note.
- Add one retrieval card beginning with "Check ... before ..." or "Do not use ... when ...".
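
The protocol can be sketched on one of the practice statements: combining per-server percentiles. The example below reproduces the failure on a small, made-up data set, then repairs it by pooling raw samples. The server names, sample values, and helper names are hypothetical.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms. Server A is busy and healthy; server B is small and slow.
server_a = [10] * 900
server_b = [10] * 80 + [1000] * 20


def fleet_p99_naive(servers):
    """Broken aggregation: averages each server's p99, ignoring request counts."""
    return statistics.mean(percentile(s, 99) for s in servers)


def fleet_p99_repaired(servers):
    """Repair: pool the raw samples, then take one percentile over all requests."""
    pooled = [x for s in servers for x in s]
    return percentile(pooled, 99)


naive = fleet_p99_naive([server_a, server_b])
repaired = fleet_p99_repaired([server_a, server_b])
print(f"naive={naive} ms, repaired={repaired} ms")
```

Here the hidden assumption is that percentiles can be averaged like means. The naive value of 505 ms matches neither server and not the fleet, while the pooled p99 is 1000 ms. A matching retrieval card might read: "Check that percentiles are computed from pooled samples or histograms before comparing them across servers."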
Mistake Log
| Date | Mistake | Symptom | Root cause | Repair evidence | Retrieval card |
|---|---|---|---|---|---|
| Starter | Pick one radar row above | Explain how it would fail in this module | Name the assumption | Add a counterexample or corrected artifact | Write the card before closing the page |
Completion Standard
- At least five real mistakes are logged.
- At least two mistakes include a counterexample or failing test.
- At least one mistake connects to a skill from an earlier semester.
- At least one correction changes code, a proof, a diagram, a command transcript, a query, or a design decision.