
Module 4: Scale, Reliability & Performance: Mistake Clinic

This clinic turns wrong moves into reusable judgment. Use it after each practice page and again before the quiz or checkpoint.


Module-Specific Mistake Radar

Start with these traps. Replace or extend them with real mistakes from your own work.

| Mistake to look for | Where it shows up | Symptom | Repair evidence |
| --- | --- | --- | --- |
| Finishing Performance Profiling Lab with only a final answer | Performance Profiling Lab | The work has no failed case, trace, test, proof gap, or design stress point. | Add the smallest broken example and show the repair that changes the result. |
| Finishing Scaling Design Workshop with only a final answer | Scaling Design Workshop | The work has no failed case, trace, test, proof gap, or design stress point. | Add the smallest broken example and show the repair that changes the result. |
| Finishing Reliability and SLO Clinic with only a final answer | Reliability and SLO Clinic | The work has no failed case, trace, test, proof gap, or design stress point. | Add the smallest broken example and show the repair that changes the result. |
| Finishing Scale, Reliability, and Performance Katas with only a final answer | Scale, Reliability, and Performance Katas | The work has no failed case, trace, test, proof gap, or design stress point. | Add the smallest broken example and show the repair that changes the result. |
| Treating Latency, Throughput, Utilization, and the USE / RED / Four Golden Signals as vocabulary instead of a tool | Latency, Throughput, Utilization, and the USE / RED / Four Golden Signals | The explanation names the concept but cannot decide between two cases. | Write one example, one non-example, and the rule that separates them. |
| Treating Percentile Latency and Why Averages Lie as vocabulary instead of a tool | Percentile Latency and Why Averages Lie | The explanation names the concept but cannot decide between two cases. | Write one example, one non-example, and the rule that separates them. |

Practice Mistake Checks

Pull any miss from these checks into your mistake log.

Performance Profiling Lab

Source: practice/01-performance-profiling-lab.md

For each statement, identify the error:

  1. "Our average response time is 50ms, so users are happy."
  2. "We added 10 more CPUs and throughput only went up 2x - the load balancer must be broken."
  3. "CPU is at 60% so we have 40% headroom."
  4. "p95 of p99 across our ten servers was 200ms."
  5. "At 95% CPU utilization we're making maximum use of the machine."
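A few of the statements above are easier to judge with numbers in hand. The sketch below is not an answer key. It uses a made-up latency sample, Amdahl's law, and an M/M/1 queue (a modeling assumption, not a measurement of any real system) to show how an average can hide a tail, how added CPUs can yield less than proportional speedup, and how response time grows as utilization approaches 100%. All function names and sample values are hypothetical.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: upper bound on speedup when only part of the work parallelizes."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / workers)


def mm1_response_time(service_time, utilization):
    """Mean time in system for an M/M/1 queue (assumed model, not a measurement)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1 - utilization)


# Made-up sample: 98 fast requests and 2 very slow ones.
latencies_ms = [10] * 98 + [2000] * 2

print("mean ms:", statistics.mean(latencies_ms))
print("p50 ms:", percentile(latencies_ms, 50))
print("p99 ms:", percentile(latencies_ms, 99))

# Assumed workload where only half of the work can run in parallel.
print("speedup on 10 workers:", round(amdahl_speedup(0.5, 10), 2))

# Response time as a multiple of service time at two utilization levels.
for rho in (0.60, 0.95):
    print(f"utilization {rho:.0%}:", round(mm1_response_time(1.0, rho), 1), "x service time")
```

In this sample the mean is 49.8 ms while the p99 is 2000 ms, so a claim about the average says little about the slowest requests. Under the M/M/1 assumption, going from 60% to 95% utilization takes mean response time from roughly 2.5 to roughly 20 times the service time, which is one way to test whether unused CPU really counts as headroom.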

Scaling Design Workshop

Source: practice/02-scaling-design-workshop.md

For each, identify the error:

  1. "We made the service horizontally scalable by adding a load balancer."
  2. "Sticky sessions are fine as long as the load balancer is smart."
  3. "Write-behind is safe because we eventually write to the DB."
  4. "We have a CDN, so we don't need any other caching."
  5. "The cache was slow so we doubled its memory."
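One way to examine the cache statements above is to separate hit ratio from per-hit latency. The sketch below assumes a simple two-level model where every request is either a cache hit or a backend miss. The function name and all latency figures are hypothetical and chosen only for illustration.

```python
def effective_latency_ms(hit_ratio, hit_ms, miss_ms):
    """Expected latency per request when each request is either a hit or a miss."""
    if not 0 <= hit_ratio <= 1:
        raise ValueError("hit_ratio must be in [0, 1]")
    return hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms


# Case A (assumed): hits are fast, so a higher hit ratio helps a lot.
fast_hits_before = effective_latency_ms(0.80, hit_ms=1, miss_ms=50)
fast_hits_after = effective_latency_ms(0.95, hit_ms=1, miss_ms=50)

# Case B (assumed): hits themselves are slow, so a higher hit ratio barely helps.
slow_hits_before = effective_latency_ms(0.95, hit_ms=20, miss_ms=50)
slow_hits_after = effective_latency_ms(0.99, hit_ms=20, miss_ms=50)

print("fast hits:", round(fast_hits_before, 2), "->", round(fast_hits_after, 2))
print("slow hits:", round(slow_hits_before, 2), "->", round(slow_hits_after, 2))
```

More memory can only move the hit ratio. If the measured problem is the latency of a hit, as in case B, the model suggests looking elsewhere before resizing the cache.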

Reliability and SLO Clinic

Source: practice/03-reliability-and-slo-clinic.md

For each, identify the error:

  1. "Our SLO is 99.999% because that's what AWS offers."
  2. "Availability was 99.93% this month, so we're within the 99.9% SLO."
  3. "We have redundancy across three servers, so correlated failure is impossible."
  4. "Chaos engineering is just deliberately breaking things in production."
  5. "The dashboard is green so nothing is wrong."
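The SLO statements above are easier to challenge once the targets are converted into minutes. The sketch below assumes a 30-day window and time-based availability. It also shows the textbook formula for redundant replicas, which holds only if failures are independent. The function names are hypothetical, and this is an aid for the exercise, not an answer key.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for a time-based availability SLO."""
    window_minutes = window_days * 24 * 60
    return (100 - slo_percent) / 100 * window_minutes


def budget_consumed(measured_percent, slo_percent):
    """Fraction of the error budget used up by the measured availability."""
    return (100 - measured_percent) / (100 - slo_percent)


def parallel_availability(single, replicas):
    """Availability of N replicas, ASSUMING failures are fully independent."""
    return 1 - (1 - single) ** replicas


for slo in (99.9, 99.999):
    print(f"{slo}% over 30 days allows", round(error_budget_minutes(slo), 3), "minutes down")

print("budget used at 99.93% against 99.9%:", round(budget_consumed(99.93, 99.9), 2))

# The independence assumption is exactly what a shared rack, region,
# deploy, or dependency would violate.
print("three 99% replicas, if independent:", round(parallel_availability(0.99, 3), 6))
```

A 99.9% target over 30 days allows about 43.2 minutes of downtime, and 99.999% allows about 0.43 minutes, roughly 26 seconds. A measured 99.93% against a 99.9% target has already spent about 70% of the budget, which is worth stating alongside any claim of being "within" the SLO.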

Repair Protocol

For each real mistake:

  1. Reproduce the failure on the smallest example, trace, proof, query, command, or design sketch.
  2. Name the hidden assumption.
  3. Repair the artifact.
  4. Save evidence of what changed: a failing-then-passing test, corrected proof step, revised diagram, safer command, benchmark, or review note.
  5. Add one retrieval card beginning with "Check... before..." or "Do not use... when...".
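As an illustration of steps 1 to 4, here is a minimal sketch built around a hypothetical mistake: computing a fleet-wide p99 by averaging each server's p99. The functions, server names, and sample data are invented for the example. The hidden assumption being named is that percentiles can be averaged the way means can.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def fleet_p99_buggy(per_server_samples):
    """Hypothetical original artifact: averages the per-server p99 values."""
    p99s = [percentile(samples, 99) for samples in per_server_samples]
    return sum(p99s) / len(p99s)


def fleet_p99_repaired(per_server_samples):
    """Repair: pool the raw samples, then take one percentile over all of them."""
    pooled = [x for samples in per_server_samples for x in samples]
    return percentile(pooled, 99)


def check_fleet_p99(fn):
    """Step 1: smallest reproduction, two servers with one slow tail."""
    quiet = [10] * 100                # every request takes 10 ms
    noisy = [10] * 90 + [1000] * 10   # 10% of requests take 1000 ms
    # 10 of the 200 pooled requests (5%) are slow, so the true p99 is 1000 ms.
    return fn([quiet, noisy]) == 1000


# Step 4: evidence that changed, a check that fails and then passes.
print("before repair:", check_fleet_p99(fleet_p99_buggy))
print("after repair:", check_fleet_p99(fleet_p99_repaired))
```

A retrieval card for this example might read: "Check whether raw samples or histograms are available before combining percentiles across servers."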

Mistake Log

| Date | Mistake | Symptom | Root cause | Repair evidence | Retrieval card |
| --- | --- | --- | --- | --- | --- |
| Starter | Pick one radar row above | Explain how it would fail in this module | Name the assumption | Add a counterexample or corrected artifact | Write the card before closing the page |

Completion Standard

  • At least five real mistakes are logged.
  • At least two mistakes include a counterexample or failing test.
  • At least one mistake connects to an older semester skill.
  • At least one correction changes code, a proof, a diagram, a command transcript, a query, or a design decision.