Module 4: Operational Readiness & Security Review: Case Studies
These case studies turn the capstone into something you can operate and defend under scrutiny.
Case Study 1: SLO For The Critical Path
Scenario: The capstone says "reliable" but has no measured user outcome.
Source anchor: The Service Level Objectives chapter of Google's SRE book frames SLOs as explicit targets on measured SLIs rather than vague reliability claims.
Module concepts: SLI, SLO, error budget, user journey.
Wrong Approach
Use uptime as the only reliability measure.
Better Approach
Define one capstone SLO:
SLI:
successful analysis requests / valid analysis requests
SLO:
99% over 7 days
Consequence:
pause feature work if the SLO is missed before the demo
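The SLI, SLO, and consequence above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming you can count successful and valid analysis requests for the 7-day window; the names `SloReport` and `evaluate_slo` are hypothetical, and the target is held as an exact fraction to avoid rounding surprises.

```python
import math
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class SloReport:
    sli: Fraction          # successful / valid requests in the window
    budget_total: int      # failed requests the target tolerates
    budget_remaining: int  # negative means the budget is overspent
    slo_met: bool


def evaluate_slo(successful: int, valid: int,
                 target: Fraction = Fraction(99, 100)) -> SloReport:
    """Evaluate one window (for example 7 days) of request counts."""
    if valid <= 0:
        raise ValueError("no valid requests in the window; the SLI is undefined")
    if not 0 <= successful <= valid:
        raise ValueError("successful must be between 0 and valid")
    failed = valid - successful
    # Error budget: how many failures fit inside the target for this window.
    budget_total = math.floor(valid * (1 - target))
    sli = Fraction(successful, valid)
    return SloReport(
        sli=sli,
        budget_total=budget_total,
        budget_remaining=budget_total - failed,
        slo_met=sli >= target,
    )


# Illustrative counts, not real data: 25 failures against a budget of 20.
report = evaluate_slo(successful=1975, valid=2000)
if not report.slo_met:
    print("SLO missed: apply the consequence from the policy")
```

A negative `budget_remaining` is the trigger for the consequence, which gives the policy a concrete number to defend in review.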
Failure Mode
The capstone claims reliability without a user-centered metric, so there is no trigger for operational action and nothing concrete to defend in review.
Project / Capstone Connection
Use this to choose the one success measure your capstone should monitor through the final demo period.
Tradeoff Table
| Option | Benefit | Cost | Better fit when |
|---|---|---|---|
| Uptime-only target | easy to collect | weak tie to user outcome | the system is pure infrastructure |
| User-path SLO | meaningful reliability evidence | needs clearer instrumentation | the capstone serves a concrete workflow |
Required Artifact
Write one SLI/SLO/error-budget policy for the capstone.
Case Study 2: Dashboard That Answers No Question
Scenario: The dashboard has CPU, memory, disk, and request count, but cannot answer "are users succeeding?"
Source anchor: Google SRE monitoring guidance emphasizes symptom-first monitoring: dashboards should answer user and dependency questions before they become metric galleries.
Module concepts: dashboard, symptom, golden signals, runbook.
Wrong Approach
Graph everything the platform exposes.
Better Approach
Answer three questions:
Can users complete the critical path?
Is the system slow?
What dependency is causing failures?
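One way to keep the dashboard question-led is to write the spec as data before building any panels. This is a sketch under stated assumptions: the `Panel` type, the `spec_problems` helper, and the signal and drill-down descriptions (including p95 latency) are illustrative choices, not part of any dashboard tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Panel:
    question: str    # the operator question this panel answers
    signal: str      # what is plotted (assumed metric descriptions)
    drill_down: str  # where to look next: logs or traces (placeholder text)


DASHBOARD = [
    Panel(
        question="Can users complete the critical path?",
        signal="successful analysis requests / valid analysis requests",
        drill_down="logs filtered to failed analysis requests",
    ),
    Panel(
        question="Is the system slow?",
        signal="p95 latency of analysis requests",
        drill_down="traces for the slowest requests",
    ),
    Panel(
        question="What dependency is causing failures?",
        signal="error rate per downstream dependency",
        drill_down="logs grouped by dependency name",
    ),
]


def spec_problems(panels: list[Panel]) -> list[str]:
    """Return a list of problems; an empty list means the spec is reviewable."""
    problems = []
    for i, panel in enumerate(panels, start=1):
        if not panel.question.strip().endswith("?"):
            problems.append(f"panel {i}: not phrased as a question")
        if not panel.drill_down.strip():
            problems.append(f"panel {i}: no link to logs or traces")
    return problems
```

Raw CPU, memory, and disk graphs can still exist elsewhere; the check only enforces that every panel on this dashboard earns its place by answering a question.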
Failure Mode
An incident starts, but the dashboard cannot distinguish user harm from infrastructure noise, so diagnosis stalls and the wrong fixes get attempted.
Project / Capstone Connection
Use this when deciding which panels belong on the one dashboard you will actually open during a capstone incident or defense.
Tradeoff Table
| Option | Benefit | Cost | Better fit when |
|---|---|---|---|
| Exhaustive dashboard | broad metric coverage | low signal and poor scan speed | expert operators need deep internals view |
| Question-led dashboard | faster diagnosis | some raw metrics move elsewhere | the capstone needs a reviewer-friendly ops story |
Required Artifact
Create a three-question dashboard spec with panels and links to logs/traces.
Case Study 3: STRIDE Review Finds A Real Mitigation
Scenario: The capstone lets users connect a GitHub token. A security section says "use HTTPS" and stops there.
Source anchor: The OWASP Threat Modeling Cheat Sheet gives a practical STRIDE structure for walking assets, boundaries, threats, and mitigations.
Module concepts: STRIDE, token handling, trust boundary, mitigation.
Wrong Approach
Treat security review as a checklist completed after implementation.
Better Approach
Walk one threat fully:
Threat:
information disclosure of GitHub token
Mitigation:
encrypt at rest, scoped token, short retention, redact logs
Evidence:
test log redaction and access policy
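The "redact logs" mitigation becomes evidence once a test proves a token never reaches log output. The sketch below uses Python's standard `logging.Filter`; the token pattern is an assumption based on commonly seen GitHub token prefixes and should be verified against current GitHub documentation, and the logger name and `capture_log` helper are hypothetical.

```python
import io
import logging
import re

# Assumption: tokens start with a prefix such as "ghp_" or "github_pat_".
TOKEN_PATTERN = re.compile(r"\b(?:gh[pousr]_|github_pat_)[A-Za-z0-9_]+")


class RedactTokens(logging.Filter):
    """Replace anything that looks like a GitHub token before it is written."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merges msg with its args
        record.msg = TOKEN_PATTERN.sub("[REDACTED]", message)
        record.args = ()  # args are already merged into msg
        return True


def capture_log(message: str, *args: object) -> str:
    """Log one message through the filter and return what was written."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.addFilter(RedactTokens())
    logger = logging.getLogger("capstone.redaction_demo")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        logger.info(message, *args)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()


# Mitigation test with a fake token, never a real credential.
fake_token = "ghp_" + "a" * 36
output = capture_log("connecting with token=%s", fake_token)
assert fake_token not in output
assert "[REDACTED]" in output
```

Pattern-based redaction is a backstop, not the primary control: it only catches tokens that match the assumed prefixes, so the code path should also avoid logging credentials in the first place.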
Failure Mode
The security section stays generic, missing the real trust boundary where sensitive credentials can leak or be over-privileged.
Project / Capstone Connection
Use this for any capstone flow that handles tokens, personal data, or privileged actions across a system boundary.
Tradeoff Table
| Option | Benefit | Cost | Better fit when |
|---|---|---|---|
| Generic checklist review | quick completion | shallow threat coverage | the feature has almost no sensitive data |
| STRIDE-driven review | targeted mitigations | more analysis effort | the capstone crosses meaningful trust boundaries |
Required Artifact
Write a STRIDE table and one mitigation test.
Case Study 4: Backup That Was Never Restored
Scenario: The database has automated backups, but no one has restored one. The final demo depends on that data.
Source anchor: Reliability practice treats recovery as a tested capability, so backup value is only proven when restore steps, timing, and validation have been exercised.
Module concepts: backup, restore, RPO, RTO, drill.
Wrong Approach
Assume backup equals recovery.
Better Approach
Drill:
take backup
restore to separate environment
run smoke test
record RTO/RPO
document failure points
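The timing part of the drill can be recorded as four timestamps and two subtractions. This is a minimal sketch, assuming you note when the backup was taken, the last write before the simulated loss, and when the restore started and passed its smoke test; `RestoreDrill` and `meets_targets` are hypothetical names, and the values computed are measurements to compare against your RTO and RPO targets.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreDrill:
    backup_taken: datetime       # timestamp of the backup that was restored
    last_good_write: datetime    # newest write before the simulated loss
    restore_started: datetime
    smoke_test_passed: datetime  # the restore counts only after validation

    @property
    def data_loss_window(self) -> timedelta:
        """Measured recovery point: writes made after the backup are lost."""
        return max(self.last_good_write - self.backup_taken, timedelta(0))

    @property
    def recovery_time(self) -> timedelta:
        """Measured recovery time: restore start to passing smoke test."""
        return self.smoke_test_passed - self.restore_started


def meets_targets(drill: RestoreDrill, rto: timedelta, rpo: timedelta) -> dict[str, bool]:
    """Compare the measured values with the stated objectives."""
    return {
        "rto_met": drill.recovery_time <= rto,
        "rpo_met": drill.data_loss_window <= rpo,
    }


# Illustrative timestamps, not real drill results.
drill = RestoreDrill(
    backup_taken=datetime(2024, 1, 10, 2, 0),
    last_good_write=datetime(2024, 1, 10, 2, 45),
    restore_started=datetime(2024, 1, 10, 9, 0),
    smoke_test_passed=datetime(2024, 1, 10, 9, 40),
)
result = meets_targets(drill, rto=timedelta(hours=1), rpo=timedelta(minutes=30))
```

In this example the restore finishes within the hour but 45 minutes of writes are lost, so the report would show the recovery-time target met and the recovery-point target missed.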
Failure Mode
The first restore attempt uncovers missing permissions, bad procedures, or invalid assumptions after data loss has already happened.
Project / Capstone Connection
Use this if your capstone stores data that would materially affect the demo or any portfolio claim about operational readiness.
Tradeoff Table
| Option | Benefit | Cost | Better fit when |
|---|---|---|---|
| Backup-only posture | low effort | false confidence | data is disposable and easily recreated |
| Restore drill | real recovery evidence | drill time and temp infra cost | the capstone depends on persistent demo data |
Required Artifact
Write a backup/restore drill report with time, data loss window, and validation.
Case Study 5: 3 A.M. Runbook
Scenario: A demo reviewer asks what you would do if ingestion stops. The learner says "check logs."
Source anchor: A useful runbook starts from symptom, lists the next checks in order, gives mitigation actions, and names the escalation boundary.
Module concepts: runbook, incident response, mitigation, escalation.
Wrong Approach
Write the runbook as a vague troubleshooting paragraph.
Better Approach
Write actionable steps:
Symptom:
ingestion queue age > 10 minutes
Check:
worker health, provider API errors, DB connections
Mitigate:
pause new imports, restart worker, replay failed jobs
Escalate:
provider outage / data corruption
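The runbook above is structured enough to be stored as data and turned into an ordered step list. The following is a sketch only: the `Runbook` type and `next_steps` function are hypothetical, and the rule that escalation replaces mitigation once an escalation condition is seen is one possible reading of the escalation boundary.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Runbook:
    symptom: str
    threshold: timedelta
    checks: tuple[str, ...]
    mitigations: tuple[str, ...]
    escalate_when: tuple[str, ...]


INGESTION_STALLED = Runbook(
    symptom="ingestion queue age",
    threshold=timedelta(minutes=10),
    checks=("worker health", "provider API errors", "DB connections"),
    mitigations=("pause new imports", "restart worker", "replay failed jobs"),
    escalate_when=("provider outage", "data corruption"),
)


def next_steps(runbook: Runbook, queue_age: timedelta,
               escalation_seen: bool = False) -> list[str]:
    """Return ordered actions, or an empty list if the symptom is absent."""
    if queue_age <= runbook.threshold:
        return []
    steps = [f"check: {item}" for item in runbook.checks]
    if escalation_seen:
        # Assumed policy: stop local mitigation once the boundary is crossed.
        reasons = " / ".join(runbook.escalate_when)
        steps.append(f"escalate: {reasons}")
    else:
        steps.extend(f"mitigate: {item}" for item in runbook.mitigations)
    return steps


steps = next_steps(INGESTION_STALLED, queue_age=timedelta(minutes=12))
```

Keeping the checks in a fixed order is the point: under pressure the operator reads the list top to bottom instead of deciding where to start.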
Failure Mode
Under pressure, the operator improvises, misses the fastest checks, and cannot explain where mitigation ends and escalation begins.
Project / Capstone Connection
Use this to prepare the top three incident responses your capstone would plausibly need during review or demo.
Tradeoff Table
| Option | Benefit | Cost | Better fit when |
|---|---|---|---|
| Informal troubleshooting notes | little writing overhead | inconsistent incident response | the system is throwaway and low consequence |
| Structured runbook | repeatable response and clearer defense | upkeep when the system changes | the capstone has real failure paths to explain |
Required Artifact
Write three incident runbooks for the capstone's most likely failures.
Source Map
| Source | Use it for |
|---|---|
| Google SRE SLOs | defining user-centered SLIs, SLOs, and error-budget policy |
| Google SRE monitoring | shaping dashboards around symptoms and operator questions |
| OWASP Threat Modeling Cheat Sheet | structuring STRIDE review around assets, boundaries, and mitigations |
Completion Standard
- One SLO is measured.
- One dashboard answers user-outcome questions.
- STRIDE review has at least one tested mitigation.
- Backup restore is drilled.
- Three runbooks are written.