Module 4: Operational Readiness & Security Review: Case Studies

These case studies turn the capstone into something you can operate and defend under scrutiny.


Case Study 1: SLO For The Critical Path

Scenario: The capstone says "reliable" but has no measured user outcome.

Source anchor: The Service Level Objectives chapter of the Google SRE book frames SLOs as explicit targets on measured SLIs rather than vague reliability claims.

Module concepts: SLI, SLO, error budget, user journey.

Wrong Approach

Use uptime as the only reliability measure.

Better Approach

Define one capstone SLO:

SLI:
successful analysis requests / valid analysis requests

SLO:
99% over 7 days

Consequence:
pause feature work if the SLO is missed before the demo
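
The SLI, SLO, and error budget above reduce to a few lines of arithmetic. The following is a minimal sketch, not a prescribed implementation: the names `SloReport` and `evaluate_slo` are hypothetical, the request counts are made-up sample values, and it assumes you can already count successful and valid analysis requests for the 7-day window.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float            # fraction of valid requests that succeeded
    budget_total: float   # failed requests the SLO tolerates in the window
    budget_used: float    # failed requests actually observed
    slo_met: bool


def evaluate_slo(successful: int, valid: int, target: float = 0.99) -> SloReport:
    """Compare the measured SLI for one window against the SLO target."""
    if valid == 0:
        # Assumption: a window with no valid traffic counts as meeting the SLO.
        return SloReport(sli=1.0, budget_total=0.0, budget_used=0.0, slo_met=True)
    sli = successful / valid
    budget_total = (1.0 - target) * valid
    budget_used = float(valid - successful)
    return SloReport(sli, budget_total, budget_used, sli >= target)


# Sample numbers for one 7-day window (illustrative only).
report = evaluate_slo(successful=4960, valid=5000)
print(f"SLI={report.sli:.4f} met={report.slo_met} "
      f"budget used {report.budget_used:.0f} of {report.budget_total:.0f}")
```

When `slo_met` is false, the consequence in the policy applies. Tracking `budget_used` against `budget_total` during the window gives an earlier warning than waiting for the final ratio.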

Failure Mode

The capstone claims reliability without a user-centered metric, so there is no trigger for operational action and nothing concrete to defend in review.

Project / Capstone Connection

Use this to choose the one success measure your capstone should monitor through the final demo period.

Tradeoff Table

Option | Benefit | Cost | Better fit when
Uptime-only target | easy to collect | weak tie to user outcome | the system is pure infrastructure
User-path SLO | meaningful reliability evidence | needs clearer instrumentation | the capstone serves a concrete workflow

Required Artifact

Write one SLI/SLO/error-budget policy for the capstone.


Case Study 2: Dashboard That Answers No Question

Scenario: The dashboard has CPU, memory, disk, and request count, but cannot answer "are users succeeding?"

Source anchor: Google SRE monitoring guidance emphasizes symptom-first monitoring: dashboards should answer user and dependency questions before they become metric galleries.

Module concepts: dashboard, symptom, golden signals, runbook.

Wrong Approach

Graph everything the platform exposes.

Better Approach

Answer three questions:

Can users complete the critical path?
Is the system slow?
Which dependency is causing failures?
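
One way to keep the dashboard tied to those three questions is to write the spec as data and check that every question has at least one panel. This is only a sketch: the `Panel` and `Question` types, the panel titles, and the metric names are all hypothetical placeholders, not identifiers from any real monitoring platform.

```python
from dataclasses import dataclass, field


@dataclass
class Panel:
    title: str
    metric: str      # hypothetical metric name, replace with your own
    drilldown: str   # where the operator looks next: "logs" or "traces"


@dataclass
class Question:
    text: str
    panels: list = field(default_factory=list)


DASHBOARD = [
    Question("Can users complete the critical path?", [
        Panel("Analysis success ratio", "analysis_success_ratio", "traces"),
    ]),
    Question("Is the system slow?", [
        Panel("Analysis latency p95", "analysis_latency_p95_seconds", "traces"),
    ]),
    Question("Which dependency is causing failures?", [
        Panel("Errors by dependency", "dependency_errors_total", "logs"),
    ]),
]


def unanswered(spec):
    """Return the text of every question that has no panel yet."""
    return [q.text for q in spec if not q.panels]


print(unanswered(DASHBOARD))
```

A panel that cannot be attached to one of the questions is a candidate for a secondary dashboard rather than the one opened during an incident.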

Failure Mode

An incident starts, but the dashboard cannot distinguish user harm from infrastructure noise, so diagnosis stalls and the wrong fixes get attempted.

Project / Capstone Connection

Use this when deciding which panels belong on the one dashboard you will actually open during a capstone incident or defense.

Tradeoff Table

Option | Benefit | Cost | Better fit when
Exhaustive dashboard | broad metric coverage | low signal and poor scan speed | expert operators need deep internals view
Question-led dashboard | faster diagnosis | some raw metrics move elsewhere | the capstone needs a reviewer-friendly ops story

Required Artifact

Create a three-question dashboard spec with panels and links to logs/traces.


Case Study 3: STRIDE Review Finds A Real Mitigation

Scenario: The capstone lets users connect a GitHub token. A security section says "use HTTPS" and stops there.

Source anchor: The OWASP Threat Modeling Cheat Sheet gives a practical STRIDE structure for walking assets, boundaries, threats, and mitigations.

Module concepts: STRIDE, token handling, trust boundary, mitigation.

Wrong Approach

Treat security review as a checklist to complete after implementation.

Better Approach

Walk one threat fully:

Threat:
information disclosure of GitHub token

Mitigation:
encrypt at rest, scoped token, short retention, redact logs

Evidence:
test log redaction and access policy
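
The "redact logs" mitigation and its evidence can be sketched as a logging filter plus a check that a token never reaches the formatted message. This is an illustrative sketch under assumptions: `redact` and `RedactTokens` are hypothetical names, and the pattern only targets GitHub's published token prefixes (`ghp_`, `gho_`, `ghu_`, `ghs_`, `ghr_`, `github_pat_`) with an assumed minimum length, so it should be verified against the token types your capstone actually accepts.

```python
import logging
import re

# Assumption: tokens start with a known GitHub prefix followed by 20+ characters.
TOKEN_PATTERN = re.compile(
    r"\b(?:gh[pousr]_[A-Za-z0-9]{20,}|github_pat_[A-Za-z0-9_]{20,})\b"
)


def redact(text: str) -> str:
    """Replace anything that looks like a GitHub token with a marker."""
    return TOKEN_PATTERN.sub("[REDACTED]", text)


class RedactTokens(logging.Filter):
    """Logging filter that redacts token-like strings before output."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its args first, then redact the result.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True


logger = logging.getLogger("capstone.ingest")  # hypothetical logger name
logger.addFilter(RedactTokens())

fake_token = "ghp_" + "A1b2" * 9  # synthetic value, not a real credential
print(redact(f"auth failed for token={fake_token}"))
```

A filter attached to a logger only covers records created through that logger, so the mitigation test should exercise the same logging path the token-handling code uses.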

Failure Mode

The security section stays generic, missing the real trust boundary where sensitive credentials can leak or be over-privileged.

Project / Capstone Connection

Use this for any capstone flow that handles tokens, personal data, or privileged actions across a system boundary.

Tradeoff Table

Option | Benefit | Cost | Better fit when
Generic checklist review | quick completion | shallow threat coverage | the feature has almost no sensitive data
STRIDE-driven review | targeted mitigations | more analysis effort | the capstone crosses meaningful trust boundaries

Required Artifact

Write a STRIDE table and one mitigation test.


Case Study 4: Backup That Was Never Restored

Scenario: The database has automated backups, but no one has restored one. The final demo depends on that data.

Source anchor: Reliability practice treats recovery as a tested capability, so backup value is only proven when restore steps, timing, and validation have been exercised.

Module concepts: backup, restore, RPO, RTO, drill.

Wrong Approach

Assume backup equals recovery.

Better Approach

Drill:

take backup
restore to separate environment
run smoke test
record recovery time and data-loss window against the RTO/RPO targets
document failure points
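
The timing part of the drill can be recorded with three timestamps and compared against the targets. The sketch below is one possible shape, assuming you note when the backup was taken, when the simulated failure happened, and when the smoke test passed; `DrillReport`, `score_drill`, and all sample times and targets are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillReport:
    data_loss_window: timedelta   # last good backup -> simulated failure
    recovery_time: timedelta      # simulated failure -> smoke test passed
    meets_rpo: bool
    meets_rto: bool


def score_drill(backup_at, failure_at, verified_at, rpo_target, rto_target):
    """Turn drill timestamps into measured values and compare with targets."""
    if not (backup_at <= failure_at <= verified_at):
        raise ValueError("expected backup_at <= failure_at <= verified_at")
    loss = failure_at - backup_at
    recovery = verified_at - failure_at
    return DrillReport(loss, recovery, loss <= rpo_target, recovery <= rto_target)


# Sample drill values (illustrative only).
drill = score_drill(
    backup_at=datetime(2024, 1, 1, 2, 0),
    failure_at=datetime(2024, 1, 1, 9, 30),
    verified_at=datetime(2024, 1, 1, 10, 15),
    rpo_target=timedelta(hours=24),
    rto_target=timedelta(minutes=30),
)
print(drill)
```

In this sample the data-loss window is within target but recovery takes longer than the 30-minute target, which is the kind of gap the drill exists to surface before real data loss.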

Failure Mode

The first restore attempt uncovers missing permissions, bad procedures, or invalid assumptions after data loss has already happened.

Project / Capstone Connection

Use this if your capstone stores data that would materially affect the demo or any portfolio claim about operational readiness.

Tradeoff Table

Option | Benefit | Cost | Better fit when
Backup-only posture | low effort | false confidence | data is disposable and easily recreated
Restore drill | real recovery evidence | drill time and temp infra cost | the capstone depends on persistent demo data

Required Artifact

Write a backup/restore drill report with time, data loss window, and validation.


Case Study 5: 3 A.M. Runbook

Scenario: A demo reviewer asks what you would do if ingestion stops. The learner says "check logs."

Source anchor: A useful runbook starts from symptom, lists the next checks in order, gives mitigation actions, and names the escalation boundary.

Module concepts: runbook, incident response, mitigation, escalation.

Wrong Approach

Write the runbook as a vague troubleshooting paragraph.

Better Approach

Write actionable steps:

Symptom:
ingestion queue age > 10 minutes

Check:
worker health, provider API errors, DB connections

Mitigate:
pause new imports, restart worker, replay failed jobs

Escalate:
provider outage / data corruption
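
The same runbook can be kept as structured data next to the check that triggers it, so the symptom threshold is testable rather than remembered. This is a sketch only: the `Runbook` type, `INGESTION_STALLED`, and `queue_is_stalled` are hypothetical names, and it assumes you can read the enqueue time of the oldest job in the ingestion queue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Runbook:
    symptom: str
    checks: tuple         # ordered: fastest checks first
    mitigations: tuple
    escalate_when: tuple


INGESTION_STALLED = Runbook(
    symptom="ingestion queue age > 10 minutes",
    checks=("worker health", "provider API errors", "DB connections"),
    mitigations=("pause new imports", "restart worker", "replay failed jobs"),
    escalate_when=("provider outage", "data corruption"),
)

QUEUE_AGE_LIMIT = timedelta(minutes=10)


def queue_is_stalled(oldest_enqueued_at: datetime, now: datetime) -> bool:
    """True when the oldest queued job has waited longer than the limit."""
    return now - oldest_enqueued_at > QUEUE_AGE_LIMIT


now = datetime.now(timezone.utc)
if queue_is_stalled(now - timedelta(minutes=12), now):
    print("Symptom:", INGESTION_STALLED.symptom)
    for step in INGESTION_STALLED.checks:
        print("Check:", step)
```

Keeping checks in a fixed order makes the first few minutes of response repeatable, and the `escalate_when` field states where mitigation ends.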

Failure Mode

Under pressure, the operator improvises, misses the fastest checks, and cannot explain where mitigation ends and escalation begins.

Project / Capstone Connection

Use this to prepare the top three incident responses your capstone would plausibly need during review or demo.

Tradeoff Table

Option | Benefit | Cost | Better fit when
Informal troubleshooting notes | little writing overhead | inconsistent incident response | the system is throwaway and low consequence
Structured runbook | repeatable response and clearer defense | upkeep when the system changes | the capstone has real failure paths to explain

Required Artifact

Write three incident runbooks for the capstone's most likely failures.


Source Map

Source | Use it for
Google SRE SLOs | defining user-centered SLIs, SLOs, and error-budget policy
Google SRE monitoring | shaping dashboards around symptoms and operator questions
OWASP Threat Modeling Cheat Sheet | structuring STRIDE review around assets, boundaries, and mitigations

Completion Standard

  • One SLO is measured.
  • One dashboard answers user-outcome questions.
  • STRIDE review has at least one tested mitigation.
  • Backup restore is drilled.
  • Three runbooks are written.