Module 4: Operational Readiness & Security Review: Case Studies

These case studies turn the capstone into something you can operate and defend under scrutiny.

Case Study 1: SLO For The Critical Path

Scenario: The capstone says "reliable" but has no measured user outcome.

Source anchor: The "Service Level Objectives" chapter of Google's SRE book frames SLOs as explicit targets on measured SLIs rather than vague reliability claims.

Module concepts: SLI, SLO, error budget, user journey.

Wrong Approach

Use uptime as the only reliability measure.

Better Approach

Define one capstone SLO:

SLI:
  successful analysis requests / valid analysis requests

SLO:
  99% over 7 days (error budget: 1% of valid analysis requests in that window)

Consequence:
  stop feature work if missed before demo
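
The SLI, SLO, and error budget above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a prescribed implementation: the class name `SloWindow`, its field names, and the sample request counts are all assumptions made for illustration. Only the 99% target comes from the case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO window (for example, the last 7 days)."""

    valid_requests: int
    successful_requests: int
    target: float = 0.99  # the 99% SLO from the case study

    @property
    def sli(self) -> float:
        # With no valid traffic there is nothing to fail, so report 1.0.
        if self.valid_requests == 0:
            return 1.0
        return self.successful_requests / self.valid_requests

    @property
    def allowed_failures(self) -> float:
        # Error budget: the share of valid requests that may fail in the window.
        return (1.0 - self.target) * self.valid_requests

    @property
    def budget_remaining(self) -> float:
        failures = self.valid_requests - self.successful_requests
        return self.allowed_failures - failures

    @property
    def slo_met(self) -> bool:
        return self.sli >= self.target


# Illustrative counts only; real values would come from request logs or metrics.
window = SloWindow(valid_requests=2000, successful_requests=1985)
print(f"SLI={window.sli:.4f} met={window.slo_met} budget_left={window.budget_remaining:.1f}")
```

A negative `budget_remaining` is the concrete trigger for the consequence in the policy: the point at which feature work stops.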

Failure Mode

The capstone claims reliability without a user-centered metric, so there is no trigger for operational action and nothing concrete to defend in review.

Project / Capstone Connection

Use this to choose the one success measure your capstone should monitor through the final demo period.

Tradeoff Table

Option	Benefit	Cost	Better fit when
Uptime-only target	easy to collect	weak tie to user outcome	the system is pure infrastructure
User-path SLO	meaningful reliability evidence	needs clearer instrumentation	the capstone serves a concrete workflow

Required Artifact

Write one SLI/SLO/error-budget policy for the capstone.

Case Study 2: Dashboard That Answers No Question

Scenario: The dashboard has CPU, memory, disk, and request count, but cannot answer "are users succeeding?"

Source anchor: Google SRE monitoring guidance emphasizes symptom-first monitoring: dashboards should answer user and dependency questions before they become metric galleries.

Module concepts: dashboard, symptom, golden signals, runbook.

Wrong Approach

Graph everything the platform exposes.

Better Approach

Answer three questions:

Can users complete the critical path?
Is the system slow?
Which dependency is causing failures?
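
One way to keep the dashboard question-led is to write the spec as data, so every panel has to name the question it answers. This is a sketch under stated assumptions: the `Panel` type, the question keys, the metric names, and the drill-down descriptions are hypothetical and would be replaced by whatever the capstone actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Panel:
    question: str    # key for the operator question this panel answers
    metric: str      # hypothetical metric name
    drill_down: str  # where to look next (logs or traces)


REQUIRED_QUESTIONS = ["users_succeeding", "system_slow", "failing_dependency"]

DASHBOARD = [
    Panel("users_succeeding", "analysis_success_ratio", "logs: failed analysis requests"),
    Panel("system_slow", "analysis_latency_p95_seconds", "traces: slowest analysis requests"),
    Panel("failing_dependency", "dependency_error_rate", "logs: upstream errors by dependency"),
]


def unanswered(panels: list[Panel], questions: list[str]) -> list[str]:
    """Return the required questions that no panel claims to answer."""
    covered = {panel.question for panel in panels}
    return [question for question in questions if question not in covered]


print(unanswered(DASHBOARD, REQUIRED_QUESTIONS))
```

A panel that cannot be assigned to one of the three questions is a candidate to move off the main dashboard.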

Failure Mode

An incident starts, but the dashboard cannot distinguish user harm from infrastructure noise, so diagnosis stalls and the wrong fixes get attempted.

Project / Capstone Connection

Use this when deciding which panels belong on the one dashboard you will actually open during a capstone incident or defense.

Tradeoff Table

Option	Benefit	Cost	Better fit when
Exhaustive dashboard	broad metric coverage	low signal and poor scan speed	expert operators need deep internals view
Question-led dashboard	faster diagnosis	some raw metrics move elsewhere	the capstone needs a reviewer-friendly ops story

Required Artifact

Create a three-question dashboard spec with panels and links to logs/traces.

Case Study 3: STRIDE Review Finds A Real Mitigation

Scenario: The capstone lets users connect a GitHub token. A security section says "use HTTPS" and stops there.

Source anchor: The OWASP Threat Modeling Cheat Sheet gives a practical STRIDE structure for walking assets, boundaries, threats, and mitigations.

Module concepts: STRIDE, token handling, trust boundary, mitigation.

Wrong Approach

Treat security review as a checklist completed after implementation.

Better Approach

Walk one threat fully:

Threat:
  information disclosure of GitHub token

Mitigation:
  encrypt at rest, scoped token, short retention, redact logs

Evidence:
  test log redaction and access policy
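
The "redact logs" mitigation and its evidence can be sketched with Python's standard `logging` module. This is an illustration, not the capstone's required design: the logger name, the `RedactTokens` class, and the token pattern are assumptions. The pattern targets GitHub's published token prefixes (`ghp_`, `gho_`, `ghu_`, `ghs_`, `ghr_`, `github_pat_`) and should be checked against the token types the capstone actually accepts.

```python
import io
import logging
import re

# Assumed token shapes, based on GitHub's published prefixes; verify before relying on it.
TOKEN_PATTERN = re.compile(r"\b(?:gh[pousr]_|github_pat_)[A-Za-z0-9_]+")


class RedactTokens(logging.Filter):
    """Replace anything that looks like a GitHub token before the record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merges msg with args
        record.msg = TOKEN_PATTERN.sub("[REDACTED]", message)
        record.args = ()  # already merged, so avoid formatting twice
        return True


def build_logger(stream: io.StringIO) -> logging.Logger:
    logger = logging.getLogger("capstone.github")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(RedactTokens())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
fake_token = "ghp_" + "a" * 36  # fabricated value for the test, not a real credential
log.info("connecting with token %s", fake_token)
output = buffer.getvalue()
print(output, end="")
```

Capturing the handler output in a buffer is what turns the mitigation into evidence: the test fails if a token-shaped string ever reaches the log stream.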

Failure Mode

The security section stays generic, missing the real trust boundary where sensitive credentials can leak or be over-privileged.

Project / Capstone Connection

Use this for any capstone flow that handles tokens, personal data, or privileged actions across a system boundary.

Tradeoff Table

Option	Benefit	Cost	Better fit when
Generic checklist review	quick completion	shallow threat coverage	the feature has almost no sensitive data
STRIDE-driven review	targeted mitigations	more analysis effort	the capstone crosses meaningful trust boundaries

Required Artifact

Write a STRIDE table and one mitigation test.

Case Study 4: Backup That Was Never Restored

Scenario: The database has automated backups, but no one has restored one. The final demo depends on that data.

Source anchor: Reliability practice treats recovery as a tested capability, so backup value is only proven when restore steps, timing, and validation have been exercised.

Module concepts: backup, restore, RPO, RTO, drill.

Wrong Approach

Assume backup equals recovery.

Better Approach

Drill:

take backup
restore to separate environment
run smoke test
record RTO/RPO
document failure points
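
The "record RTO/RPO" step comes down to three timestamps from the drill. The sketch below assumes a simple definition: measured RPO is the gap between the last good backup and the simulated loss, and measured RTO is the gap between the simulated loss and the moment the smoke test passes on restored data. The `RestoreDrill` type and the sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreDrill:
    last_backup_at: datetime        # when the restored backup was taken
    incident_at: datetime           # simulated moment of data loss
    smoke_test_passed_at: datetime  # restored system validated

    @property
    def measured_rpo(self) -> timedelta:
        # Data written after the backup and before the incident is lost.
        return self.incident_at - self.last_backup_at

    @property
    def measured_rto(self) -> timedelta:
        # Time from the incident until the restored system passes validation.
        return self.smoke_test_passed_at - self.incident_at


# Illustrative timestamps only; a real drill records its own.
drill = RestoreDrill(
    last_backup_at=datetime(2024, 1, 10, 2, 0),
    incident_at=datetime(2024, 1, 10, 9, 30),
    smoke_test_passed_at=datetime(2024, 1, 10, 10, 15),
)
print(f"RPO={drill.measured_rpo} RTO={drill.measured_rto}")
```

Comparing the measured values against the targets the capstone claims is the part of the drill report a reviewer will look at first.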

Failure Mode

The first restore attempt uncovers missing permissions, bad procedures, or invalid assumptions after data loss has already happened.

Project / Capstone Connection

Use this if your capstone stores data that would materially affect the demo or any portfolio claim about operational readiness.

Tradeoff Table

Option	Benefit	Cost	Better fit when
Backup-only posture	low effort	false confidence	data is disposable and easily recreated
Restore drill	real recovery evidence	drill time and temp infra cost	the capstone depends on persistent demo data

Required Artifact

Write a backup/restore drill report with time, data loss window, and validation.

Case Study 5: 3 A.M. Runbook

Scenario: A demo reviewer asks what you would do if ingestion stops. The learner says "check logs."

Source anchor: A useful runbook starts from symptom, lists the next checks in order, gives mitigation actions, and names the escalation boundary.

Module concepts: runbook, incident response, mitigation, escalation.

Wrong Approach

Write the runbook as a vague troubleshooting paragraph.

Better Approach

Write actionable steps:

Symptom:
  ingestion queue age > 10 minutes

Check:
  worker health, provider API errors, DB connections

Mitigate:
  pause new imports, restart worker, replay failed jobs

Escalate:
  provider outage / data corruption
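
Writing the runbook as structured data makes the order of checks and the escalation boundary explicit. This is one possible shape, offered as a sketch: the `Runbook` type and the `next_steps` helper are hypothetical, while the threshold, checks, mitigations, and escalation conditions are copied from the steps above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Runbook:
    symptom: str
    threshold: timedelta
    checks: tuple[str, ...]
    mitigations: tuple[str, ...]
    escalate_when: tuple[str, ...]


INGESTION_STALLED = Runbook(
    symptom="ingestion queue age",
    threshold=timedelta(minutes=10),
    checks=("worker health", "provider API errors", "DB connections"),
    mitigations=("pause new imports", "restart worker", "replay failed jobs"),
    escalate_when=("provider outage", "data corruption"),
)


def next_steps(runbook: Runbook, observed_age: timedelta) -> list[str]:
    """Return ordered checks, then mitigations, once the symptom threshold is crossed."""
    if observed_age <= runbook.threshold:
        return []
    checks = [f"check: {item}" for item in runbook.checks]
    mitigations = [f"mitigate: {item}" for item in runbook.mitigations]
    return checks + mitigations


for step in next_steps(INGESTION_STALLED, timedelta(minutes=12)):
    print(step)
```

Escalation conditions are kept as a separate field rather than a step, since they mark where the operator stops mitigating and hands off.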

Failure Mode

Under pressure, the operator improvises, misses the fastest checks, and cannot explain where mitigation ends and escalation begins.

Project / Capstone Connection

Use this to prepare the top three incident responses your capstone would plausibly need during review or demo.

Tradeoff Table

Option	Benefit	Cost	Better fit when
Informal troubleshooting notes	little writing overhead	inconsistent incident response	the system is throwaway and low consequence
Structured runbook	repeatable response and clearer defense	upkeep when the system changes	the capstone has real failure paths to explain

Required Artifact

Write three incident runbooks for the capstone's most likely failures.

Source Map

Source	Use it for
Google SRE SLOs	defining user-centered SLIs, SLOs, and error-budget policy
Google SRE monitoring	shaping dashboards around symptoms and operator questions
OWASP Threat Modeling Cheat Sheet	structuring STRIDE review around assets, boundaries, and mitigations

Completion Standard

One SLO is measured.
One dashboard answers user-outcome questions.
STRIDE review has at least one tested mitigation.
Backup restore is drilled.
Three runbooks are written.
