Module 5: Cloud Security & Observability: Case Studies
These case studies connect threat modeling, identity, secrets, telemetry, alerts, and runbooks into one operational security story.
Case Study 1: STRIDE Threat Model For A Public API
Scenario: A public upload API accepts files, stores metadata, and lets users download processed results. The team wants a "security review" but has no structured threat model.
Source anchor: the OWASP Threat Modeling Cheat Sheet, which provides an actionable threat-modeling reference.
Module concepts: STRIDE, trust boundary, data flow diagram, mitigations.
Wrong Approach
Review only the code, and only after it has been implemented.
Better Approach
Draw data flows and classify threats:
Spoofing:
fake user token
Tampering:
modified object metadata
Repudiation:
missing audit event
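Information disclosure:
public object URL leak
Denial of service:
oversized or unbounded uploads exhaust storage
Elevation of privilege:
user downloads another user's processed results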
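One way to keep the threat list reviewable is to hold it as data and check it for coverage. The sketch below is illustrative only: the Threat class, the mitigation strings, and the examples for the last two STRIDE categories (denial of service, elevation of privilege) are assumptions added for this example, not part of the source scenario.

```python
from dataclasses import dataclass

STRIDE_CATEGORIES = (
    "Spoofing",
    "Tampering",
    "Repudiation",
    "Information disclosure",
    "Denial of service",
    "Elevation of privilege",
)


@dataclass(frozen=True)
class Threat:
    category: str    # one of STRIDE_CATEGORIES
    example: str     # concrete threat against the upload API
    mitigation: str  # candidate control (hypothetical, for illustration)


STRIDE_TABLE = [
    Threat("Spoofing", "fake user token",
           "validate token signature, issuer, and expiry"),
    Threat("Tampering", "modified object metadata",
           "integrity check on metadata writes"),
    Threat("Repudiation", "missing audit event",
           "append-only audit log for uploads and downloads"),
    Threat("Information disclosure", "public object URL leak",
           "short-lived signed URLs"),
    # The two rows below are assumed examples that complete the acronym.
    Threat("Denial of service", "oversized uploads exhaust storage",
           "size limits and rate limiting"),
    Threat("Elevation of privilege", "user reads another user's results",
           "per-object authorization check"),
]


def missing_categories(table):
    """Return STRIDE categories that have no threat recorded yet."""
    covered = {t.category for t in table}
    return [c for c in STRIDE_CATEGORIES if c not in covered]


def render(table):
    """Render the threats as a pipe table, matching the tables in this module."""
    lines = ["| Category | Threat | Mitigation |", "|---|---|---|"]
    lines += [f"| {t.category} | {t.example} | {t.mitigation} |" for t in table]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(STRIDE_TABLE))
    print("uncovered:", missing_categories(STRIDE_TABLE))
```

Running missing_categories during review makes an uncovered category visible as an explicit gap rather than a silent omission.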
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| no model | fast | blind spots |
| STRIDE review | systematic | facilitation time |
| checklist only | easy | shallow context |
| abuse-case testing | realistic | more test work |
Required Artifact
Create a data-flow diagram with trust boundaries, STRIDE table, mitigations, and residual risk.
Case Study 2: Static Database Password In App Config
Scenario: A production app uses a long-lived database password stored in an environment variable. The value leaks in a debug dump.
Source anchor: HashiCorp Vault Database secrets engine, which generates dynamic database credentials from configured roles.
Module concepts: secret management, dynamic secrets, rotation, lease, blast radius.
Wrong Approach
Store one shared password forever and rotate only after incidents.
Better Approach
Use dynamic credentials:
app authenticates to Vault
Vault issues a short-lived DB credential
credential expires with its lease or is revoked early
role limits privileges
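The lifecycle above can be sketched with a toy in-memory broker. This is not the Vault API: ToyCredentialBroker, Lease, the role name, and the TTL values are all hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Lease:
    username: str
    password: str
    role: str
    expires_at: float      # absolute time in seconds
    revoked: bool = False

    def is_valid(self, now: float) -> bool:
        """A credential is usable only if unrevoked and not yet expired."""
        return not self.revoked and now < self.expires_at


class ToyCredentialBroker:
    """In-memory stand-in for a secrets manager (hypothetical, not Vault)."""

    def __init__(self, role_ttls):
        self._role_ttls = dict(role_ttls)  # role name -> TTL in seconds
        self._leases = []

    def issue(self, role: str, now: float) -> Lease:
        """Issue a unique, short-lived credential bound to a configured role."""
        if role not in self._role_ttls:
            raise PermissionError(f"unknown role: {role}")
        lease = Lease(
            username=f"v-{role}-{secrets.token_hex(4)}",
            password=secrets.token_urlsafe(24),
            role=role,
            expires_at=now + self._role_ttls[role],
        )
        self._leases.append(lease)
        return lease

    def revoke_role(self, role: str) -> int:
        """Incident response: revoke every not-yet-revoked lease for a role."""
        count = 0
        for lease in self._leases:
            if lease.role == role and not lease.revoked:
                lease.revoked = True
                count += 1
        return count


if __name__ == "__main__":
    broker = ToyCredentialBroker({"readonly": 3600.0})
    lease = broker.issue("readonly", now=0.0)
    print(lease.username, lease.is_valid(now=60.0), lease.is_valid(now=7200.0))
```

Because each caller gets its own credential with its own expiry, a leaked value is useful only until the lease ends or the role is revoked, which is the blast-radius argument in the tradeoff table.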
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| static secret | simple | long-lived blast radius |
| dynamic secret | limited exposure | vault dependency |
| manual rotation | low tooling | drift and outages |
| automated rotation | safer | integration complexity |
Required Artifact
Write a secret lifecycle: creation, delivery, rotation, revocation, audit, and incident response.
Case Study 3: High-Cardinality Metrics Break Observability
Scenario: A service adds user_id as a metric label. Cardinality explodes, costs rise, and dashboards slow down.
Source anchor: OpenTelemetry Semantic Conventions, which standardize attribute names and telemetry meaning.
Module concepts: metrics, labels, cardinality, semantic conventions, traces vs logs.
Wrong Approach
Put every useful field on every metric.
Better Approach
Choose telemetry type by question:
Metric:
low-cardinality aggregate labels: route, status, region
Trace:
request-specific attributes and spans
Log:
detailed event payload with sampling/retention
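A label allowlist and a rough series-count estimate make the metric rule above checkable. In this sketch the allowed labels come from the list above and user_id from the scenario; the function names and the distinct-value counts are hypothetical numbers chosen only to show the arithmetic.

```python
from math import prod

ALLOWED_LABELS = {"route", "status", "region"}  # low-cardinality labels
BANNED_LABELS = {"user_id"}                     # unbounded, per-user values


def check_labels(labels):
    """Return a list of problems for a metric's label set (empty if clean)."""
    problems = []
    for name in sorted(labels):
        if name in BANNED_LABELS:
            problems.append(f"banned label: {name}")
        elif name not in ALLOWED_LABELS:
            problems.append(f"label not in allowlist: {name}")
    return problems


def max_series(distinct_values):
    """Upper bound on time series for one metric.

    Each distinct combination of label values is its own series, so the
    bound is the product of the distinct-value counts per label.
    """
    return prod(distinct_values.values())


if __name__ == "__main__":
    base = {"route": 20, "status": 5, "region": 4}   # assumed counts
    print(max_series(base))                          # 20 * 5 * 4
    print(max_series({**base, "user_id": 100_000}))  # multiplied by user count
    print(check_labels({"route": "/upload", "user_id": "42"}))
```

With these assumed counts, the bound goes from 400 series to 40,000,000 once user_id is added, which is why per-user detail belongs on traces or logs instead.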
Tradeoff Table
| Telemetry | Gain | Risk |
|---|---|---|
| metric | cheap aggregate alerting | cardinality explosion |
| log | detailed forensic context | volume/cost |
| trace | request path | sampling gaps |
| semantic conventions | consistency | mapping discipline |
Required Artifact
Write a telemetry schema: metric names, allowed labels, trace attributes, log fields, and banned labels.
Case Study 4: Alert Pages On Cause Instead Of Symptom
Scenario: An alert pages whenever CPU exceeds 80%. At night, batch jobs trigger pages though users are unaffected. During a real checkout outage, CPU stays normal.
Source anchor: the Google SRE book chapter "Monitoring Distributed Systems", which recommends paging on symptoms users care about and routing cause alerts to tickets and debugging.
Module concepts: symptom alert, cause alert, SLO, runbook, on-call fatigue.
Wrong Approach
Every scary metric pages.
Better Approach
Page on user-impacting symptoms:
Page:
checkout success rate below SLO
p99 latency above SLO
Ticket:
CPU high but no symptom
disk projected to fill within 7 days
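The page-versus-ticket split above can be expressed as a small routing function. This is a sketch under stated assumptions: the 80% CPU level and the 7-day disk horizon come from the text, while the SLO values (99.5% success, 800 ms p99) and all class and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE = "page"
    TICKET = "ticket"
    NONE = "none"


@dataclass(frozen=True)
class Signals:
    checkout_success_rate: float  # fraction of successful checkouts, 0..1
    p99_latency_ms: float
    cpu_utilization: float        # 0..1
    days_until_disk_full: float


@dataclass(frozen=True)
class Thresholds:
    min_success_rate: float = 0.995     # assumed SLO
    max_p99_latency_ms: float = 800.0   # assumed SLO
    cpu_ticket_level: float = 0.80      # from the scenario
    disk_ticket_days: float = 7.0       # from the list above


def route(s: Signals, t: Thresholds = Thresholds()) -> Action:
    """Page only on user-facing symptoms; send causes to a ticket queue."""
    if (s.checkout_success_rate < t.min_success_rate
            or s.p99_latency_ms > t.max_p99_latency_ms):
        return Action.PAGE
    if (s.cpu_utilization > t.cpu_ticket_level
            or s.days_until_disk_full <= t.disk_ticket_days):
        return Action.TICKET
    return Action.NONE


if __name__ == "__main__":
    nightly_batch = Signals(0.999, 300.0, 0.95, 30.0)     # hot CPU, happy users
    checkout_outage = Signals(0.90, 300.0, 0.40, 30.0)    # normal CPU, failing users
    print(route(nightly_batch), route(checkout_outage))
```

Both failure modes from the scenario fall out of the ordering: the batch job produces a ticket rather than a page, and the checkout outage pages even though CPU is normal.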
Tradeoff Table
| Alert type | Gain | Cost |
|---|---|---|
| cause page | early signal | noisy |
| symptom page | user-impact aligned | may be later |
| ticket alert | planned work | not immediate |
| no alert | quiet | hidden failure |
Required Artifact
Write an alert spec: symptom, threshold, window, severity, runbook, dashboard link, and non-page causes.
Case Study 5: Image Supply Chain Without Provenance
Scenario: A container image is built from a base image tagged latest, pulls unpinned packages, and is deployed without scanning or signing. A vulnerable package ships unnoticed.
Source anchor: the SLSA framework, which defines supply-chain integrity levels, and Sigstore Cosign, which supports signing container artifacts.
Module concepts: image hardening, provenance, signing, scanning, SBOM.
Wrong Approach
"It built successfully, so it is safe to deploy."
Better Approach
Add supply-chain controls:
base image pinned
dependencies locked
image scanned
SBOM generated
image signed
deployment verifies signature
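The controls above can be combined into a single release gate that reports every failed check. The sketch below does not call a scanner, an SBOM generator, or Cosign; it assumes their results have already been collected into a hypothetical ReleaseCandidate record. It also assumes that "pinned" means pinned by sha256 digest and that the policy blocks on any critical finding.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: a pinned base image is referenced as name@sha256:<64 hex chars>.
DIGEST_PINNED = re.compile(r"^[^\s@]+@sha256:[0-9a-f]{64}$")


@dataclass(frozen=True)
class ReleaseCandidate:
    base_image: str            # base image reference used by the build
    has_lockfile: bool         # dependencies resolved from a lockfile
    scan_completed: bool       # a vulnerability scan ran on the image
    critical_findings: int     # unresolved critical findings from the scan
    sbom_path: Optional[str]   # where the generated SBOM is stored
    signature_verified: bool   # result of verifying the image signature


def gate(rc: ReleaseCandidate) -> list:
    """Return the failed checks; an empty list means the release may proceed."""
    failures = []
    if not DIGEST_PINNED.match(rc.base_image):
        failures.append("base image is not pinned by digest")
    if not rc.has_lockfile:
        failures.append("dependencies are not locked")
    if not rc.scan_completed:
        failures.append("image was not scanned")
    elif rc.critical_findings > 0:
        failures.append(f"{rc.critical_findings} critical finding(s) unresolved")
    if not rc.sbom_path:
        failures.append("SBOM was not generated")
    if not rc.signature_verified:
        failures.append("signature was not verified")
    return failures


if __name__ == "__main__":
    scenario = ReleaseCandidate("python:latest", False, False, 0, None, False)
    for failure in gate(scenario):
        print("BLOCK:", failure)
```

Returning the full list rather than stopping at the first failure gives the exception policy something concrete to reference: each waived check can be named and time-boxed.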
Tradeoff Table
| Control | Gain | Cost |
|---|---|---|
| scan | finds known CVEs | false positives |
| SBOM | inventory | generation/storage |
| signing | verifiable origin and integrity | key or keyless signing workflow |
| pinned base | repeatability | patch update process |
Required Artifact
Write a container release security checklist with scan, SBOM, signing, and exception policy.
Source Map
| Source | Use it for |
|---|---|
| OWASP Threat Modeling Cheat Sheet | STRIDE and threat-model process |
| Vault Database Secrets Engine | dynamic database credentials |
| OpenTelemetry Semantic Conventions | telemetry attributes and consistency |
| Google SRE monitoring | symptom vs cause monitoring |
| SLSA | supply-chain integrity model |
| Sigstore Cosign | artifact signing |
Completion Standard
- At least four artifacts are completed, because the requirements below draw on four different case studies.
- At least one artifact includes a STRIDE table.
- At least one artifact defines a secret lifecycle.
- At least one artifact defines alert/runbook behavior.
- At least one artifact covers supply-chain controls.