Module 5: Cloud Security & Observability: Case Studies

These case studies connect threat modeling, identity, secrets, telemetry, alerts, and runbooks into one operational security story.


Case Study 1: STRIDE Threat Model For A Public API

Scenario: A public upload API accepts files, stores metadata, and lets users download processed results. The team wants a "security review" but has no structured threat model.

Source anchor: the OWASP Threat Modeling Cheat Sheet, which provides an actionable threat-modeling reference.

Module concepts: STRIDE, trust boundary, data flow diagram, mitigations.

Wrong Approach

Review only the code, and only after implementation.

Better Approach

Draw data flows and classify threats:

Spoofing:
fake user token

Tampering:
modified object metadata

Repudiation:
missing audit event

Information disclosure:
public object URL leak
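
STRIDE has six categories, and the list above gives examples for four of them. A minimal sketch of how the classification could be recorded so that gaps become visible is shown here. The `Threat` class, the `uncovered_categories` helper, and the mitigation strings are illustrative assumptions, not part of the OWASP cheat sheet.

```python
from dataclasses import dataclass
from enum import Enum


class Stride(Enum):
    SPOOFING = "Spoofing"
    TAMPERING = "Tampering"
    REPUDIATION = "Repudiation"
    INFORMATION_DISCLOSURE = "Information disclosure"
    DENIAL_OF_SERVICE = "Denial of service"
    ELEVATION_OF_PRIVILEGE = "Elevation of privilege"


@dataclass(frozen=True)
class Threat:
    category: Stride
    description: str
    mitigation: str  # hypothetical mitigation, for illustration only


# The four example threats from the list above; mitigations are assumptions.
THREATS = [
    Threat(Stride.SPOOFING, "fake user token",
           "validate token signature and expiry"),
    Threat(Stride.TAMPERING, "modified object metadata",
           "server-side integrity checks"),
    Threat(Stride.REPUDIATION, "missing audit event",
           "append-only audit log"),
    Threat(Stride.INFORMATION_DISCLOSURE, "public object URL leak",
           "short-lived signed URLs"),
]


def uncovered_categories(threats):
    """Return STRIDE categories that have no threat recorded yet."""
    covered = {t.category for t in threats}
    return [c for c in Stride if c not in covered]


if __name__ == "__main__":
    for category in uncovered_categories(THREATS):
        print(f"No threat recorded for: {category.value}")
```

Run against the four entries, the check reports denial of service and elevation of privilege as still unexamined, which is the kind of blind spot a structured review is meant to surface.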

Tradeoff Table

| Choice | Gain | Cost |
| --- | --- | --- |
| no model | fast | blind spots |
| STRIDE review | systematic | facilitation time |
| checklist only | easy | shallow context |
| abuse-case testing | realistic | more test work |

Required Artifact

Create a data-flow diagram with trust boundaries, STRIDE table, mitigations, and residual risk.


Case Study 2: Static Database Password In App Config

Scenario: A production app uses a long-lived database password stored in an environment variable. The value leaks in a debug dump.

Source anchor: HashiCorp Vault Database secrets engine, which generates dynamic database credentials from configured roles.

Module concepts: secret management, dynamic secrets, rotation, lease, blast radius.

Wrong Approach

Store one shared password forever and rotate only after incidents.

Better Approach

Use dynamic credentials:

app authenticates to Vault
Vault issues short-lived DB credential
credential expires/revokes
role limits privileges
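
The four steps above can be illustrated with a toy, in-memory model of a credential lease. This is a sketch only: `ToyCredentialBroker`, `Lease`, and the role name and TTL are hypothetical, and nothing here calls or imitates the real Vault API. It shows just the lifecycle of issue, expire, and revoke.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Lease:
    username: str
    password: str
    role: str
    expires_at: datetime
    revoked: bool = False

    def is_valid(self, now: datetime) -> bool:
        """A lease is usable only if not revoked and not yet expired."""
        return not self.revoked and now < self.expires_at


class ToyCredentialBroker:
    """In-memory stand-in for a secrets manager (hypothetical, not Vault)."""

    def __init__(self, role_ttls: dict[str, timedelta]):
        self._role_ttls = role_ttls
        self._leases: list[Lease] = []

    def issue(self, role: str, now: datetime) -> Lease:
        """Mint a unique, short-lived credential bound to a configured role."""
        if role not in self._role_ttls:
            raise PermissionError(f"unknown role: {role}")
        lease = Lease(
            username=f"v-{role}-{secrets.token_hex(4)}",
            password=secrets.token_urlsafe(24),
            role=role,
            expires_at=now + self._role_ttls[role],
        )
        self._leases.append(lease)
        return lease

    def revoke_role(self, role: str) -> int:
        """Revoke every active lease for a role, e.g. after a leak."""
        count = 0
        for lease in self._leases:
            if lease.role == role and not lease.revoked:
                lease.revoked = True
                count += 1
        return count


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    broker = ToyCredentialBroker({"readonly": timedelta(minutes=15)})
    lease = broker.issue("readonly", now)
    print(lease.username, "valid now:", lease.is_valid(now))
    print("valid in 1 hour:", lease.is_valid(now + timedelta(hours=1)))
```

Because each lease has its own password and expiry, a value leaked in a debug dump is useful to an attacker only until the TTL lapses or the role's leases are revoked.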

Tradeoff Table

| Choice | Gain | Cost |
| --- | --- | --- |
| static secret | simple | long-lived blast radius |
| dynamic secret | limited exposure | Vault dependency |
| manual rotation | low tooling | drift and outages |
| automated rotation | safer | integration complexity |

Required Artifact

Write a secret lifecycle: creation, delivery, rotation, revocation, audit, and incident response.


Case Study 3: High-Cardinality Metrics Break Observability

Scenario: A service adds user_id as a metric label. Cardinality explodes, costs rise, and dashboards slow down.

Source anchor: OpenTelemetry Semantic Conventions, which standardize attribute names and telemetry meaning.

Module concepts: metrics, labels, cardinality, semantic conventions, traces vs logs.

Wrong Approach

Put every useful field on every metric.

Better Approach

Choose telemetry type by question:

Metric:
low-cardinality aggregate labels: route, status, region

Trace:
request-specific attributes and spans

Log:
detailed event payload with sampling/retention
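
To make the cardinality concern concrete, here is a minimal sketch of a label allowlist check and a worst-case series estimate. The allowed labels come from the list above and `user_id` from the scenario. The function names and the distinct-value counts are hypothetical, and the code is independent of any OpenTelemetry SDK.

```python
from math import prod

ALLOWED_LABELS = {"route", "status", "region"}  # from the metric guidance above
BANNED_LABELS = {"user_id"}  # the label that caused the explosion in the scenario


def validate_labels(labels: dict[str, str]) -> list[str]:
    """Return a list of problems with a metric's label set (empty if fine)."""
    problems = []
    for name in sorted(labels):
        if name in BANNED_LABELS:
            problems.append(f"banned label: {name}")
        elif name not in ALLOWED_LABELS:
            problems.append(f"unlisted label: {name}")
    return problems


def max_series(distinct_values: dict[str, int]) -> int:
    """Upper bound on time series: product of distinct values per label."""
    return prod(distinct_values.values())


if __name__ == "__main__":
    # Hypothetical distinct-value counts, chosen only to show the arithmetic.
    baseline = {"route": 20, "status": 5, "region": 4}
    with_user = {**baseline, "user_id": 100_000}
    print("baseline series:", max_series(baseline))
    print("with user_id:", max_series(with_user))
    print(validate_labels({"route": "/upload", "user_id": "42"}))
```

With these assumed counts the upper bound goes from 400 series to 40,000,000 once `user_id` is added, which is why per-user detail belongs on traces or logs rather than on metric labels.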

Tradeoff Table

| Telemetry | Gain | Risk |
| --- | --- | --- |
| metric | cheap aggregate alerting | cardinality explosion |
| log | detailed forensic context | volume/cost |
| trace | request path | sampling gaps |
| semantic conventions | consistency | mapping discipline |

Required Artifact

Write a telemetry schema: metric names, allowed labels, trace attributes, log fields, and banned labels.


Case Study 4: Alert Pages On Cause Instead Of Symptom

Scenario: An alert pages whenever CPU exceeds 80%. At night, batch jobs trigger pages though users are unaffected. During a real checkout outage, CPU stays normal.

Source anchor: Google SRE monitoring guidance, which recommends paging on symptoms users care about and using cause alerts for tickets and debugging. See the "Monitoring Distributed Systems" chapter of the Google SRE book.

Module concepts: symptom alert, cause alert, SLO, runbook, on-call fatigue.

Wrong Approach

Every scary metric pages.

Better Approach

Page on user-impacting symptoms:

Page:
checkout success rate below SLO
p99 latency above SLO

Ticket:
CPU high but no symptom
disk trending full in 7 days
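
The page/ticket split above can be sketched as a small routing function. The `Snapshot` fields and every threshold below are hypothetical placeholders; real values would come from the service's SLOs and alert windows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    checkout_success_rate: float  # fraction in [0, 1] over the alert window
    p99_latency_ms: float
    cpu_utilization: float  # fraction in [0, 1]
    days_until_disk_full: float  # from a trend projection


# Hypothetical thresholds, for illustration only.
SUCCESS_RATE_SLO = 0.995
P99_LATENCY_SLO_MS = 800.0
CPU_TICKET_THRESHOLD = 0.80
DISK_TICKET_DAYS = 7.0


def route_alerts(s: Snapshot) -> dict[str, list[str]]:
    """Page only on user-facing symptoms; send cause signals to tickets."""
    pages, tickets = [], []
    if s.checkout_success_rate < SUCCESS_RATE_SLO:
        pages.append("checkout success rate below SLO")
    if s.p99_latency_ms > P99_LATENCY_SLO_MS:
        pages.append("p99 latency above SLO")
    # High CPU never pages by itself; with no symptom it becomes a ticket.
    if s.cpu_utilization > CPU_TICKET_THRESHOLD and not pages:
        tickets.append("CPU high but no symptom")
    if s.days_until_disk_full <= DISK_TICKET_DAYS:
        tickets.append("disk trending full in 7 days")
    return {"page": pages, "ticket": tickets}


if __name__ == "__main__":
    night_batch = Snapshot(0.999, 300.0, 0.95, 30.0)
    checkout_outage = Snapshot(0.90, 300.0, 0.40, 30.0)
    print("night batch:", route_alerts(night_batch))
    print("checkout outage:", route_alerts(checkout_outage))
```

Under these assumptions the night batch job produces a ticket and no page, while the checkout outage pages even though CPU is normal, which addresses both failures described in the scenario.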

Tradeoff Table

| Alert type | Gain | Cost |
| --- | --- | --- |
| cause page | early signal | noisy |
| symptom page | user-impact aligned | may be later |
| ticket alert | planned work | not immediate |
| no alert | quiet | hidden failure |

Required Artifact

Write an alert spec: symptom, threshold, window, severity, runbook, dashboard link, and non-page causes.


Case Study 5: Image Supply Chain Without Provenance

Scenario: A container image is built from a base image tagged latest, pulls unpinned packages, and is deployed without scanning or signing. A vulnerable package ships unnoticed.

Source anchor: SLSA, which provides a framework for supply-chain integrity levels, and Sigstore Cosign, which supports signing container artifacts. See the SLSA framework and Sigstore Cosign documentation.

Module concepts: image hardening, provenance, signing, scanning, SBOM.

Wrong Approach

"It built successfully, so it is safe to deploy."

Better Approach

Add supply-chain controls:

base image pinned
dependencies locked
image scanned
SBOM generated
image signed
deployment verifies signature
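
The controls above can be sketched as a release gate that refuses to deploy while any control is unmet. This is a minimal illustration: `ReleaseEvidence` and `release_blockers` are hypothetical names, and the boolean fields stand in for results that a real pipeline would obtain from its scanner, SBOM generator, and signature verifier. No Cosign or scanner command is invoked here.

```python
import re
from dataclasses import dataclass

# An image reference pinned by digest ends in @sha256:<64 hex characters>.
DIGEST_PINNED = re.compile(r"^[^\s@]+@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    """True if the reference names an immutable digest, not just a tag."""
    return DIGEST_PINNED.match(image_ref) is not None


@dataclass(frozen=True)
class ReleaseEvidence:
    base_image: str
    lockfile_present: bool  # dependencies locked
    scan_passed: bool  # image scanned with acceptable findings
    sbom_generated: bool
    signature_verified: bool  # checked at deploy time


def release_blockers(evidence: ReleaseEvidence) -> list[str]:
    """Return the unmet controls; an empty list means the gate passes."""
    checks = [
        (is_pinned_by_digest(evidence.base_image),
         "base image not pinned by digest"),
        (evidence.lockfile_present, "dependencies not locked"),
        (evidence.scan_passed, "image scan missing or failed"),
        (evidence.sbom_generated, "SBOM not generated"),
        (evidence.signature_verified, "signature not verified at deploy"),
    ]
    return [message for ok, message in checks if not ok]


if __name__ == "__main__":
    scenario = ReleaseEvidence("python:latest", False, False, False, False)
    for blocker in release_blockers(scenario):
        print("BLOCKED:", blocker)
```

Applied to the scenario image, every control is reported as a blocker. An exception policy would then decide which blockers may be waived, by whom, and for how long.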

Tradeoff Table

| Control | Gain | Cost |
| --- | --- | --- |
| scan | finds known CVEs | false positives |
| SBOM | inventory | generation/storage |
| signing | provenance | keyless/key workflow |
| pinned base | repeatability | patch update process |

Required Artifact

Write a container release security checklist with scan, SBOM, signing, and exception policy.


Source Map

| Source | Use it for |
| --- | --- |
| OWASP Threat Modeling Cheat Sheet | STRIDE and threat-model process |
| Vault Database Secrets Engine | dynamic database credentials |
| OpenTelemetry Semantic Conventions | telemetry attributes and consistency |
| Google SRE monitoring | symptom vs cause monitoring |
| SLSA | supply-chain integrity model |
| Sigstore Cosign | artifact signing |

Completion Standard

  • At least three artifacts are completed.
  • At least one artifact includes a STRIDE table.
  • At least one artifact defines a secret lifecycle.
  • At least one artifact defines alert/runbook behavior.
  • At least one artifact covers supply-chain controls.