Module 5: Cloud Security & Observability: Case Studies

These case studies connect threat modeling, identity, secrets, telemetry, alerts, and runbooks into one operational security story.

Case Study 1: STRIDE Threat Model For A Public API

Scenario: A public upload API accepts files, stores metadata, and lets users download processed results. The team wants a "security review" but has no structured threat model.

Source anchor: the OWASP Threat Modeling Cheat Sheet, which provides an actionable threat-modeling reference.

Module concepts: STRIDE, trust boundary, data flow diagram, mitigations.

Wrong Approach

Review only the code, and only after implementation.

Better Approach

Draw the data flows, mark the trust boundaries, and classify threats by STRIDE category, for example:

Spoofing:
  fake user token

Tampering:
  modified object metadata

Repudiation:
  missing audit event

Information disclosure:
  public object URL leak
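
One way to keep the STRIDE table reviewable is to hold it as data and check it for gaps. The sketch below is a minimal illustration, not a prescribed format: the flow names, the mitigation text, and the helper names (Threat, uncovered_categories, unmitigated) are assumptions made for this example. It records the four threats listed above and reports which STRIDE categories still have no entry and which threats have no mitigation.

```python
from dataclasses import dataclass
from enum import Enum


class Stride(Enum):
    SPOOFING = "Spoofing"
    TAMPERING = "Tampering"
    REPUDIATION = "Repudiation"
    INFORMATION_DISCLOSURE = "Information disclosure"
    DENIAL_OF_SERVICE = "Denial of service"
    ELEVATION_OF_PRIVILEGE = "Elevation of privilege"


@dataclass(frozen=True)
class Threat:
    category: Stride
    flow: str             # data flow the threat applies to (hypothetical names)
    description: str
    mitigation: str = ""  # empty string means no mitigation recorded yet


def uncovered_categories(threats):
    """Return STRIDE categories with no recorded threat, in enum order."""
    seen = {t.category for t in threats}
    return [c for c in Stride if c not in seen]


def unmitigated(threats):
    """Return threats that have no mitigation text."""
    return [t for t in threats if not t.mitigation.strip()]


# Threat descriptions come from the list above; flows and mitigations are
# illustrative assumptions.
THREATS = [
    Threat(Stride.SPOOFING, "client -> upload API", "fake user token",
           "validate token signature and expiry"),
    Threat(Stride.TAMPERING, "API -> metadata store", "modified object metadata"),
    Threat(Stride.REPUDIATION, "API -> audit log", "missing audit event",
           "emit an audit event per upload and download"),
    Threat(Stride.INFORMATION_DISCLOSURE, "object store -> client",
           "public object URL leak", "short-lived signed URLs"),
]

for category in uncovered_categories(THREATS):
    print("no threat recorded for:", category.value)
for threat in unmitigated(THREATS):
    print("no mitigation for:", threat.description)
```

Run against the four example threats, the check flags Denial of service and Elevation of privilege as categories the review has not yet considered, which is the kind of blind spot a structured model is meant to surface.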

Tradeoff Table

Choice	Gain	Cost
no model	fast	blind spots
STRIDE review	systematic	facilitation time
checklist only	easy	shallow context
abuse-case testing	realistic	more test work

Required Artifact

Create a data-flow diagram with trust boundaries, STRIDE table, mitigations, and residual risk.

Case Study 2: Static Database Password In App Config

Scenario: A production app uses a long-lived database password stored in an environment variable. The value leaks in a debug dump.

Source anchor: HashiCorp Vault Database secrets engine, which generates dynamic database credentials from configured roles.

Module concepts: secret management, dynamic secrets, rotation, lease, blast radius.

Wrong Approach

Store one shared password forever and rotate only after incidents.

Better Approach

Use dynamic credentials:

app authenticates to Vault
Vault issues a short-lived DB credential under a lease
credential expires with its lease or is revoked early
database role limits the credential's privileges
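
The lease behaviour in the steps above can be sketched with a toy in-memory broker. This is not the Vault API: CredentialBroker, Lease, the role name "readonly", and the 300-second TTL are all hypothetical names and values chosen for illustration. The sketch shows the three properties that matter for blast radius: each request gets a unique credential, the credential stops working when its lease expires or is revoked, and the role bounds what it can do.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Lease:
    lease_id: str
    username: str
    password: str
    role: str
    expires_at: float
    revoked: bool = False


class CredentialBroker:
    """Toy stand-in for a secrets manager. NOT the Vault API."""

    def __init__(self, roles, clock):
        self._roles = roles    # role name -> (set of privileges, ttl in seconds)
        self._clock = clock    # injected so expiry can be shown without sleeping
        self._leases = {}

    def issue(self, role):
        """Create a unique, short-lived credential for a configured role."""
        _, ttl = self._roles[role]  # raises KeyError for unknown roles
        lease = Lease(
            lease_id=secrets.token_hex(8),
            username=f"v-{role}-{secrets.token_hex(4)}",
            password=secrets.token_urlsafe(24),
            role=role,
            expires_at=self._clock() + ttl,
        )
        self._leases[lease.lease_id] = lease
        return lease

    def revoke(self, lease_id):
        """Invalidate a credential before its lease expires."""
        self._leases[lease_id].revoked = True

    def is_valid(self, lease_id):
        lease = self._leases.get(lease_id)
        return (lease is not None and not lease.revoked
                and self._clock() < lease.expires_at)

    def allowed(self, lease_id, privilege):
        """A privilege is allowed only for a valid lease whose role grants it."""
        if not self.is_valid(lease_id):
            return False
        privileges, _ = self._roles[self._leases[lease_id].role]
        return privilege in privileges


# Usage with a fake clock (a one-element list so the lambda can see updates).
now = [1000.0]
broker = CredentialBroker({"readonly": ({"SELECT"}, 300)}, clock=lambda: now[0])
lease = broker.issue("readonly")
```

If a credential issued this way leaks in a debug dump, the exposure is bounded by the lease TTL and the role's privileges, and the single lease can be revoked without rotating a password shared by every instance.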

Tradeoff Table

Choice	Gain	Cost
static secret	simple	long-lived blast radius
dynamic secret	limited exposure	vault dependency
manual rotation	low tooling	drift and outages
automated rotation	safer	integration complexity

Required Artifact

Write a secret lifecycle: creation, delivery, rotation, revocation, audit, and incident response.

Case Study 3: High-Cardinality Metrics Break Observability

Scenario: A service adds user_id as a metric label. Cardinality explodes, costs rise, and dashboards slow down.

Source anchor: OpenTelemetry Semantic Conventions, which standardize attribute names and telemetry meaning.

Module concepts: metrics, labels, cardinality, semantic conventions, traces vs logs.

Wrong Approach

Put every useful field on every metric.

Better Approach

Choose telemetry type by question:

Metric:
  low-cardinality aggregate labels: route, status, region

Trace:
  request-specific attributes and spans

Log:
  detailed event payload, with sampling and retention limits
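
A short sketch can make the split concrete and show why user_id does not belong on a metric. It assumes an allowlist of the three labels named above; the function names and the value counts (20 routes, 5 status classes, 3 regions, 100,000 users) are hypothetical. The worst-case number of time series for one metric is the product of the distinct values of each label.

```python
import math

# Low-cardinality labels from the text above; anything else goes to traces/logs.
ALLOWED_METRIC_LABELS = {"route", "status", "region"}


def split_attributes(attrs):
    """Split request attributes into metric labels and trace/log attributes."""
    metric_labels = {k: v for k, v in attrs.items() if k in ALLOWED_METRIC_LABELS}
    trace_attrs = {k: v for k, v in attrs.items() if k not in ALLOWED_METRIC_LABELS}
    return metric_labels, trace_attrs


def worst_case_series(distinct_values):
    """Upper bound on time series for one metric.

    distinct_values maps label name -> number of distinct values it can take.
    """
    return math.prod(distinct_values.values())


# Hypothetical value counts for illustration.
baseline = {"route": 20, "status": 5, "region": 3}
with_user_id = {**baseline, "user_id": 100_000}

print(worst_case_series(baseline))      # 20 * 5 * 3 = 300
print(worst_case_series(with_user_id))  # 300 * 100000 = 30000000

labels, trace_attrs = split_attributes(
    {"route": "/upload", "status": "200", "region": "eu", "user_id": "u-123"}
)
```

The bound is a worst case because not every combination occurs in practice, but it is a useful review-time check: one unbounded label multiplies every other label's contribution.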

Tradeoff Table

Telemetry	Gain	Risk
metric	cheap aggregate alerting	cardinality explosion
log	detailed forensic context	volume/cost
trace	request path	sampling gaps
semantic conventions	consistency	mapping discipline

Required Artifact

Write a telemetry schema: metric names, allowed labels, trace attributes, log fields, and banned labels.

Case Study 4: Alert Pages On Cause Instead Of Symptom

Scenario: An alert pages whenever CPU exceeds 80%. At night, batch jobs trigger pages even though users are unaffected. During a real checkout outage, CPU stays normal and nobody is paged.

Source anchor: Google SRE monitoring guidance, which recommends paging on symptoms users care about and using cause alerts for tickets and debugging. See "Monitoring Distributed Systems" in the Google SRE book.

Module concepts: symptom alert, cause alert, SLO, runbook, on-call fatigue.

Wrong Approach

Every scary-looking metric pages someone.

Better Approach

Page on user-impacting symptoms:

Page:
  checkout success rate below SLO
  p99 latency above SLO

Ticket:
  CPU high but no symptom
  disk trending full in 7 days
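
The routing rule above can be sketched as a small function that separates pages from tickets. The thresholds (99.5% success, 800 ms p99, 80% CPU, 7 days of disk headroom) and all names below are assumptions for illustration; real values would come from the service's SLO. The two scenario cases are then replayed against it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    checkout_success_rate: float  # fraction 0..1 over the evaluation window
    p99_latency_ms: float
    cpu_utilization: float        # fraction 0..1
    disk_days_until_full: float   # projected from the current growth trend


# Hypothetical thresholds; real values come from the SLO and capacity plan.
SLO_SUCCESS_RATE = 0.995
SLO_P99_LATENCY_MS = 800.0
CPU_TICKET_THRESHOLD = 0.80
DISK_TICKET_DAYS = 7.0


def route_alerts(s):
    """Return (pages, tickets): symptoms page, causes without symptoms ticket."""
    pages, tickets = [], []
    if s.checkout_success_rate < SLO_SUCCESS_RATE:
        pages.append("checkout success rate below SLO")
    if s.p99_latency_ms > SLO_P99_LATENCY_MS:
        pages.append("p99 latency above SLO")
    # High CPU alone never pages; it becomes a ticket only when no symptom fired.
    if s.cpu_utilization > CPU_TICKET_THRESHOLD and not pages:
        tickets.append("CPU high but no symptom")
    if s.disk_days_until_full <= DISK_TICKET_DAYS:
        tickets.append("disk trending full in 7 days")
    return pages, tickets


# Night batch job: CPU is high, users are unaffected.
night_batch = Snapshot(0.999, 300.0, 0.95, 30.0)
# Checkout outage: CPU looks normal, success rate has dropped.
checkout_outage = Snapshot(0.90, 300.0, 0.40, 30.0)

print(route_alerts(night_batch))
print(route_alerts(checkout_outage))
```

Under these assumed thresholds the night batch job produces a ticket and no page, while the checkout outage pages even though CPU is normal, which is the reverse of what the CPU-only alert did in the scenario.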

Tradeoff Table

Alert type	Gain	Cost
cause page	early signal	noisy
symptom page	user-impact aligned	may fire later than a cause alert
ticket alert	planned work	not immediate
no alert	quiet	hidden failure

Required Artifact

Write an alert spec: symptom, threshold, window, severity, runbook, dashboard link, and non-page causes.

Case Study 5: Image Supply Chain Without Provenance

Scenario: A container image is built from a base image tagged latest, pulls unpinned packages, and is deployed without scanning or signing. A vulnerable package ships unnoticed.

Source anchor: SLSA provides a framework for supply-chain integrity levels; Sigstore Cosign supports signing container artifacts. See the SLSA framework and Sigstore Cosign.

Module concepts: image hardening, provenance, signing, scanning, SBOM.

Wrong Approach

"It built successfully, so it is safe to deploy."

Better Approach

Add supply-chain controls:

base image pinned
dependencies locked
image scanned
SBOM generated
image signed
deployment verifies signature
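
The controls above can be enforced as a release gate that fails closed. The sketch below is only an illustration: it does not invoke a scanner or Cosign, and it assumes the pipeline has already recorded each control's result in a hypothetical ReleaseCandidate record. It also assumes "pinned" means pinned by sha256 digest, which is one common interpretation.

```python
import re
from dataclasses import dataclass

# A digest-pinned reference ends in @sha256:<64 hex chars>.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


@dataclass(frozen=True)
class ReleaseCandidate:
    base_image: str           # base image reference used by the build
    lockfile_present: bool    # dependencies resolved from a lockfile
    scanned: bool             # a vulnerability scan ran on the final image
    critical_findings: int    # unresolved critical findings from that scan
    sbom_path: str            # where the generated SBOM is stored ("" if none)
    signature_verified: bool  # deploy-time signature verification result


def failed_controls(rc):
    """Return the list of failed controls; an empty list means the gate passes."""
    failures = []
    if not DIGEST_RE.search(rc.base_image):
        failures.append("base image not pinned by digest")
    if not rc.lockfile_present:
        failures.append("dependencies not locked")
    if not rc.scanned:
        failures.append("image not scanned")
    elif rc.critical_findings > 0:
        failures.append("unresolved critical scan findings")
    if not rc.sbom_path:
        failures.append("SBOM not generated")
    if not rc.signature_verified:
        failures.append("image signature not verified")
    return failures


# The scenario's image: built from latest, nothing else in place.
scenario = ReleaseCandidate("python:latest", False, False, 0, "", False)
# A candidate that satisfies every control (digest value is a placeholder).
hardened = ReleaseCandidate(
    "python@sha256:" + "a" * 64, True, True, 0, "sbom/app.spdx.json", True
)

for failure in failed_controls(scenario):
    print("BLOCKED:", failure)
```

Returning every failed control rather than stopping at the first gives the exception policy something concrete to reference: an exception names a specific failure and an expiry, not a blanket bypass of the gate.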

Tradeoff Table

Control	Gain	Cost
scan	finds known CVEs	false positives
SBOM	inventory	generation/storage
signing	provenance	key-based or keyless signing workflow
pinned base	repeatability	patch update process

Required Artifact

Write a container release security checklist with scan, SBOM, signing, and exception policy.

Source Map

Source	Use it for
OWASP Threat Modeling Cheat Sheet	STRIDE and threat-model process
Vault Database Secrets Engine	dynamic database credentials
OpenTelemetry Semantic Conventions	telemetry attributes and consistency
Google SRE monitoring	symptom vs cause monitoring
SLSA	supply-chain integrity model
Sigstore Cosign	artifact signing

Completion Standard

At least three artifacts are completed.
At least one artifact includes a STRIDE table.
At least one artifact defines a secret lifecycle.
At least one artifact defines alert/runbook behavior.
At least one artifact covers supply-chain controls.
