Module 5: Cloud Security & Observability

Primary teacher: this guide and the linked official docs (OWASP, NIST, AWS / GCP / Azure Well-Architected, HashiCorp Vault, OpenTelemetry, Prometheus, Google SRE book). Selective support: the local semester book chunks (Pro Git, The Linux Command Line, Git from the Bottom Up), used only where they sharpen operational habits. These books are only loosely related to the topic, so this module leans on external references on purpose.

You do not need to read any full book to finish this module. You do need to become operationally strong at threat modeling cloud systems, managing secrets and keys, instrumenting services with metrics, logs, and traces, and operating those services under real alerts.

Scope of This Module

This is where engineering turns into operations. The earlier modules in this semester got a system deployed; this module makes it safe to run and observable under pressure.

What it covers in depth:

threat modeling using STRIDE against real cloud services
identity-centric security and the "IAM is the new perimeter" mindset
defense in depth across network, host, app, and data layers
secret management: vaults, dynamic secrets, and rotation workflows
envelope encryption with data keys and key-encryption keys, and the difference between at-rest and in-transit
data classification and minimization as reusable engineering habits
network controls: security groups, NACLs, and VPC endpoints as the "network moat"
image hardening, minimal base images, and supply-chain scanning
runtime detection and response tooling (CSPM, CWPP) at a conceptual level
metrics, structured logs, and distributed tracing as the three observability pillars
sampling strategies, cardinality control, and the semantic conventions that make OpenTelemetry data useful
dashboards that answer a question, alerts that page on symptoms, and runbooks that are still useful at 3 a.m.

What it deliberately does not try to finish here:

full cryptographic engineering beyond what operators need
SIEM / SOAR tooling deep dives or any specific vendor deep dive
incident management as an organizational discipline (that lives in Semester 10)
capacity planning or chaos engineering beyond brief motivation

Before You Start

Answer these closed-book before starting the main path:

What is the difference between authentication and authorization, and which one is usually wrong in a cloud breach?
What does "defense in depth" mean concretely on a 3-tier web app that also holds customer data?
Why is a hard-coded API key in a source repo a different class of problem than a misconfigured firewall rule?
What does TLS protect, and what does it not protect, about the traffic between two services?
Why is a metric with per-user labels often worse than no metric at all?
What is the difference between an alert that pages on a cause and an alert that pages on a symptom?

Diagnostic Interpretation

5-6 solid answers

You are ready for the full path.

3-4 solid answers

Continue, but expect extra time in the secrets/encryption and observability pillars clusters.

0-2 solid answers

Before this module, revisit Module 1 (cloud platform fundamentals), Module 3 (container orchestration), and the security / observability tracks in Semester 8 Module 4, or be ready to slow down deliberately.

What This Module Is For

Production systems fail. The cheap failures are loud: a CPU pegs, a deployment crashes, a 500 rate spikes. The expensive failures are silent: a leaked credential is used for six weeks before anyone notices, a misrouted log stream dumps secrets into a low-trust bucket, a dashboard says everything is green while a customer-visible path has been broken for two days.

This module builds the reasoning and tooling for the second category:

threat modeling so you know what you are defending and from whom
identity, secrets, encryption, and network controls so that "what if someone got in here" has a non-catastrophic answer
metrics, logs, and traces so that you can tell the difference between a working system and a dead one, quickly
alerts, dashboards, and runbooks so that the humans on call can act correctly under fatigue

You are learning to run systems that someone actually trusts.

Local Security and Observability Track

Do not buy a managed security or observability platform just to learn the control. Start locally unless provider integration is the specific lesson.

Run STRIDE and data-flow threat-model exercises against diagrams and pull requests before creating resources.
Use local or mocked secrets flows for practice: Docker secrets, sealed .env.example files, a local Vault dev server, or scripted fake dynamic credentials are acceptable when the evidence is the rotation path and access boundary.
Export OpenTelemetry traces to a local Collector, console exporter, Jaeger, or Grafana Tempo; export metrics to Prometheus; route logs to stdout, files, Loki, or another local sink.
Practice detection with local attacks and mistakes: bad IAM policy examples, overbroad Kubernetes RBAC, leaked test secrets caught by scanners, log redaction failures, and high-cardinality metrics.
Use paid cloud security, logging, tracing, or metrics services only after billing alerts, retention limits, least-privilege roles, and teardown/cleanup steps are documented.

Concept Map

How To Use This Module

Work in order. Security and observability are siblings: each fails in the dark, and each relies on the same engineering habits of naming things correctly and paying attention to what you already have.

Cluster 1: Cloud Security Foundations

Order	Concept	Type	Focus
1	Threat Modeling (STRIDE) for Cloud Services	PRIMARY	A repeatable way to ask "what can go wrong here" with real examples per STRIDE letter
2	Identity-Centric Security: The New Perimeter	PRIMARY	Why IAM, not the firewall, is the boundary in cloud systems
3	Defense in Depth: Network, Host, App, Data Layers	PRIMARY	Stacking imperfect controls so no single failure is catastrophic

Cluster mastery check: Can you run a STRIDE pass on a 3-service system and name one real control per letter?
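One way to keep a STRIDE pass honest is to record the mitigations as data and check that no letter was skipped. The sketch below is a minimal illustration, not a prescribed tool: the `draft` entries and the `missing_letters` helper are hypothetical, and the example controls are illustrative rather than recommendations for any specific system.

```python
# STRIDE letters and their standard category names.
STRIDE = {
    "S": "Spoofing",
    "T": "Tampering",
    "R": "Repudiation",
    "I": "Information disclosure",
    "D": "Denial of service",
    "E": "Elevation of privilege",
}


def missing_letters(mitigations: dict[str, list[str]]) -> list[str]:
    """Return the STRIDE letters that have no mitigation listed yet."""
    return [letter for letter in STRIDE if not mitigations.get(letter)]


# Hypothetical draft for a 3-service system; the controls are illustrative only.
draft = {
    "S": ["short-lived workload identity instead of shared API keys"],
    "T": ["signed container images verified at deploy time"],
    "R": ["append-only audit log kept in a separate account"],
    "I": ["least-privilege read access to the customer data store"],
    "D": [],
    "E": ["no wildcard actions in service IAM roles"],
}

if __name__ == "__main__":
    for letter in missing_letters(draft):
        print(f"No mitigation yet for {letter} ({STRIDE[letter]})")
```

Running the check on a pull request diagram before creating resources fits the local-first approach described earlier: the gap (here, denial of service) shows up while it is still cheap to fix.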

Cluster 2: Secrets, Keys, and Data

Order	Concept	Type	Focus
4	Secret Management: Vaults, Dynamic Secrets, Rotation	PRIMARY	Getting secrets out of source, out of env files, and into a managed lifecycle
5	Encryption: At-Rest, In-Transit, and KMS Envelope Encryption	PRIMARY	DEK/KEK flow, what each layer actually protects, and where it fails
6	Data Classification and Minimization	PRIMARY	Treating data as a liability; the habit of asking "do we even need this"

Cluster mastery check: Can you diagram the envelope-encryption flow from memory and explain why rotating the KEK does not require re-encrypting the data?
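The envelope flow behind that check can be sketched in a few lines. This is a structural sketch only: `toy_cipher` is a deliberately insecure stand-in (a SHA-256 counter keystream XORed with the data) so that the example runs with the standard library alone. In practice the cipher would be an authenticated scheme such as AES-GCM, and the KEK would stay inside a KMS or HSM rather than in process memory. All function and field names here are hypothetical.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. TOY ONLY, not real encryption."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    # XOR is its own inverse, so the same call encrypts and decrypts.
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt_record(kek: bytes, plaintext: bytes) -> dict:
    """Encrypt data under a fresh DEK, then wrap the DEK under the KEK."""
    dek = secrets.token_bytes(32)
    data_nonce = secrets.token_bytes(16)
    wrap_nonce = secrets.token_bytes(16)
    return {
        "ciphertext": toy_cipher(dek, data_nonce, plaintext),
        "data_nonce": data_nonce,
        "wrapped_dek": toy_cipher(kek, wrap_nonce, dek),
        "wrap_nonce": wrap_nonce,
    }


def decrypt_record(kek: bytes, record: dict) -> bytes:
    """Unwrap the DEK with the KEK, then decrypt the data with the DEK."""
    dek = toy_cipher(kek, record["wrap_nonce"], record["wrapped_dek"])
    return toy_cipher(dek, record["data_nonce"], record["ciphertext"])


def rotate_kek(old_kek: bytes, new_kek: bytes, record: dict) -> dict:
    """Re-wrap the DEK under the new KEK; the data ciphertext is left untouched."""
    dek = toy_cipher(old_kek, record["wrap_nonce"], record["wrapped_dek"])
    wrap_nonce = secrets.token_bytes(16)
    return {
        **record,
        "wrapped_dek": toy_cipher(new_kek, wrap_nonce, dek),
        "wrap_nonce": wrap_nonce,
    }
```

The point to notice is in `rotate_kek`: only the small wrapped DEK is rewritten, while the potentially large `ciphertext` is carried over unchanged. That is why rotating the KEK does not force re-encryption of the data itself.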

Cluster 3: Network and Runtime Security

Order	Concept	Type	Focus
7	Security Groups, NACLs, and VPC Endpoints: The Network Moat	PRIMARY	What each control actually filters and why "the moat" is now deep but narrow
8	Image Hardening and Supply-Chain Scanning	PRIMARY	Minimal base images, provenance, signing, and vulnerability scanning tied to SLSA
9	Runtime Detection and Response: CSPM and CWPP	PRIMARY	Posture vs workload protection, and why both are needed

Cluster mastery check: Can you pick the correct network control for a given goal, and explain why image hardening complements runtime detection instead of replacing it?

Cluster 4: Observability Pillars in Cloud

Order	Concept	Type	Focus
10	Metrics: Cardinality, Exemplars, and USE/RED in Cloud-Native	PRIMARY	Why high-cardinality labels are a production hazard and what to use instead
11	Structured Logging and Log Routing	PRIMARY	JSON logs with stable keys, routing pipelines, and redaction
12	Distributed Tracing: OpenTelemetry and Sampling Strategies	PRIMARY	Spans, context propagation, semantic conventions, and head vs tail sampling

Cluster mastery check: Can you instrument a single endpoint with a metric, a structured log line, and a span, each using stable naming?
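For the structured-log part of that check, a minimal sketch using only the Python standard library might look like the following. The key names (`event`, `http.route`), the `fields` attribute, and the deny-list in `REDACTED_KEYS` are assumptions chosen for illustration, not a schema this module mandates.

```python
import json
import logging

# Hypothetical deny-list: values under these keys never reach the log sink.
REDACTED_KEYS = {"password", "authorization", "api_key"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with stable top-level keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Extra structured fields arrive via logger.info(..., extra={"fields": {...}}).
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key.lower() in REDACTED_KEYS else value
        return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    handler = logging.StreamHandler()  # writes to stderr by default
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info(
        "payment.authorized",
        extra={"fields": {"http.route": "/pay", "api_key": "sk-test-123"}},
    )
```

Keeping the event name as a fixed string and putting variable data into named fields is what makes the log line queryable later. Redacting at the formatter is one possible place for that control; a routing pipeline can apply it as well.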

Cluster 5: Operating Under Observation

Order	Concept	Type	Focus
13	Dashboards That Answer Questions, Not Decorations	PRIMARY	Designing dashboards around the questions on-call asks, not the metrics you have
14	Alerting on Symptoms, Not Causes; The Silent Runner Problem	PRIMARY	Symptoms vs causes, with 2 bad and 2 good alert examples
15	Runbooks and On-Call Hygiene	SUPPORTING	Writing runbooks that survive a tired operator at 3 a.m.

Cluster mastery check: Can you take a real failure mode and produce a one-page runbook that a teammate could execute without you on the call?
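For the alerting concept in this cluster, a symptom alert can be expressed as a check on what users experience (failed requests measured against an SLO) rather than on an internal cause such as CPU usage. The sketch below assumes a 99.9% availability SLO and a burn-rate threshold of 14.4 over a one-hour window; both numbers are assumptions to tune for a real service, and `Window` and `should_page` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one evaluation window."""
    total_requests: int
    failed_requests: int


def error_ratio(window: Window) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if window.total_requests == 0:
        return 0.0
    return window.failed_requests / window.total_requests


def should_page(window: Window, slo_target: float = 0.999,
                burn_rate_threshold: float = 14.4) -> bool:
    """Page when the error budget is being spent much faster than the SLO allows."""
    error_budget = 1.0 - slo_target  # allowed failure fraction
    burn_rate = error_ratio(window) / error_budget
    return burn_rate >= burn_rate_threshold


if __name__ == "__main__":
    print(should_page(Window(total_requests=10_000, failed_requests=200)))  # 2% errors
    print(should_page(Window(total_requests=10_000, failed_requests=5)))    # 0.05% errors
```

Note that a window with zero traffic never pages here. A service that has silently stopped receiving requests would need a separate check, for example on expected minimum traffic.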

Practice Path

Order	Practice page	Focus
1	Threat Modeling Lab	Run STRIDE on a real architecture and produce mitigations
2	Secrets and Encryption Workshop	Vault workflows, envelope encryption, and rotation
3	Observability Design Clinic	Metrics, logs, traces, dashboards, alerts end-to-end
4	Security / Observability Code Katas	Five katas combining threat modeling, secrets, tracing, and alerting

Take the Module Quiz after finishing the concept and practice paths. Use the Reference and Selective Reading page and the Learning Resources page only for targeted reinforcement.

Learning Objectives

By the end of this module you should be able to:

Run a STRIDE threat-modeling pass on a cloud system and produce at least one concrete mitigation per letter.
Describe why identity is the new perimeter and design IAM policies using least privilege.
Apply defense in depth across network, host, application, and data layers and explain what each layer protects against.
Manage secrets using a vault with dynamic credentials and a documented rotation path.
Explain envelope encryption using a DEK/KEK diagram and justify where it is used and where it is not sufficient.
Classify data by sensitivity and apply minimization before storage.
Choose the correct network control (security group, NACL, VPC endpoint) for a given requirement.
Harden a container image, scan it for vulnerabilities, and reason about supply-chain integrity (SLSA, signing).
Instrument a service with metrics, structured logs, and OpenTelemetry traces using stable semantic conventions.
Write alerts that page on symptoms rather than causes and justify each with an SLO or user-visible impact.
Write a runbook that is usable at 3 a.m. by a teammate who has never seen the system.

Outputs

one threat model for a 3-service system, one mitigation per STRIDE letter
one working demo of secret retrieval and rotation through a vault-style flow (even if mocked locally)
one envelope-encryption diagram drawn from memory with the DEK/KEK lifecycle explained in sentences
one data classification sheet for the same 3-service system
one network-control decision table: security group vs NACL vs VPC endpoint
one minimal container image spec, scanned, with a written list of findings resolved vs accepted
one instrumented service spec with three named metrics, one structured log schema, and one traced endpoint using semantic conventions
one dashboard spec answering "is the service healthy for users right now" with no more than six panels
one alert spec set (at least 4 alerts), each tagged as symptom or cause with justification
one runbook for a real failure mode with entry conditions, diagnostic steps, mitigations, and a rollback step

Completion Standard

You have completed Module 5 when all of these are true:

you can walk through STRIDE on a cloud system live, not from notes
you can distinguish authentication, authorization, identity boundary, and network boundary and explain which one a given attack targets
you can diagram envelope encryption and defend the design choices
you can classify data, choose a secret store, and design a rotation workflow
you can instrument a service with metrics, logs, and traces using stable names
you can look at an alert spec and say whether it is a symptom or a cause alert and why
you can hand a runbook to a peer and have them execute it without you on the call

Reading Policy

Concept pages are the main path. External official docs (OWASP, NIST, cloud providers, OpenTelemetry, Prometheus, Google SRE book) are the main escalation.
Book chunks (Pro Git, The Linux Command Line, Git from the Bottom Up) are loosely relevant and are referenced only where they sharpen operational habits.
"See also (external)" means "open one of these if the concept page is not enough", not "read the entire source".
If you open three external links and still cannot explain the concept in your own words, write down the specific blocker before continuing.

Suggested Weekly Flow

Day	Work
1	Concepts 1-3 and one STRIDE pass on a real or real-shaped system
2	Concepts 4-6 and the secrets/encryption workshop setup
3	Concepts 7-9 and one image-hardening experiment
4	Concepts 10-12 and one instrumented endpoint with metrics, logs, and a trace
5	Concepts 13-15 and the dashboard/alert/runbook triple
6	Practice pages 1-2 and code katas 1-3
7	Practice pages 3-4, remaining katas, quiz, and mistake-log cleanup

Reference

Use Reference and Selective Reading only when a concept page plus one external link have not resolved the gap. This module is external-reference-heavy on purpose; it matches how you will actually work.

Rich Learning Pages

Model Artifact Calibration

For operational readiness evidence, compare your procedure to the runbook model artifact.
