Module 5: Cloud Security & Observability

Primary teacher: this guide and the linked official docs (OWASP, NIST, AWS / GCP / Azure Well-Architected, HashiCorp Vault, OpenTelemetry, Prometheus, Google SRE book). Selective support: the local semester book chunks (Pro Git, The Linux Command Line, Git from the Bottom Up), used only where they sharpen operational habits. These books are only loosely related to cloud security and observability, so this module leans on external references on purpose.

You do not need to read any full book to finish this module. You do need to become operationally strong at four things: threat modeling cloud systems, managing secrets and keys, instrumenting services with metrics, logs, and traces, and operating those services under real alerts.


Scope of This Module

This is where engineering turns into operations. The earlier modules in this semester got a system deployed; this module makes it safe to run and observable under pressure.

What it covers in depth:

  • threat modeling using STRIDE against real cloud services
  • identity-centric security and the "IAM is the new perimeter" mindset
  • defense in depth across network, host, app, and data layers
  • secret management: vaults, dynamic secrets, and rotation workflows
  • envelope encryption with data keys and key-encryption keys, and the difference between encryption at rest and encryption in transit
  • data classification and minimization as reusable engineering habits
  • network controls: security groups, NACLs, and VPC endpoints as the "network moat"
  • image hardening, minimal base images, and supply-chain scanning
  • runtime detection and response tooling (CSPM, CWPP) at a conceptual level
  • metrics, structured logs, and distributed tracing as the three observability pillars
  • sampling strategies, cardinality control, and the semantic conventions that make OpenTelemetry data useful
  • dashboards that answer a question, alerts that page on symptoms, and runbooks that are still useful at 3 a.m.

What it deliberately does not try to finish here:

  • full cryptographic engineering beyond what operators need
  • SIEM / SOAR tooling deep dives or any specific vendor deep dive
  • incident management as an organizational discipline (that lives in Semester 10)
  • capacity planning or chaos engineering beyond brief motivation

Before You Start

Answer these closed-book before starting the main path:

  1. What is the difference between authentication and authorization, and which one is usually wrong in a cloud breach?
  2. What does "defense in depth" mean concretely on a 3-tier web app that also holds customer data?
  3. Why is a hard-coded API key in a source repo a different class of problem than a misconfigured firewall rule?
  4. What does TLS protect, and what does it not protect, about the traffic between two services?
  5. Why is a metric with per-user labels often worse than no metric at all?
  6. What is the difference between an alert that pages on a cause and an alert that pages on a symptom?

Diagnostic Interpretation

5-6 solid answers

  • You are ready for the full path.

3-4 solid answers

  • Continue, but expect to spend extra time in Cluster 2 (Secrets, Keys, and Data) and Cluster 4 (Observability Pillars in Cloud).

0-2 solid answers

  • Before this module, revisit Module 1 (cloud platform fundamentals), Module 3 (container orchestration), and the security / observability tracks in Semester 8 Module 4, or be ready to slow down deliberately.

What This Module Is For

Production systems fail. The cheap failures are loud: a CPU pegs, a deployment crashes, a 500 rate spikes. The expensive failures are silent: a leaked credential is used for six weeks before anyone notices, a misrouted log stream dumps secrets into a low-trust bucket, a dashboard says everything is green while a customer-visible path has been broken for two days.

This module builds the reasoning and tooling for the second category:

  • threat modeling so you know what you are defending and from whom
  • identity, secrets, encryption, and network controls so that "what if someone got in here" has a non-catastrophic answer
  • metrics, logs, and traces so that you can quickly tell a working system from a dead one
  • alerts, dashboards, and runbooks so that the humans on call can act correctly under fatigue

You are learning to run systems that someone actually trusts.

Local Security and Observability Track

Do not buy a managed security or observability platform just to learn the control. Start locally unless provider integration is the specific lesson.

  • Run STRIDE and data-flow threat-model exercises against diagrams and pull requests before creating resources.
  • Use local or mocked secrets flows for practice: Docker secrets, sealed .env.example files, a local Vault dev server, or scripted fake dynamic credentials are acceptable when the evidence is the rotation path and access boundary.
  • Export OpenTelemetry traces to a local Collector, console exporter, Jaeger, or Grafana Tempo; export metrics to Prometheus; route logs to stdout, files, Loki, or another local sink.
  • Practice detection with local attacks and mistakes: bad IAM policy examples, overbroad Kubernetes RBAC, leaked test secrets caught by scanners, log redaction failures, and high-cardinality metrics.
  • Use paid cloud security, logging, tracing, or metrics services only after billing alerts, retention limits, least-privilege roles, and teardown/cleanup steps are documented.
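
One way to practice the stdout log sink and the log-redaction failure case from the list above is a small JSON formatter built on Python's standard logging module. This is a minimal sketch, not a reference implementation: the sensitive key names, the sk_test_ secret pattern, and the "fields" attribute are all hypothetical choices made for illustration.

```python
import json
import logging
import re
import sys

# Hypothetical key names treated as sensitive; adjust to your own log schema.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}
# Illustrative pattern for a made-up test secret format (not a real provider format).
SECRET_PATTERN = re.compile(r"sk_test_[A-Za-z0-9]{8,}")


def redact(fields: dict) -> dict:
    """Return a copy of fields with sensitive keys and secret-looking strings masked."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = SECRET_PATTERN.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with stable top-level keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": SECRET_PATTERN.sub("[REDACTED]", record.getMessage()),
            # "fields" is an assumed convention: callers pass extra={"fields": {...}}.
            "fields": redact(getattr(record, "fields", {})),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "demo") -> logging.Logger:
    """Create a logger that writes redacted JSON lines to stdout."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("login attempt", extra={"fields": {"user_id": "u-1", "password": "hunter2"}})
```

A useful exercise is to log a value the redactor does not know about (for example a secret nested inside a dictionary) and confirm that it leaks, since this sketch only inspects top-level keys and string values.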

Concept Map


How To Use This Module

Work in order. Security and observability are siblings: each fails in the dark, and each relies on the same engineering habits of naming things correctly and paying attention to what you already have.

Cluster 1: Cloud Security Foundations

Order | Concept | Type | Focus
1 | Threat Modeling (STRIDE) for Cloud Services | PRIMARY | A repeatable way to ask "what can go wrong here" with real examples per STRIDE letter
2 | Identity-Centric Security: The New Perimeter | PRIMARY | Why IAM, not the firewall, is the boundary in cloud systems
3 | Defense in Depth: Network, Host, App, Data Layers | PRIMARY | Stacking imperfect controls so no single failure is catastrophic

Cluster mastery check: Can you run a STRIDE pass on a 3-service system and name one real control per letter?
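
The mastery check above can be rehearsed as data. The sketch below records one threat per STRIDE letter and reports which letters still lack a mitigation. The six category names are the standard STRIDE expansion; the three services (api, worker, db), the scenarios, and the mitigations are hypothetical examples, not a recommended model.

```python
from dataclasses import dataclass

# The six STRIDE categories, keyed by letter.
STRIDE = {
    "S": "Spoofing",
    "T": "Tampering",
    "R": "Repudiation",
    "I": "Information disclosure",
    "D": "Denial of service",
    "E": "Elevation of privilege",
}


@dataclass(frozen=True)
class Threat:
    letter: str      # one of the STRIDE keys
    component: str   # service or data flow under review
    scenario: str    # what can go wrong
    mitigation: str  # the control you would name in the review


def missing_letters(threats: list[Threat]) -> list[str]:
    """Return STRIDE letters that have no threat with a non-empty mitigation."""
    covered = {t.letter for t in threats if t.mitigation.strip()}
    return [letter for letter in STRIDE if letter not in covered]


# Hypothetical 3-service system: api -> worker -> db.
model = [
    Threat("S", "api", "stolen service token replayed", "short-lived workload credentials"),
    Threat("T", "worker", "queue message altered in flight", "signed messages verified on receipt"),
    Threat("R", "api", "admin action cannot be attributed", "append-only audit log"),
    Threat("I", "db", "snapshot shared outside the account", "encryption at rest and blocked public sharing"),
    Threat("D", "api", "unauthenticated endpoint flooded", "rate limiting at the edge"),
    Threat("E", "worker", "worker role can assume an admin role", "least-privilege role scoped to one queue"),
]

if __name__ == "__main__":
    print(missing_letters(model))       # every letter covered
    print(missing_letters(model[:2]))   # only S and T covered
```

Keeping the model as structured data makes it easy to attach to a pull request and to fail a review when a letter has no named control.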

Cluster 2: Secrets, Keys, and Data

Order | Concept | Type | Focus
4 | Secret Management: Vaults, Dynamic Secrets, Rotation | PRIMARY | Getting secrets out of source, out of env files, and into a managed lifecycle
5 | Encryption: At-Rest, In-Transit, and KMS Envelope Encryption | PRIMARY | DEK/KEK flow, what each layer actually protects, and where it fails
6 | Data Classification and Minimization | PRIMARY | Treating data as a liability; the habit of asking "do we even need this"

Cluster mastery check: Can you diagram the envelope-encryption flow from memory and explain why rotating the KEK does not require re-encrypting the data?
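
The DEK/KEK flow in the mastery check above can be traced in a few lines. The sketch below is a teaching toy under loud assumptions: _toy_cipher is an XOR keystream built from HMAC-SHA256 so that the example runs on the standard library alone. It is NOT secure and provides no authentication; a real system would use an AEAD such as AES-GCM, and the KEK would stay inside a KMS or HSM rather than sit in a local variable. All function and field names here are hypothetical.

```python
import hashlib
import hmac
import secrets


def _toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with an HMAC-SHA256 keystream. Toy stand-in for a real AEAD; NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest())
        counter += 1
    # XOR is its own inverse, so the same function encrypts and decrypts.
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt_record(kek: bytes, plaintext: bytes) -> dict:
    """Encrypt data under a fresh DEK, then wrap the DEK under the KEK."""
    dek = secrets.token_bytes(32)
    data_nonce = secrets.token_bytes(16)
    wrap_nonce = secrets.token_bytes(16)
    return {
        "ciphertext": _toy_cipher(dek, data_nonce, plaintext),
        "data_nonce": data_nonce,
        "wrapped_dek": _toy_cipher(kek, wrap_nonce, dek),  # the plaintext DEK is never stored
        "wrap_nonce": wrap_nonce,
    }


def decrypt_record(kek: bytes, record: dict) -> bytes:
    """Unwrap the DEK with the KEK, then decrypt the data with the DEK."""
    dek = _toy_cipher(kek, record["wrap_nonce"], record["wrapped_dek"])
    return _toy_cipher(dek, record["data_nonce"], record["ciphertext"])


def rotate_kek(old_kek: bytes, new_kek: bytes, record: dict) -> dict:
    """Re-wrap the DEK under the new KEK; the data ciphertext is left untouched."""
    dek = _toy_cipher(old_kek, record["wrap_nonce"], record["wrapped_dek"])
    wrap_nonce = secrets.token_bytes(16)
    return {**record, "wrapped_dek": _toy_cipher(new_kek, wrap_nonce, dek), "wrap_nonce": wrap_nonce}


if __name__ == "__main__":
    old_kek, new_kek = secrets.token_bytes(32), secrets.token_bytes(32)
    rec = encrypt_record(old_kek, b"customer record")
    rotated = rotate_kek(old_kek, new_kek, rec)
    print(rotated["ciphertext"] == rec["ciphertext"])  # data bytes unchanged by rotation
    print(decrypt_record(new_kek, rotated))
```

The point to carry into the diagram is in rotate_kek: only the small wrapped DEK is rewritten, so rotation cost does not grow with the size of the data.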

Cluster 3: Network and Runtime Security

Order | Concept | Type | Focus
7 | Security Groups, NACLs, and VPC Endpoints: The Network Moat | PRIMARY | What each control actually filters and why "the moat" is now deep but narrow
8 | Image Hardening and Supply-Chain Scanning | PRIMARY | Minimal base images, provenance, signing, and vulnerability scanning tied to SLSA
9 | Runtime Detection and Response: CSPM and CWPP | PRIMARY | Posture vs workload protection, and why both are needed

Cluster mastery check: Can you pick the correct network control for a given goal, and explain why image hardening complements runtime detection instead of replacing it?

Cluster 4: Observability Pillars in Cloud

Order | Concept | Type | Focus
10 | Metrics: Cardinality, Exemplars, and USE/RED in Cloud-Native | PRIMARY | Why high-cardinality labels are a production hazard and what to use instead
11 | Structured Logging and Log Routing | PRIMARY | JSON logs with stable keys, routing pipelines, and redaction
12 | Distributed Tracing: OpenTelemetry and Sampling Strategies | PRIMARY | Spans, context propagation, semantic conventions, and head vs tail sampling

Cluster mastery check: Can you instrument a single endpoint with a metric, a structured log line, and a span, each using stable naming?
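
When choosing labels for the metric in that exercise, the cardinality hazard from concept 10 is plain arithmetic: the worst-case number of time series for one metric is the product of the number of distinct values each label can take. The sketch below makes that concrete. The label names and value counts are hypothetical, chosen only to show the scale of the jump.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case time series for one metric: the product of per-label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a request-count metric.
bounded = {"method": 5, "route": 20, "status_class": 5}
with_user = {**bounded, "user_id": 100_000}

if __name__ == "__main__":
    print(series_count(bounded))     # 5 * 20 * 5
    print(series_count(with_user))   # the same, times 100,000 users
```

With these assumed numbers, one per-user label turns 500 series into 50,000,000. Per-request identifiers such as user IDs usually belong in structured logs or trace attributes rather than in metric labels.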

Cluster 5: Operating Under Observation

Order | Concept | Type | Focus
13 | Dashboards That Answer Questions, Not Decorations | PRIMARY | Designing dashboards around the questions on-call asks, not the metrics you have
14 | Alerting on Symptoms, Not Causes; The Silent Runner Problem | PRIMARY | Symptoms vs causes, with 2 bad and 2 good alert examples
15 | Runbooks and On-Call Hygiene | SUPPORTING | Writing runbooks that survive a tired operator at 3 a.m.

Cluster mastery check: Can you take a real failure mode and produce a one-page runbook that a teammate could execute without you on the call?
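
For the alerting half of this cluster, the symptom-versus-cause distinction can be written down as a small alert spec plus a paging rule. This is a minimal sketch under stated assumptions: the AlertSpec fields, the alert names, and the fixed error-ratio threshold are all hypothetical, and a production rule would normally be evaluated over a time window in the monitoring system rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertSpec:
    name: str
    kind: str           # "symptom" (user-visible impact) or "cause" (internal condition)
    pages: bool         # True pages a human; False opens a ticket or feeds a dashboard
    justification: str  # the SLO or user impact that earns the alert its place


def error_ratio(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic at all."""
    return errors / total if total else 0.0


def should_page(spec: AlertSpec, errors: int, total: int, max_error_ratio: float) -> bool:
    """Page only for paging symptom alerts whose error ratio exceeds the threshold."""
    # Zero traffic never pages here; detecting missing traffic needs its own alert.
    return spec.pages and spec.kind == "symptom" and error_ratio(errors, total) > max_error_ratio


# Hypothetical examples of each kind.
checkout_errors = AlertSpec(
    "checkout_error_ratio_high", "symptom", True, "users cannot complete checkout"
)
node_cpu = AlertSpec(
    "node_cpu_high", "cause", False, "may or may not affect users; review in hours"
)

if __name__ == "__main__":
    print(should_page(checkout_errors, errors=30, total=1000, max_error_ratio=0.01))
    print(should_page(node_cpu, errors=30, total=1000, max_error_ratio=0.01))
```

Forcing every alert to carry a kind and a justification makes it harder to add a paging alert that has no user-visible reason to exist.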

Practice Path

Order | Practice page | Focus
1 | Threat Modeling Lab | Run STRIDE on a real architecture and produce mitigations
2 | Secrets and Encryption Workshop | Vault workflows, envelope encryption, and rotation
3 | Observability Design Clinic | Metrics, logs, traces, dashboards, alerts end-to-end
4 | Security / Observability Code Katas | Five katas combining threat modeling, secrets, tracing, and alerting

Take the Module Quiz after finishing the concept path and the practice path. Use Reference and Selective Reading, and Learning Resources, only for targeted reinforcement.


Learning Objectives

By the end of this module you should be able to:

  1. Run a STRIDE threat-modeling pass on a cloud system and produce at least one concrete mitigation per letter.
  2. Describe why identity is the new perimeter and design IAM policies using least privilege.
  3. Apply defense in depth across network, host, application, and data layers and explain what each layer protects against.
  4. Manage secrets using a vault with dynamic credentials and a documented rotation path.
  5. Explain envelope encryption using a DEK/KEK diagram and justify where it is used and where it is not sufficient.
  6. Classify data by sensitivity and apply minimization before storage.
  7. Choose the correct network control (security group, NACL, VPC endpoint) for a given requirement.
  8. Harden a container image, scan it for vulnerabilities, and reason about supply-chain integrity (SLSA, signing).
  9. Instrument a service with metrics, structured logs, and OpenTelemetry traces using stable semantic conventions.
  10. Write alerts that page on symptoms rather than causes and justify each with an SLO or user-visible impact.
  11. Write a runbook that is usable at 3 a.m. by a teammate who has never seen the system.

Outputs

  • one threat model for a 3-service system, one mitigation per STRIDE letter
  • one working demo of secret retrieval and rotation through a vault-style flow (even if mocked locally)
  • one envelope-encryption diagram drawn from memory with the DEK/KEK lifecycle explained in sentences
  • one data classification sheet for the same 3-service system
  • one network-control decision table: security group vs NACL vs VPC endpoint
  • one minimal container image spec, scanned, with a written list of findings resolved vs accepted
  • one instrumented service spec with three named metrics, one structured log schema, and one traced endpoint using semantic conventions
  • one dashboard spec answering "is the service healthy for users right now" with no more than six panels
  • one alert spec set (at least 4 alerts), each tagged as symptom or cause with justification
  • one runbook for a real failure mode with entry conditions, diagnostic steps, mitigations, and a rollback step

Completion Standard

You have completed Module 5 when all of these are true:

  • you can walk through STRIDE on a cloud system live, not from notes
  • you can distinguish authentication, authorization, identity boundary, and network boundary and explain which one a given attack targets
  • you can diagram envelope encryption and defend the design choices
  • you can classify data, choose a secret store, and design a rotation workflow
  • you can instrument a service with metrics, logs, and traces using stable names
  • you can look at an alert spec and say whether it is a symptom or a cause alert and why
  • you can hand a runbook to a peer and have them execute it without you on the call

Reading Policy

  • Concept pages are the main path. External official docs (OWASP, NIST, cloud providers, OpenTelemetry, Prometheus, Google SRE book) are the main escalation.
  • Book chunks (Pro Git, The Linux Command Line, Git from the Bottom Up) are loosely relevant and are referenced only where they sharpen operational habits.
  • See also (external) means "open one of these if the concept page is not enough", not "read the entire source".
  • If you open three external links and still cannot explain the concept in your own words, write down the specific blocker before continuing.

Suggested Weekly Flow

Day | Work
1 | Concepts 1-3 and one STRIDE pass on a real or real-shaped system
2 | Concepts 4-6 and the secrets/encryption workshop setup
3 | Concepts 7-9 and one image-hardening experiment
4 | Concepts 10-12 and one instrumented endpoint with metrics, logs, and a trace
5 | Concepts 13-15 and the dashboard/alert/runbook triple
6 | Practice pages 1-2 and code katas 1-3
7 | Practice pages 3-4, remaining katas, quiz, and mistake-log cleanup

Reference

Use Reference and Selective Reading only when a concept page plus one external link has not resolved the gap. This module is external-reference-heavy on purpose; it matches how you will actually work.


Rich Learning Pages

Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread


Model Artifact Calibration

For operational readiness evidence, compare your procedure to the runbook model artifact.