Semester 9: Cloud Infrastructure & DevOps
Year 4 -- Production Engineering | Phase 9 | Weeks 84--89 | 6 weeks
Semester 9 is roadmap-visible as Blueprint in the canonical readiness matrix. Use this cloud and DevOps material as structure and planning context until content/portal/readiness-matrix.json promotes it to Learner-ready or beyond.
Goal
Deploy, operate, and secure production-grade systems on a production-shaped local stack or a tightly bounded cloud sandbox using Infrastructure as Code, container orchestration, and automated delivery pipelines -- and be able to defend every choice against cost, failure, and security scrutiny.
Prerequisites
You should enter this semester with working fluency in the system-design and leadership outcomes of Semester 8 (decomposition, reliability/performance reasoning, written alternatives, SLIs/SLOs), the architecture artifacts of Semester 7 (drivers, views, ADRs, context maps), the distributed and data tradeoffs of Semester 6 (replication, consistency, partial failure), and the networking and operating-system foundations of Semester 5 (TCP/IP, DNS, filesystems, processes). Without that grounding, the cloud primitives here collapse into vocabulary with no referents.
Phase Completion Contract
- Explain: IAM boundaries, IaC workflows, deployment safety, observability basics, and cloud tradeoffs around cost, control, and reliability.
- Build: one supported deployment track, either a local-first production-shaped system or a tightly bounded cloud sandbox. Whichever track you choose must include infrastructure-as-code, CI/CD, security posture, and observability artifacts.
- Evidence: deployment repo, pipeline definition, rollback path, security review, dashboards, runbook notes, and cost/safety evidence for the selected track.
- Do not advance if: you still cannot deploy safely, read official cloud/tooling docs directly, or explain how the system would be operated after release.
Cost and Safety Policy
Semester 9 teaches production engineering without requiring learners to accidentally buy production infrastructure. Every project, lab, and checkpoint must choose one of these supported tracks and document the choice in the project README.
Track 1: Local-first production shape
Use this track by default when the learning objective is workflow, review, safety, or observability rather than a provider-specific managed service.
- Run the application and dependencies with Docker Compose.
- Run Kubernetes work locally with kind, minikube, or k3d before any managed-cluster exercise.
- Validate Terraform locally with `terraform fmt`, `terraform validate`, plan-file review, `tflint`, Trivy/tfsec-style static scans, and mock resource diagrams when real resources are unnecessary.
- Replace paid managed services with mocks or emulators: local Postgres, localstack-style cloud APIs, fake queues, and seeded object-storage fixtures are acceptable when they preserve the architecture decision being studied.
- Use local observability tooling such as the OpenTelemetry Collector, Prometheus, Grafana, Loki/Promtail, Jaeger, or console exporters.
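The local Terraform validation step in this track can be scripted so that it runs the same way on a laptop and in CI. The sketch below is one possible arrangement, not a prescribed tool: the `LOCAL_CHECKS` ordering, the `run_local_checks` helper, and the injectable `runner` are all hypothetical names introduced for illustration. It assumes `terraform`, `tflint`, and `trivy` are installed; a missing tool is reported as a failure instead of being skipped. Plan-file review is left out because `terraform plan` may need provider credentials that a purely local run does not have.

```python
import shutil
import subprocess
from typing import Callable, List, Sequence

# Hypothetical ordering of the local checks; adjust to your repository.
# `terraform init -backend=false` installs providers without touching
# remote state, which `terraform validate` needs before it can run.
LOCAL_CHECKS: List[List[str]] = [
    ["terraform", "fmt", "-check", "-recursive"],
    ["terraform", "init", "-backend=false"],
    ["terraform", "validate"],
    ["tflint"],
    ["trivy", "config", "."],
]


def _subprocess_runner(cmd: Sequence[str], workdir: str) -> int:
    """Run one command in workdir and return its exit code."""
    if shutil.which(cmd[0]) is None:
        return 127  # conventional "command not found" exit code
    return subprocess.run(list(cmd), cwd=workdir).returncode


def run_local_checks(
    workdir: str,
    runner: Callable[[Sequence[str], str], int] = _subprocess_runner,
) -> List[str]:
    """Run every check and return the commands that exited non-zero."""
    failed = []
    for cmd in LOCAL_CHECKS:
        if runner(cmd, workdir) != 0:
            failed.append(" ".join(cmd))
    return failed


if __name__ == "__main__":
    failures = run_local_checks(".")
    for command in failures:
        print(f"FAILED: {command}")
    raise SystemExit(1 if failures else 0)
```

Passing the runner as a parameter keeps the gate testable without the real tools installed: a fake runner can simulate a failing `tflint` and confirm that the script reports it.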
Track 2: Cloud sandbox
Use this track only when the learning objective requires real cloud control planes, IAM, managed Kubernetes, managed databases, or provider billing signals.
- Create a strict budget ceiling before provisioning anything. The semester default is ≤ $50 total for the six-week project unless an instructor explicitly lowers it.
- Configure billing alerts before the first real cloud deployment; no alert means no deploy.
- Use least-privilege IAM, scoped OIDC roles, and short-lived credentials. Static access keys are not acceptable for CI.
- Prefer short-lived resources, small node counts, low-retention logs, and dev/test SKUs. Tear resources down at the end of every lab session unless the page explicitly says otherwise.
- Maintain a teardown checklist covering Terraform destroys, clusters, load balancers, NAT gateways, databases/snapshots, unattached volumes, container registries, log retention, and orphaned IPs.
- Do not create long-lived paid resources for convenience. Anything paid that survives overnight must have a written reason, owner, expiration date, and alert coverage.
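Two of the rules above are mechanical enough to check in code before a deploy: "no alert means no deploy" together with the budget ceiling, and the four fields required for any paid resource that survives overnight. The following is a minimal sketch under stated assumptions. The `PaidResource` record, `deploy_allowed`, and `overnight_violations` are hypothetical names, and the spend figures are assumed to be typed in by the learner from the provider's billing page; nothing here calls a real billing API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

BUDGET_CEILING_USD = 50.0  # semester default from the policy above


@dataclass
class PaidResource:
    """Hypothetical record for one paid resource left running overnight."""
    name: str
    reason: Optional[str] = None
    owner: Optional[str] = None
    expires: Optional[date] = None
    alert_covered: bool = False


def deploy_allowed(
    spent_usd: float,
    planned_usd: float,
    billing_alert_configured: bool,
    ceiling_usd: float = BUDGET_CEILING_USD,
) -> bool:
    """No alert means no deploy; otherwise stay within the ceiling."""
    if not billing_alert_configured:
        return False
    return spent_usd + planned_usd <= ceiling_usd


def overnight_violations(resources: List[PaidResource], today: date) -> List[str]:
    """Describe every resource that lacks a required field or has expired."""
    problems = []
    for r in resources:
        missing = []
        if not r.reason:
            missing.append("reason")
        if not r.owner:
            missing.append("owner")
        if r.expires is None:
            missing.append("expiration date")
        if not r.alert_covered:
            missing.append("alert coverage")
        if missing:
            problems.append(f"{r.name}: missing {', '.join(missing)}")
        elif r.expires < today:
            problems.append(f"{r.name}: expired on {r.expires.isoformat()}")
    return problems
```

A check like this could run at the end of each lab session alongside the teardown checklist, so that a forgotten NAT gateway or database shows up as a named violation instead of a surprise on the bill.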
Modules
| # | Module | Focus |
|---|---|---|
| 1 | Cloud Platform Fundamentals | Shared responsibility, regions/AZs, compute/networking/storage primitives, IAM, and multi-account structure |
| 2 | Infrastructure as Code | Declarative infrastructure with Terraform, state and drift, modules, and review-driven change |
| 3 | Container Orchestration | Containers, Kubernetes control plane and workloads, networking, RBAC, and operating a cluster |
| 4 | CI/CD Pipelines & Release Engineering | Trunk-based development, DORA metrics, progressive delivery, feature flags, and quality/secrets gates |
| 5 | Cloud Security & Observability | STRIDE threat modeling, defense in depth, encryption and secrets, metrics/logs/traces, alerts and runbooks |
Core Resources
| Book | Role |
|---|---|
| The DevOps Handbook (Kim et al.) | Primary reference for delivery culture, flow/feedback/learning loops, and CI/CD practice |
| Kubernetes in Action (Marko Lukša) | Depth reference for Kubernetes objects, workloads, networking, and cluster operation |
| Terraform: Up & Running (Brikman) | Practical Terraform: modules, state, environments, and team workflows |
| Building Secure and Reliable Systems (Google) | SRE-grade treatment of security and reliability as coupled concerns, including IAM, supply chain, and incident response |
| Software Engineering at Google (Winters et al.) | Engineering practice at scale: CI, release, testing culture, and long-lived systems |
Non-Technical Parallel Reading
Optional. The Phoenix Project is the recommended narrative companion; it turns the DevOps mindset into a story you can argue with.
| Book | Theme |
|---|---|
| The Phoenix Project (Kim, Behr, Spafford) | Operations, flow, and the cost of undone work told as a novel |
Cross-Cutting Tracks Active This Semester
| Track | Level | Focus This Semester |
|---|---|---|
| A: Testing | L5 | Test strategy across pipeline environments, contract tests between services, and non-functional checks (load, chaos, security scans) gating deployment |
| B: Git / CI/CD | L5 | Trunk-based branching with short-lived branches, review culture enforced in CI, and fully automated deploys to production with reversible change |
| C: Security | L5 | Cloud IAM as the primary control plane, least privilege by default, OIDC-based keyless CI, secrets management, and supply-chain basics (SBOM, signed builds) |
| D: Observability | L4 | Cloud-native metrics, structured logs, and distributed traces tied to SLIs/SLOs, with dashboards and alerts that route to a documented runbook |
| E: Engineering Fundamentals | L5 | Production debugging, official-docs-first workflow, and operational writing such as runbooks and troubleshooting notes |
Weekly Arc
| Week | Focus | Modules |
|---|---|---|
| 84 | Cloud platform foundations: shared responsibility, VPC, IAM, and account structure | Module 1 + project scaffolding (empty Terraform repo, local-first or cloud-sandbox decision, budget/alert plan if cloud is used) |
| 85 | Terraform end-to-end: providers, state, modules, and environment layouts | Module 2 + local validation/plan review first; apply only inside the selected track |
| 86 | Kubernetes control plane and workloads, networking, and RBAC | Module 3 + prove manifests on kind/minikube/k3d before any managed cluster |
| 87 | CI/CD pipeline, trunk-based flow, progressive delivery, and OIDC-based deploys | Module 4 + wire GitHub Actions to deploy the project via Terraform/kubectl |
| 88 | Threat modeling, secrets, and the three observability pillars with SLOs | Module 5 + local OpenTelemetry/logging/threat-model evidence before paid observability services |
| 89 | Integration, checkpoint, and exam | Project polish, cumulative review, checkpoint gate, semester exam |
Spaced Repetition Schedule
Drive one new deck per module, and keep prior decks warm. Prior-semester reviews focus on the material most load-bearing for production: S8 system-design decks, S7 architecture/ADR decks, and S5 networking decks (DNS, TCP/TLS, routing), which pay for themselves the moment you touch a cloud VPC.
| Week | New Deck | Review Decks |
|---|---|---|
| 84 | S9M1 -- Cloud Platform Fundamentals | S8 system-design decks; S5 networking decks |
| 85 | S9M2 -- Infrastructure as Code | S9M1; S7 architecture/ADR decks |
| 86 | S9M3 -- Container Orchestration | S9M1-M2; S5 networking + processes/filesystems |
| 87 | S9M4 -- CI/CD & Release Engineering | S9M2-M3; S8 reliability/SLO decks |
| 88 | S9M5 -- Cloud Security & Observability | S9M1, S9M3-M4; S7 context-map/boundary decks |
| 89 | Cumulative S9 review | All S9 decks + rolling prior-semester mix |
Weekly Learning Journal Schedule
Use the template at _templates/weekly-journal.md every week. Specific reflection prompts for this semester:
- What failed in a pipeline, a `terraform apply`, or a deploy this week, and what did you change in the code, the process, or the review to keep it from recurring?
- Name one security control you verified in the cloud this week (an IAM policy you tightened, a secret you rotated, a public endpoint you closed) and the evidence that proves it holds.
- Where is your service still blind? Describe one observability gap (missing metric, missing trace span, missing alert, or missing runbook entry) and what you would add with one more day of budget.
Semester Deliverables
- All module quizzes completed
- All code katas completed
- All Feynman notes written
- All spaced repetition decks created
- Semester project completed with selected track, budget guardrails, and teardown evidence
- Checkpoint gate passed
- Cumulative review completed
- Semester exam completed
Capstone Throughline
Every semester must leave behind evidence that can survive into the final capstone defense.
- Artifact carried forward: deployment, runbook, and operational evidence.
- What to preserve: IaC, CI/CD records, deployment proof, runbooks, telemetry snapshots, and incident-style operational analysis.
- Module threads: Module 1: Cloud Platform Fundamentals, Module 2: Infrastructure as Code, Module 3: Container Orchestration, Module 4: CI/CD Pipelines & Release Engineering, and Module 5: Cloud Security & Observability.
- Defense prompt: In Semester 10, explain how this semester's artifact changed a capstone decision, reduced a risk, or made the final system easier to defend.
Model Artifact Calibration
Use the Terraform/IaC change review model artifact and the runbook model artifact to calibrate production-readiness evidence.