Semester 9: Cloud Infrastructure & DevOps

Year 4 -- Production Engineering | Phase 9 | Weeks 84--89 | 6 weeks

Curriculum Readiness: Blueprint

Semester 9 appears on the roadmap at Blueprint readiness in the canonical readiness matrix. Treat this cloud and DevOps material as structure and planning context until content/portal/readiness-matrix.json promotes it to Learner-ready or beyond.


Goal

Deploy, operate, and secure production-grade systems on a production-shaped local stack or a tightly bounded cloud sandbox using Infrastructure as Code, container orchestration, and automated delivery pipelines -- and be able to defend every choice against cost, failure, and security scrutiny.

Prerequisites

You should enter this semester with working fluency in the system-design and leadership outcomes of Semester 8 (decomposition, reliability/performance reasoning, written alternatives, SLIs/SLOs), the architecture artifacts of Semester 7 (drivers, views, ADRs, context maps), the distributed and data tradeoffs of Semester 6 (replication, consistency, partial failure), and the networking and operating-system foundations of Semester 5 (TCP/IP, DNS, filesystems, processes). Without that grounding, the cloud primitives here collapse into vocabulary with no referents.


Phase Completion Contract

  • Explain: IAM boundaries, IaC workflows, deployment safety, observability basics, and cloud tradeoffs around cost, control, and reliability.
  • Build: one supported deployment track, either a local-first production-shaped system or a tightly bounded cloud sandbox. Whichever track you choose must include infrastructure-as-code, CI/CD, security posture, and observability artifacts.
  • Evidence: deployment repo, pipeline definition, rollback path, security review, dashboards, runbook notes, and cost/safety evidence for the selected track.
  • Do not advance if: you still cannot deploy safely, read official cloud/tooling docs directly, or explain how the system would be operated after release.

Cost and Safety Policy

Semester 9 teaches production engineering without requiring learners to accidentally buy production infrastructure. Every project, lab, and checkpoint must choose one of these supported tracks and document the choice in the project README.

Track 1: Local-first production shape

Use this track by default when the learning objective is workflow, review, safety, or observability rather than a provider-specific managed service.

  • Run the application and dependencies with Docker Compose.
  • Run Kubernetes work locally with kind, minikube, or k3d before any managed-cluster exercise.
  • Validate Terraform locally with terraform fmt, terraform validate, plan-file review, tflint, Trivy/tfsec-style static scans, and mock resource diagrams when real resources are unnecessary.
  • Replace paid managed services with mocks or emulators: local Postgres, LocalStack-style cloud API emulators, fake queues, and seeded object-storage fixtures are acceptable when they preserve the architecture decision being studied.
  • Use local observability tooling such as the OpenTelemetry Collector, Prometheus, Grafana, Loki/Promtail, Jaeger, or console exporters.
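The plan-file review step in this track can be partly mechanized. The sketch below is a minimal illustration, assuming the plan was saved with terraform plan -out=tfplan and exported with terraform show -json tfplan, whose JSON output lists proposed changes under resource_changes with a change.actions array. The function names, the review rule, and the sample resource addresses are hypothetical and not part of the course material.

```python
import json


def summarize_plan(plan: dict) -> dict:
    """Group resource addresses by the kind of change a Terraform plan proposes."""
    summary = {"create": [], "update": [], "replace": [], "destroy": []}
    for rc in plan.get("resource_changes", []):
        actions = set(rc["change"]["actions"])
        if {"create", "delete"} <= actions:
            kind = "replace"   # delete + create together means the resource is replaced
        elif "delete" in actions:
            kind = "destroy"
        elif "create" in actions:
            kind = "create"
        elif "update" in actions:
            kind = "update"
        else:
            continue           # "no-op" and "read" entries change nothing
        summary[kind].append(rc["address"])
    return summary


def needs_human_review(summary: dict) -> bool:
    """Hypothetical rule: anything destroyed or replaced needs a second reviewer."""
    return bool(summary["destroy"] or summary["replace"])


# Hypothetical sample shaped like the output of `terraform show -json tfplan`.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs",   "change": {"actions": ["create"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_role.ci",      "change": {"actions": ["no-op"]}}
  ]
}
""")

if __name__ == "__main__":
    result = summarize_plan(SAMPLE_PLAN)
    print(result)
    print("human review required:", needs_human_review(result))
```

A summary like this does not replace reading the plan. It only makes destructive changes hard to miss during review.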

Track 2: Cloud sandbox

Use this track only when the learning objective requires real cloud control planes, IAM, managed Kubernetes, managed databases, or provider billing signals.

  • Create a strict budget ceiling before provisioning anything. The semester default is ≤ $50 total for the six-week project unless an instructor explicitly lowers it.
  • Configure billing alerts before the first real cloud deployment; no alert means no deploy.
  • Use least-privilege IAM, scoped OIDC roles, and short-lived credentials. Static access keys are not acceptable for CI.
  • Prefer short-lived resources, small node counts, low-retention logs, and dev/test SKUs. Tear resources down at the end of every lab session unless the page explicitly says otherwise.
  • Maintain a teardown checklist covering Terraform destroys, clusters, load balancers, NAT gateways, databases/snapshots, unattached volumes, container registries, log retention, and orphaned IPs.
  • Do not create long-lived paid resources for convenience. Anything paid that survives overnight must have a written reason, owner, expiration date, and alert coverage.
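Two of the rules above are simple enough to check in code: "no alert means no deploy" under the budget ceiling, and the four fields required for any paid resource that survives overnight. The following is a minimal sketch under stated assumptions. The PaidResource record, the function names, and the choice to treat an expiration date that has already passed as missing are all hypothetical; only the $50 default and the four required fields come from the policy.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional

BUDGET_CEILING_USD = 50.0  # semester default ceiling from the policy


@dataclass
class PaidResource:
    """Hypothetical inventory record for a paid resource left running overnight."""
    name: str
    reason: str = ""
    owner: str = ""
    expires: Optional[date] = None
    alert_covered: bool = False


def may_deploy(billing_alert_configured: bool,
               projected_total_usd: float,
               ceiling_usd: float = BUDGET_CEILING_USD) -> bool:
    """No billing alert means no deploy; projected spend must fit under the ceiling."""
    return billing_alert_configured and projected_total_usd <= ceiling_usd


def overnight_violations(resources: List[PaidResource],
                         today: date) -> Dict[str, List[str]]:
    """Map each non-compliant resource name to the policy fields it is missing."""
    problems: Dict[str, List[str]] = {}
    for r in resources:
        missing = []
        if not r.reason.strip():
            missing.append("reason")
        if not r.owner.strip():
            missing.append("owner")
        # Assumption: an expiration date in the past counts as missing.
        if r.expires is None or r.expires < today:
            missing.append("expiration date")
        if not r.alert_covered:
            missing.append("alert coverage")
        if missing:
            problems[r.name] = missing
    return problems
```

For example, a record created as PaidResource("nat-gw") with no other fields would be reported as missing all four items, while a fully documented resource would not appear in the result at all.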

Modules

# | Module | Focus
1 | Cloud Platform Fundamentals | Shared responsibility, regions/AZs, compute/networking/storage primitives, IAM, and multi-account structure
2 | Infrastructure as Code | Declarative infrastructure with Terraform, state and drift, modules, and review-driven change
3 | Container Orchestration | Containers, Kubernetes control plane and workloads, networking, RBAC, and operating a cluster
4 | CI/CD Pipelines & Release Engineering | Trunk-based development, DORA metrics, progressive delivery, feature flags, and quality/secrets gates
5 | Cloud Security & Observability | STRIDE threat modeling, defense in depth, encryption and secrets, metrics/logs/traces, alerts and runbooks

Core Resources

Book | Role
The DevOps Handbook (Kim et al.) | Primary reference for delivery culture, flow/feedback/learning loops, and CI/CD practice
Kubernetes in Action (Marko Lukša) | Depth reference for Kubernetes objects, workloads, networking, and cluster operation
Terraform: Up & Running (Brikman) | Practical Terraform: modules, state, environments, and team workflows
Building Secure and Reliable Systems (Google) | SRE-grade treatment of security and reliability as coupled concerns, including IAM, supply chain, and incident response
Software Engineering at Google (Winters et al.) | Engineering practice at scale: CI, release, testing culture, and long-lived systems

Non-Technical Parallel Reading

Optional. The Phoenix Project is the recommended narrative companion; it turns the DevOps mindset into a story you can argue with.

Book | Theme
The Phoenix Project (Kim, Behr, Spafford) | Operations, flow, and the cost of undone work told as a novel

Cross-Cutting Tracks Active This Semester

Track | Level | Focus This Semester
A: Testing | L5 | Test strategy across pipeline environments, contract tests between services, and non-functional checks (load, chaos, security scans) gating deployment
B: Git / CI/CD | L5 | Trunk-based branching with short-lived branches, review culture enforced in CI, and fully automated deploys to production with reversible change
E: Engineering Fundamentals | L5 | Production debugging, official-docs-first workflow, and operational writing such as runbooks and troubleshooting notes
C: Security | L5 | Cloud IAM as the primary control plane, least privilege by default, OIDC-based keyless CI, secrets management, and supply-chain basics (SBOM, signed builds)
D: Observability | L4 | Cloud-native metrics, structured logs, and distributed traces tied to SLIs/SLOs, with dashboards and alerts that route to a documented runbook
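
The Observability track's requirement that alerts route to a documented runbook can be checked mechanically. This is a rough sketch, assuming alerts are kept as simple records and runbooks are identified by file path. The Alert type, its fields, and the example paths in the usage note are hypothetical illustrations, not a format the course prescribes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record: which SLO it protects and where its runbook lives."""
    name: str
    slo: str
    runbook: str = ""  # path to the runbook entry; empty means none was written


def unrouted_alerts(alerts: Iterable[Alert], known_runbooks: Set[str]) -> List[str]:
    """Return names of alerts whose runbook is missing or not in the known set."""
    return [
        a.name
        for a in alerts
        if not a.runbook or a.runbook not in known_runbooks
    ]
```

Called with one alert pointing at "runbooks/latency.md" and another with no runbook, and a known set containing only that path, the function would return just the second alert's name. A check like this could run in CI so that a new alert cannot merge without its runbook entry.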

Weekly Arc

Week | Focus | Modules
84 | Cloud platform foundations: shared responsibility, VPC, IAM, and account structure | Module 1 + project scaffolding (empty Terraform repo, local-first or cloud-sandbox decision, budget/alert plan if cloud is used)
85 | Terraform end-to-end: providers, state, modules, and environment layouts | Module 2 + local validation/plan review first; apply only inside the selected track
86 | Kubernetes control plane and workloads, networking, and RBAC | Module 3 + prove manifests on kind/minikube/k3d before any managed cluster
87 | CI/CD pipeline, trunk-based flow, progressive delivery, and OIDC-based deploys | Module 4 + wire GitHub Actions to deploy the project via Terraform/kubectl
88 | Threat modeling, secrets, and the three observability pillars with SLOs | Module 5 + local OpenTelemetry/logging/threat-model evidence before paid observability services
89 | Integration, checkpoint, and exam | Project polish, cumulative review, checkpoint gate, semester exam

Spaced Repetition Schedule

Drive one new deck per module, and keep prior decks warm. Prior-semester reviews focus on the material most load-bearing for production: S8 system-design decks, S7 architecture/ADR decks, and S5 networking decks (DNS, TCP/TLS, routing), which pay for themselves the moment you touch a cloud VPC.

Week | New Deck | Review Decks
84 | S9M1 -- Cloud Platform Fundamentals | S8 system-design decks; S5 networking decks
85 | S9M2 -- Infrastructure as Code | S9M1; S7 architecture/ADR decks
86 | S9M3 -- Container Orchestration | S9M1-M2; S5 networking + processes/filesystems
87 | S9M4 -- CI/CD & Release Engineering | S9M2-M3; S8 reliability/SLO decks
88 | S9M5 -- Cloud Security & Observability | S9M1, S9M3-M4; S7 context-map/boundary decks
89 | Cumulative S9 review | All S9 decks + rolling prior-semester mix

Weekly Learning Journal Schedule

Use the template at _templates/weekly-journal.md every week. Specific reflection prompts for this semester:

  1. What failed in a pipeline, a terraform apply, or a deploy this week, and what did you change in the code, the process, or the review to keep it from recurring?
  2. Name one security control you verified in the cloud this week (an IAM policy you tightened, a secret you rotated, a public endpoint you closed) and the evidence that proves it holds.
  3. Where is your service still blind? Describe one observability gap (missing metric, missing trace span, missing alert, or missing runbook entry) and what you would add with one more day of budget.

Semester Deliverables


Capstone Throughline

Every semester must leave behind evidence that can survive into the final capstone defense.


Model Artifact Calibration

Use the Terraform/IaC change review model artifact and the runbook model artifact to calibrate production-readiness evidence.


Enrichment Pages

Portfolio Artifact | Common Failure Modes | Bridge Review