Semester 9 Exam

Required Output Classification

Required output	Classification	Public/private guidance
Timed written answers, diagrams, code snippets, and design responses	`Checkpoint evidence`	Keep raw exam work private so it remains useful for assessment and retake calibration.
Post-exam review notes, missed-answer repairs, and Feynman explanations	`Practice artifact`	Use for spaced review; publish only rewritten explanations that no longer reveal exam solutions wholesale.
Capstone-defense or architecture-defense packets created from exam prompts	`Portfolio candidate`	Polish publicly only when they are original to your project, sanitized, and framed as engineering rationale rather than exam answers.

This is a three-hour, open-book but not open-chat assessment that mixes concept recall, design scenarios, one operational incident scenario, and one security scenario. Answer in your own words; citing a doc URL is welcome, but no answer should be a pasted quote.

Instructions

Duration: 3 hours. Split roughly: 30 min on A/B (recall), 60 min on C/D (design + operations), 45 min on E (security + observability), 30 min on F (interleaved), 15 min to review.
Allowed aids: your project repo, the official docs for your chosen cloud provider (AWS/GCP/Azure), Kubernetes, Terraform, and GitHub Actions. No LLMs, no chat rooms, no pre-written answers.
Every design answer must name at least one tradeoff and one failure mode it accepts.
Grading is done against the rubric at the bottom; aim for "I can defend this against a senior engineer," not "I wrote a lot."

Section A: Cloud Platform Fundamentals

Explain the shared-responsibility model for (a) an EC2 VM, (b) a managed Postgres database (RDS / Cloud SQL), and (c) a serverless function. For each, name two failures that are yours and two that are the provider's.
You are designing a 3-tier application in a single VPC across three availability zones. Draw the subnet layout (public, private-app, private-data), describe how traffic reaches the app from the internet, and explain why the database endpoint should not live in a public subnet. Identify where a NAT gateway is needed and the cost implication of putting one in every AZ.
A teammate says "roles and users are basically the same in IAM, we just pick whichever is more convenient." Explain what they have wrong, and give a concrete production scenario (CI/CD, running workload, human break-glass access) where the choice matters.
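
Calibration sketch for the subnet-layout prompt above (not a model answer): one way to sanity-check a hand-drawn diagram is to carve the VPC range into non-overlapping blocks, one per tier per AZ, and confirm nothing collides. The CIDR `10.0.0.0/16`, the AZ labels, the `/20` subnet size, and the helper name `plan_subnets` are all assumptions made for illustration; use your own project's values.

```python
import ipaddress

# Hypothetical values for illustration only; substitute your own CIDR and AZ names.
VPC_CIDR = "10.0.0.0/16"
AZS = ["az-a", "az-b", "az-c"]
TIERS = ["public", "private-app", "private-data"]


def plan_subnets(vpc_cidr, azs, tiers, new_prefix=20):
    """Assign one non-overlapping subnet to every (tier, AZ) pair."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # subnets() yields consecutive blocks of the requested size, in order.
    blocks = vpc.subnets(new_prefix=new_prefix)
    layout = {}
    for tier in tiers:
        for az in azs:
            # Raises StopIteration if the VPC range is too small for the layout.
            layout[(tier, az)] = next(blocks)
    return layout


layout = plan_subnets(VPC_CIDR, AZS, TIERS)
for (tier, az), net in layout.items():
    print(f"{tier:13} {az}  {net}")
```

The sketch only allocates address space. Route tables, the internet gateway, NAT placement, and the reasoning about why the data tier stays private are still yours to explain in the written answer.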

Section B: Infrastructure as Code

Define Terraform state in one paragraph: what it stores, why it exists, and what breaks when two engineers run `terraform apply` against the same state without a lock. Then describe the minimum production-safe remote state setup you would use on your chosen provider.
A module you are reviewing takes 30 inputs, 10 of which are booleans like `enable_logging` or `create_iam_role`. Critique the interface. When is a reusable module a good idea, and when has it become an anti-pattern? Give one concrete rewrite strategy.
`terraform plan` shows a proposed change of "replace" on a production database. Walk through your response step-by-step: what you check first, how you confirm whether replacement is actually intended, and what you do if it is not. Mention at least one Terraform feature that prevents the worst outcome.
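
Calibration sketch for the plan-review prompt above (not a model answer): a saved plan can be rendered as JSON with `terraform plan -out=tfplan` followed by `terraform show -json tfplan`, and each entry under `resource_changes` carries a `change.actions` list. A replacement appears as both `delete` and `create` in that list. The helper name `risky_changes` and the sample plan below are hypothetical; the sample is hand-written for illustration and is not real Terraform output.

```python
import json


def risky_changes(plan):
    """Return (address, actions) for every planned change that deletes a resource.

    A replacement shows up as ["delete", "create"] or ["create", "delete"],
    so checking for "delete" catches both replacements and plain destroys.
    """
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append((rc["address"], actions))
    return flagged


# Hand-written sample with hypothetical resource addresses.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_instance.primary",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.logs",
     "change": {"actions": ["update"]}},
    {"address": "aws_iam_role.ci",
     "change": {"actions": ["no-op"]}}
  ]
}
""")

for address, actions in risky_changes(sample_plan):
    print(f"REVIEW BEFORE APPLY: {address} -> {actions}")
```

A check like this could gate a CI job, but it only detects the problem. Your written answer still needs to cover how you find out why the replacement was proposed and which Terraform safeguard you rely on.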

Section C: Container Orchestration

Explain what the Kubernetes control plane is, then describe what each component does (API server, scheduler, controller manager, etcd) in two to three sentences each. Then describe what happens, end to end, when you run `kubectl apply -f deployment.yaml`, from the API server accepting the request to a running pod on a node.
A Deployment keeps restarting with CrashLoopBackOff. Walk through your diagnostic path using only kubectl and the cluster itself. Name at least five checks you would run, in order, and the kinds of root causes each rules in or out (config, image, permissions, resource limits, dependency availability).

Section D: CI/CD & Release Engineering

Explain the four DORA metrics (deployment frequency, lead time for changes, change failure rate, mean time to restore). For each, name one concrete signal or query you could use in your project to measure it and one behavior that tends to improve the metric without actually improving delivery (i.e., a way to game it).
Design a progressive-delivery strategy for an API change that alters a response schema. Compare canary, blue/green, and feature-flagged rollout, and justify which you would pick for this specific change. Include how you would detect a problem and how long it would take to recover.
Your team's CI currently uses a long-lived AWS access key stored as a GitHub Actions secret. Explain, in ordered steps, how you would migrate to OIDC-based keyless auth: what infrastructure you create on the cloud side, what changes in the workflow YAML, and how you verify the old key is actually dead afterward.
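
Calibration sketch for the DORA prompt in this section (not a model answer): if your project can export one record per production deployment, the four metrics reduce to simple arithmetic over those records. The `Deployment` record shape, the field names, and the `dora_summary` helper are assumptions made for illustration; your own signals (pipeline runs, git tags, incident tickets) will look different.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record shape; adapt to whatever your pipeline exports."""
    committed_at: datetime              # first commit of the change
    deployed_at: datetime               # change reached production
    failed: bool = False                # deployment caused a production failure
    restored_at: Optional[datetime] = None  # service restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_summary(deploys, window_days):
    """Compute the four DORA metrics over a window of deployment records."""
    if not deploys:
        raise ValueError("need at least one deployment in the window")
    failures = [d for d in deploys if d.failed]
    restored = [d for d in failures if d.restored_at is not None]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_h": median(hours(d.deployed_at - d.committed_at) for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore_h": (
            mean(hours(d.restored_at - d.deployed_at) for d in restored) if restored else None
        ),
    }


# Made-up sample data covering a four-day window.
t0 = datetime(2024, 1, 1, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
deploys = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0 + day, t0 + day + 4 * hour, failed=True,
               restored_at=t0 + day + 5 * hour),
    Deployment(t0 + 2 * day, t0 + 2 * day + 6 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 8 * hour, failed=True,
               restored_at=t0 + 3 * day + 11 * hour),
]
print(dora_summary(deploys, window_days=4))
```

Note how easily each number moves without delivery improving: splitting one change into many trivial deploys raises frequency and lowers failure rate. Naming that kind of gaming is the part of the prompt the arithmetic cannot answer for you.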

Section E: Cloud Security & Observability

Apply STRIDE to your project's ingress path (user -> DNS -> load balancer -> API pod). Identify at least one concrete threat per STRIDE category and the existing control or mitigation. Where you have no mitigation, say so honestly and propose the cheapest one to add.
Define the difference between an SLI, an SLO, and an SLA. For your `api` service, write one SLI with an exact measurement rule (numerator, denominator, time window) and a corresponding SLO target, then describe the error budget and what happens when it is exhausted.
An alert fires at 03:00: "api p95 latency > 2s for 10 minutes." You have dashboards, logs, and traces. Describe, in order, the five things you look at, what each one rules in or out, and what runbook entry you wish you had written before this night.
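
Calibration sketch for the SLI/SLO prompt in this section (not a model answer): a request-based SLI is a ratio of good events to valid events over a window, and the error budget is the share of bad events the SLO target allows. The request counts, the 99.9% target, and the helper name `error_budget` are made-up values for illustration only.

```python
def error_budget(good, valid, slo_target):
    """Summarize a request-based SLI and its error budget for one window.

    good:       requests that met the SLI's success rule (numerator)
    valid:      requests that count toward the SLI (denominator)
    slo_target: target ratio, e.g. 0.999 for "99.9% of valid requests are good"
    """
    if valid <= 0:
        raise ValueError("the window must contain at least one valid request")
    allowed_bad = (1 - slo_target) * valid   # bad requests the SLO tolerates
    actual_bad = valid - good                # bad requests actually observed
    return {
        "sli": good / valid,
        "allowed_bad": allowed_bad,
        "actual_bad": actual_bad,
        # Negative means the budget is overspent for this window.
        "remaining_fraction": 1 - actual_bad / allowed_bad,
    }


# Hypothetical 30-day window: 1,000,000 valid requests, 999,400 of them good.
report = error_budget(good=999_400, valid=1_000_000, slo_target=0.999)
print(f"SLI: {report['sli']:.4%}")
print(f"Budget remaining: {report['remaining_fraction']:.1%}")
```

With these made-up numbers, a 99.9% target tolerates about 1,000 bad requests and 600 were observed, so roughly 40% of the budget remains. What your team does when that figure reaches zero is the policy question the prompt asks you to answer.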

Section F: Interleaved (Prior Semesters)

[From S6 / S8] Your managed Postgres is on one primary in AZ-a. Product wants "no read downtime during an AZ failure and no data loss." Describe your options (multi-AZ standby, read replicas, regional failover), the consistency and recovery implications of each, and which you would pick and why.
[From S7] You are standing up a second service that needs to call your api. Describe the context map between them (is it customer/supplier, conformist, anti-corruption layer?), the API compatibility rules you would adopt, and which ADR you would add to the project's ADR log to capture the decision.
[From S5] Walk through what happens at the OS and network level when a pod in your cluster makes an outbound HTTPS call to a third-party API: DNS resolution, TCP connect, TLS handshake, and how the request egresses the VPC (NAT gateway, security group, route table). Identify at least two places this call can fail silently from the application's view.

Self-Grading Key

Score each section against this rubric before looking anything up a second time. You are aiming for "pass" on every section to consider the semester exam complete; sustained "needs work" on more than one section means you should not advance to Semester 10 yet.

Section	Pass (all prompts or close)	Needs work (partial / vocabulary only)	Fail (guessing)
A: Fundamentals	Accurate boundaries, correct VPC diagram, concrete IAM scenario	Shared-responsibility mostly right but VPC diagram hand-wavy	Confuses region/AZ or public/private subnet
B: IaC	State explained in own words; module critique names interface smell; plan-review workflow is safety-first	Defines state but does not describe locking; module critique is vague	Treats `terraform apply` like `kubectl apply`; no drift concept
C: Kubernetes	Control plane clear; diagnostic path is ordered and covers 5+ layers	Names components but `kubectl apply` flow is muddled	Cannot tell `Deployment` from `Pod` or misses `kubectl describe`
D: CI/CD	DORA tied to real signals; progressive-delivery choice justified; OIDC migration is step-by-step	One DORA metric missing or gameable signal not named	Uses "agile" or "DevOps culture" as an answer
E: Security + O11y	STRIDE has one threat per category; SLI has a precise definition; incident walkthrough is evidence-driven	STRIDE is 3-4 categories; SLI defined loosely	Reaches for "we have logs" as the whole answer
F: Interleaved	Correctly integrates S5 through S8 reasoning; tradeoffs explicit	One of the three interleaved prompts answered weakly	Treats the interleaved section as a disconnected trivia round

Mastery Rubric

Level	Evidence
Beginner pass	Can answer direct questions and complete familiar exercises with light notes.
Solid pass	Can solve new variants, explain choices, and connect the work to Semester 8 System Design and Technical Leadership.
Strong pass	Can defend tradeoffs, identify failure modes, and produce clean evidence in the portfolio artifact.
Not ready	Relies on copied solutions, cannot explain mistakes, or lacks durable artifacts.

Retake and Repair Rule

If a section is weak, do not only reread. Repair it by producing new evidence: a corrected solution, a fresh implementation, a rewritten proof, a benchmark, a diagram, a runbook, or a short teaching note.

Answer-Quality Examples

Use these examples when grading written answers or spoken explanations.

Quality	Example pattern
Weak	Names a concept but gives no example, constraint, or failure case.
Acceptable	Defines the concept and applies it to a familiar exercise.
Strong	Applies the concept to a new variant and explains why an alternative would fail.
Portfolio-ready	Connects the concept to Semester 8 System Design and Technical Leadership, current project evidence, and a future capstone decision.

Interleaving Prompt

For any missed answer, add one sentence starting with: This depends on an earlier skill because...

Calibration Materials

Use these learner-visible calibration materials before self-grading or requesting review:
