Semester 9 Exam
Required Output Classification
| Required output | Classification | Public/private guidance |
|---|---|---|
| Timed written answers, diagrams, code snippets, and design responses | Checkpoint evidence | Keep raw exam work private so it remains useful for assessment and retake calibration. |
| Post-exam review notes, missed-answer repairs, and Feynman explanations | Practice artifact | Use for spaced review; publish only rewritten explanations that no longer reveal exam solutions wholesale. |
| Capstone-defense or architecture-defense packets created from exam prompts | Portfolio candidate | Polish publicly only when they are original to your project, sanitized, and framed as engineering rationale rather than exam answers. |
This is a three-hour, open-book-but-not-open-chat assessment with a mix of concept recall, design scenarios, one operational incident scenario, and one security scenario. Answer in your own words; citing a doc URL is welcome, but no answer should be a pasted quote.
Instructions
- Duration: 3 hours. Split roughly: 30 min on A/B (recall), 60 min on C/D (design + operations), 45 min on E (security + observability), 30 min on F (interleaved), 15 min to review.
- Allowed aids: your project repo, the official docs for your chosen cloud provider (AWS/GCP/Azure), Kubernetes, Terraform, and GitHub Actions. No LLMs, no chat rooms, no pre-written answers.
- Every design answer must name at least one tradeoff and one failure mode it accepts.
- Grading is done against the rubric at the bottom; aim for "I can defend this against a senior engineer," not "I wrote a lot."
Section A: Cloud Platform Fundamentals
- Explain the shared-responsibility model for (a) an EC2 VM, (b) a managed Postgres database (RDS / Cloud SQL), and (c) a serverless function. For each, name two failures that are yours and two that are the provider's.
- You are designing a 3-tier application in a single VPC across three availability zones. Draw the subnet layout (public, private-app, private-data), describe how traffic reaches the app from the internet, and explain why the database endpoint should not live in a public subnet. Identify where a NAT gateway is needed and the cost implication of putting one in every AZ.
- A teammate says "roles and users are basically the same in IAM, we just pick whichever is more convenient." Explain what they have wrong, and give a concrete production scenario (CI/CD, running workload, human break-glass access) where the choice matters.
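As a starting point for the subnet-layout diagram, the sketch below carves a VPC range into one public, one private-app, and one private-data subnet per availability zone using Python's standard `ipaddress` module. The `10.0.0.0/16` range, the `/20` subnet size, the AZ names, and the `plan_subnets` helper are illustrative assumptions, not part of the exam prompt.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs,
                 tiers=("public", "private-app", "private-data"),
                 new_prefix=20):
    """Assign one subnet per (tier, AZ) pair from consecutive blocks of the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of equal-sized blocks
    layout = {}
    for tier in tiers:
        for az in azs:
            layout[(tier, az)] = next(blocks)
    return layout


# Assumed VPC range and AZ names, for illustration only.
layout = plan_subnets("10.0.0.0/16", ["az-a", "az-b", "az-c"])
for (tier, az), net in layout.items():
    print(f"{tier:<13} {az}  {net}")
```

The table this prints only fixes the address plan. Routing (which tiers get an internet gateway route, which get a NAT route, which get neither) is still yours to draw and justify.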
Section B: Infrastructure as Code
- Define Terraform state in one paragraph: what it stores, why it exists, and what breaks when two engineers run `apply` against the same state without a lock. Then describe the minimum production-safe remote state setup you would use on your chosen provider.
- A module you are reviewing takes 30 inputs, 10 of which are booleans like `enable_logging` or `create_iam_role`. Critique the interface. When is a reusable module a good idea, and when has it become an anti-pattern? Give one concrete rewrite strategy.
- `terraform plan` shows a proposed change of "replace" on a production database. Walk through your response step by step: what you check first, how you confirm whether replacement is actually intended, and what you do if it is not. Mention at least one Terraform feature that prevents the worst outcome.
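For the plan-review prompt, one way to make "replace" visible before anyone reads the full diff is to scan the machine-readable plan. The sketch below assumes the JSON produced by `terraform show -json <planfile>`, where each entry in `resource_changes` carries a `change.actions` list and a replacement appears as `delete` and `create` together. The resource addresses and the `find_replacements` helper are hypothetical; check the field names against the Terraform version you run.

```python
import json


def find_replacements(plan):
    """Return addresses of resources whose planned actions include both delete and create."""
    replaced = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc.get("change", {}).get("actions", []))
        if {"create", "delete"} <= actions:
            replaced.append(rc["address"])
    return replaced


# Hypothetical, trimmed-down plan document for illustration only.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_security_group.app", "change": {"actions": ["update"]}},
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}}
  ]
}
""")

print(find_replacements(sample_plan))
```

A check like this only detects the proposed replacement; deciding whether it is intended, and what guard stops it, remains the substance of the answer.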
Section C: Container Orchestration
- Explain what the Kubernetes control plane is and what each component does in two to three sentences (API server, scheduler, controller manager, etcd). Then describe what happens, end to end, when you run `kubectl apply -f deployment.yaml`, from the API server to a running pod on a node.
- A `Deployment` keeps restarting with `CrashLoopBackOff`. Walk through your diagnostic path using only `kubectl` and the cluster itself. Name at least five checks you would run, in order, and the kinds of root causes each rules in or out (config, image, permissions, resource limits, dependency availability).
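When practicing the diagnostic path, it can help to see which pod status fields carry the evidence. The sketch below reads the structure returned by `kubectl get pod <name> -o json` and pulls out the restart count, the current waiting reason, and the last termination details for each container. The field names follow the Pod status in the Kubernetes API; the sample pod and the `summarize_container_statuses` helper are invented for illustration.

```python
def summarize_container_statuses(pod):
    """Collect the status fields that usually guide the next diagnostic step."""
    summaries = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting", {})
        terminated = cs.get("lastState", {}).get("terminated", {})
        summaries.append({
            "container": cs.get("name"),
            "restarts": cs.get("restartCount", 0),
            "waiting_reason": waiting.get("reason"),
            "last_exit_code": terminated.get("exitCode"),
            "last_exit_reason": terminated.get("reason"),
        })
    return summaries


# Hypothetical pod document, trimmed to the fields used above.
sample_pod = {
    "metadata": {"name": "api-example"},
    "status": {
        "containerStatuses": [
            {
                "name": "api",
                "restartCount": 7,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
            }
        ]
    },
}

for summary in summarize_container_statuses(sample_pod):
    print(summary)
```

Which root-cause family each field points toward, and which `kubectl` command you would run next, is left to your written answer.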
Section D: CI/CD & Release Engineering
- Explain the four DORA metrics (deployment frequency, lead time for changes, change failure rate, mean time to restore). For each, name one concrete signal or query you could use in your project to measure it and one behavior that tends to improve the metric without actually improving delivery (i.e., a way to game it).
- Design a progressive-delivery strategy for an API change that alters a response schema. Compare canary, blue/green, and feature-flagged rollout, and justify which you would pick for this specific change. Include how you would detect a problem and how long it would take to recover.
- Your team's CI currently uses a long-lived AWS access key stored as a GitHub Actions secret. Explain, in ordered steps, how you would migrate to OIDC-based keyless auth: what infrastructure you create on the cloud side, what changes in the workflow YAML, and how you verify the old key is actually dead afterward.
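To make "one concrete signal or query" tangible for the DORA prompt, the sketch below computes the four metrics from a list of deployment records. The record shape (`committed_at`, `deployed_at`, `failed`, `restored_at`), the sample data, and the choice of a median for lead time are assumptions for illustration; your project's real source might be CI run history, release tags, or incident tickets.

```python
from datetime import datetime, timedelta
from statistics import median


def dora_metrics(deploys, window_days):
    """Compute the four DORA metrics from hypothetical deployment records."""
    if not deploys:
        raise ValueError("need at least one deployment record")
    lead_times = [d["deployed_at"] - d["committed_at"] for d in deploys]
    failures = [d for d in deploys if d["failed"]]
    restore_times = [d["restored_at"] - d["deployed_at"] for d in failures]
    mttr = sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": mttr,
    }


# Invented sample data covering a 7-day window.
t0 = datetime(2024, 1, 1, 9, 0)
deploys = [
    {"committed_at": t0, "deployed_at": t0 + timedelta(hours=2), "failed": False},
    {"committed_at": t0 + timedelta(days=1), "deployed_at": t0 + timedelta(days=1, hours=4),
     "failed": True, "restored_at": t0 + timedelta(days=1, hours=4, minutes=30)},
    {"committed_at": t0 + timedelta(days=2), "deployed_at": t0 + timedelta(days=2, hours=6),
     "failed": False},
    {"committed_at": t0 + timedelta(days=3), "deployed_at": t0 + timedelta(days=3, hours=8),
     "failed": True, "restored_at": t0 + timedelta(days=3, hours=9, minutes=30)},
]

metrics = dora_metrics(deploys, window_days=7)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Each input field is also a place where the metric can be gamed, for example by redefining what counts as a deployment or a failure, which is the second half of the prompt.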
Section E: Cloud Security & Observability
- Apply STRIDE to your project's ingress path (user -> DNS -> load balancer -> API pod). Identify at least one concrete threat per STRIDE category and the existing control or mitigation. Where you have no mitigation, say so honestly and propose the cheapest one to add.
- Define the difference between an SLI, an SLO, and an SLA. For your `api` service, write one SLI (with an exact measurement rule: numerator, denominator, time window), a corresponding SLO target, and describe the error budget and what happens when it is exhausted.
- An alert fires at 03:00: "`api` p95 latency > 2s for 10 minutes." You have dashboards, logs, and traces. Describe, in order, the five things you look at, what each one rules in or out, and what runbook entry you wish you had written before this night.
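The SLI prompt asks for an exact numerator, denominator, and window, and the error budget follows from them by arithmetic. The sketch below shows that arithmetic for a request-based SLI. The 99.9% target, the request counts, and the function names are invented for illustration; what counts as a "good" or "valid" request is the definition you still have to write.

```python
def sli(good_requests, valid_requests):
    """SLI as a ratio: good requests (numerator) over valid requests (denominator)."""
    return good_requests / valid_requests


def error_budget_remaining(slo_target, good_requests, valid_requests):
    """Fraction of the window's error budget still unspent (negative means overspent)."""
    allowed_bad = (1 - slo_target) * valid_requests
    actual_bad = valid_requests - good_requests
    return 1 - actual_bad / allowed_bad


# Assumed numbers for one 30-day window.
slo_target = 0.999
valid = 1_000_000
good = 999_400

print(f"SLI: {sli(good, valid):.4%}")
print(f"Error budget remaining: {error_budget_remaining(slo_target, good, valid):.1%}")
```

With these assumed counts, the budget allows about 1,000 bad requests and 600 have been spent, so roughly 40% of the budget remains for the rest of the window.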
Section F: Interleaved (Prior Semesters)
- [From S6 / S8] Your managed Postgres is on one primary in AZ-a. Product wants "no read downtime during an AZ failure and no data loss." Describe your options (multi-AZ standby, read replicas, regional failover), the consistency and recovery implications of each, and which you would pick and why.
- [From S7] You are standing up a second service that needs to call your `api`. Describe the context map between them (is it customer/supplier, conformist, anti-corruption layer?), the API compatibility rules you would adopt, and which ADR you would add to the project's ADR log to capture the decision.
- [From S5] Walk through what happens at the OS and network level when a pod in your cluster makes an outbound HTTPS call to a third-party API: DNS resolution, TCP connect, TLS handshake, and how the request egresses the VPC (NAT gateway, security group, route table). Identify at least two places this call can fail silently from the application's view.
Self-Grading Key
Score each section against this rubric before looking anything up a second time. You are aiming for "pass" on every section to consider the semester exam complete; sustained "needs work" on more than one section means you should not advance to Semester 10 yet.
| Section | Pass (3/3 or close) | Needs work (partial / vocabulary only) | Fail (guessing) |
|---|---|---|---|
| A: Fundamentals | Accurate boundaries, correct VPC diagram, concrete IAM scenario | Shared-responsibility mostly right but VPC diagram hand-wavy | Confuses region/AZ or public/private subnet |
| B: IaC | State explained in own words; module critique names interface smell; plan-review workflow is safety-first | Defines state but does not describe locking; module critique is vague | Treats `terraform apply` like `kubectl apply`; no drift concept |
| C: Kubernetes | Control plane clear; diagnostic path is ordered and covers 5+ layers | Names components but `kubectl apply` flow is muddled | Cannot tell `Deployment` from `Pod` or misses `kubectl describe` |
| D: CI/CD | DORA tied to real signals; progressive-delivery choice justified; OIDC migration is step-by-step | One DORA metric missing or gameable signal not named | Uses "agile" or "DevOps culture" as an answer |
| E: Security + O11y | STRIDE has one threat per category; SLI has a precise definition; incident walkthrough is evidence-driven | STRIDE is 3-4 categories; SLI defined loosely | Reaches for "we have logs" as the whole answer |
| F: Interleaved | Correctly integrates S5/S6/S7/S8 reasoning; tradeoffs explicit | One of the three interleaved prompts answered weakly | Treats the interleaved section as a disconnected trivia round |
Mastery Rubric
| Level | Evidence |
|---|---|
| Beginner pass | Can answer direct questions and complete familiar exercises with light notes. |
| Solid pass | Can solve new variants, explain choices, and connect the work to Semester 8 System Design and Technical Leadership. |
| Strong pass | Can defend tradeoffs, identify failure modes, and produce clean evidence in the portfolio artifact. |
| Not ready | Relies on copied solutions, cannot explain mistakes, or lacks durable artifacts. |
Retake and Repair Rule
If a section is weak, do not only reread. Repair it by producing new evidence: a corrected solution, a fresh implementation, a rewritten proof, a benchmark, a diagram, a runbook, or a short teaching note.
Answer-Quality Examples
Use these examples when grading written answers or spoken explanations.
| Quality | Example pattern |
|---|---|
| Weak | Names a concept but gives no example, constraint, or failure case. |
| Acceptable | Defines the concept and applies it to a familiar exercise. |
| Strong | Applies the concept to a new variant and explains why an alternative would fail. |
| Portfolio-ready | Connects the concept to Semester 8 System Design and Technical Leadership, current project evidence, and a future capstone decision. |
Interleaving Prompt
For any missed answer, add one sentence starting with: This depends on an earlier skill because...
Calibration Materials
Use these learner-visible calibration materials before self-grading or requesting review: