
Semester 9 Exam

Required Output Classification

Required output | Classification | Public/private guidance
Timed written answers, diagrams, code snippets, and design responses | Checkpoint evidence | Keep raw exam work private so it remains useful for assessment and retake calibration.
Post-exam review notes, missed-answer repairs, and Feynman explanations | Practice artifact | Use for spaced review; publish only rewritten explanations that no longer reveal exam solutions wholesale.
Capstone-defense or architecture-defense packets created from exam prompts | Portfolio candidate | Polish publicly only when they are original to your project, sanitized, and framed as engineering rationale rather than exam answers.

Three-hour, open-book-but-not-open-chat assessment with a mix of concept recall, design scenarios, one operational incident scenario, and one security scenario. Answer in your own words; citing a doc URL is welcome but no answer should be a pasted quote.


Instructions

  • Duration: 3 hours. Split roughly: 30 min on A/B (recall), 60 min on C/D (design + operations), 45 min on E (security + observability), 30 min on F (interleaved), 15 min to review.
  • Allowed aids: your project repo, the official docs for your chosen cloud provider (AWS/GCP/Azure), Kubernetes, Terraform, and GitHub Actions. No LLMs, no chat rooms, no pre-written answers.
  • Every design answer must name at least one tradeoff and one failure mode it accepts.
  • Grading is done against the rubric at the bottom; aim for "I can defend this against a senior engineer," not "I wrote a lot."

Section A: Cloud Platform Fundamentals

  1. Explain the shared-responsibility model for (a) an EC2 VM, (b) a managed Postgres database (RDS / Cloud SQL), and (c) a serverless function. For each, name two failures that are yours and two that are the provider's.
  2. You are designing a 3-tier application in a single VPC across three availability zones. Draw the subnet layout (public, private-app, private-data), describe how traffic reaches the app from the internet, and explain why the database endpoint should not live in a public subnet. Identify where a NAT gateway is needed and the cost implication of putting one in every AZ.
  3. A teammate says "roles and users are basically the same in IAM, we just pick whichever is more convenient." Explain what they have wrong, and give a concrete production scenario (CI/CD, running workload, human break-glass access) where the choice matters.
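For the subnet-layout prompt in question 2, one way to sanity-check a diagram before drawing it is to carve the address space programmatically. The sketch below is illustrative only: the VPC CIDR `10.0.0.0/16`, the `/20` subnet size, the AZ labels, and the helper name `plan_subnets` are all assumptions, not values the exam prescribes.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, tiers, new_prefix):
    """Assign one equal-size subnet to every (tier, AZ) pair inside the VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of non-overlapping blocks
    layout = {}
    for tier in tiers:
        for az in azs:
            # Raises StopIteration if the VPC is too small for the requested layout.
            layout[(tier, az)] = next(blocks)
    return layout


# Hypothetical inputs: three tiers across three availability zones.
layout = plan_subnets(
    "10.0.0.0/16",
    ["az-a", "az-b", "az-c"],
    ["public", "private-app", "private-data"],
    20,
)

for (tier, az), net in layout.items():
    print(f"{tier:13} {az}  {net}")
```

With these assumed inputs the nine subnets use 9 of the 16 available `/20` blocks, which leaves room to add a tier later without renumbering. The sketch covers addressing only; routing, NAT placement, and cost still need to be argued in the written answer.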

Section B: Infrastructure as Code

  1. Define Terraform state in one paragraph: what it stores, why it exists, and what breaks when two engineers run apply against the same state without a lock. Then describe the minimum production-safe remote state setup you would use on your chosen provider.
  2. A module you are reviewing takes 30 inputs, 10 of which are booleans like enable_logging or create_iam_role. Critique the interface. When is a reusable module a good idea, and when has it become an anti-pattern? Give one concrete rewrite strategy.
  3. terraform plan shows a proposed change of "replace" on a production database. Walk through your response step-by-step: what you check first, how you confirm whether replacement is actually intended, and what you do if it is not. Mention at least one Terraform feature that prevents the worst outcome.
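For the plan-review prompt in question 3, the "what you check first" step can be partly automated. The sketch below assumes the plan was saved with `terraform plan -out=tfplan` and rendered to JSON with `terraform show -json tfplan`. It flags any resource whose planned actions include both a delete and a create. The resource addresses in the sample document and the helper name `find_replacements` are hypothetical.

```python
import json


def find_replacements(plan):
    """Return addresses of resources the plan would destroy and recreate."""
    hits = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement appears as both "delete" and "create", in either order.
        if "delete" in actions and "create" in actions:
            hits.append(rc["address"])
    return hits


# Hypothetical, heavily trimmed plan document used only for illustration.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_instance.main",   "change": {"actions": ["delete", "create"]}},
    {"address": "aws_security_group.api", "change": {"actions": ["update"]}},
    {"address": "aws_s3_bucket.logs",     "change": {"actions": ["no-op"]}}
  ]
}
""")

print(find_replacements(sample_plan))
```

A check like this could run in CI and fail the pipeline when a stateful resource appears in the result. It supplements a human reading of the plan and does not replace one.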

Section C: Container Orchestration

  1. Explain what the Kubernetes control plane is and what each component does in two to three sentences (API server, scheduler, controller manager, etcd). Then describe what happens, end to end, when you run kubectl apply -f deployment.yaml -- from the API server to a running pod on a node.
  2. A Deployment keeps restarting with CrashLoopBackOff. Walk through your diagnostic path using only kubectl and the cluster itself. Name at least five checks you would run, in order, and the kinds of root causes each rules in or out (config, image, permissions, resource limits, dependency availability).

Section D: CI/CD & Release Engineering

  1. Explain the four DORA metrics (deployment frequency, lead time for changes, change failure rate, mean time to restore). For each, name one concrete signal or query you could use in your project to measure it and one behavior that tends to improve the metric without actually improving delivery (i.e., a way to game it).
  2. Design a progressive-delivery strategy for an API change that alters a response schema. Compare canary, blue/green, and feature-flagged rollout, and justify which you would pick for this specific change. Include how you would detect a problem and how long it would take to recover.
  3. Your team's CI currently uses a long-lived AWS access key stored as a GitHub Actions secret. Explain, in ordered steps, how you would migrate to OIDC-based keyless auth: what infrastructure you create on the cloud side, what changes in the workflow YAML, and how you verify the old key is actually dead afterward.
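For the DORA prompt in question 1, it can help to see the four metrics as plain arithmetic over deployment records. The sketch below is one possible formulation, not an official definition. The `Deploy` record, its field names, and the sample timestamps are hypothetical. In a real project the data might come from deploy logs and an incident tracker. The function assumes at least one deployment in the window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed_at: datetime               # when the change was committed
    deployed_at: datetime                # when it reached production
    failed: bool = False                 # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora(deploys, window_days):
    """Compute the four DORA metrics over a non-empty list of deploys."""
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }


# Hypothetical week of deployments.
t0 = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deploy(t0, t0 + timedelta(hours=2)),
    Deploy(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4),
           failed=True, restored_at=t0 + timedelta(days=1, hours=4, minutes=30)),
    Deploy(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6)),
    Deploy(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=8)),
]

metrics = dora(deploys, window_days=7)
print(metrics)
```

Writing the metrics this way also makes the gaming question concrete: each formula has an input that can be inflated or hidden without delivery actually improving.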

Section E: Cloud Security & Observability

  1. Apply STRIDE to your project's ingress path (user -> DNS -> load balancer -> API pod). Identify at least one concrete threat per STRIDE category and the existing control or mitigation. Where you have no mitigation, say so honestly and propose the cheapest one to add.
  2. Define the difference between an SLI, an SLO, and an SLA. For your api service, write one SLI (with an exact measurement rule -- numerator, denominator, time window), a corresponding SLO target, and describe the error budget and what happens when it is exhausted.
  3. An alert fires at 03:00: "api p95 latency > 2s for 10 minutes." You have dashboards, logs, and traces. Describe, in order, the five things you look at, what each one rules in or out, and what runbook entry you wish you had written before this night.
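For the SLI prompt in question 2, the "exact measurement rule" amounts to a ratio plus a budget calculation. The sketch below is a minimal illustration under assumed numbers: a request-based SLI, a 99.9% SLO, and a 28-day window with 1,000,000 requests. The function names and the definition of a "good" request are hypothetical. A real answer should state its own numerator, denominator, and window.

```python
def sli(good, total):
    """Fraction of requests in the window that counted as good."""
    return good / total


def allowed_bad(slo_target, total):
    """Error budget: how many bad requests the SLO tolerates in the window."""
    return (1 - slo_target) * total


def budget_remaining(slo_target, good, total):
    """Fraction of the error budget still unspent (negative once exhausted)."""
    allowed = allowed_bad(slo_target, total)
    spent = total - good
    return (allowed - spent) / allowed


# Hypothetical 28-day window: "good" is assumed to mean a non-5xx response
# served in under 500 ms.
total_requests = 1_000_000
good_requests = 999_400
slo = 0.999

print(f"SLI:              {sli(good_requests, total_requests):.2%}")
print(f"Allowed bad:      {allowed_bad(slo, total_requests):.0f}")
print(f"Budget remaining: {budget_remaining(slo, good_requests, total_requests):.1%}")
```

With these assumed numbers, 600 of roughly 1,000 tolerated bad requests have been spent. A negative remaining budget would be the signal for whatever exhaustion policy the written answer describes.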

Section F: Interleaved (Prior Semesters)

  1. [From S6 / S8] Your managed Postgres is on one primary in AZ-a. Product wants "no read downtime during an AZ failure and no data loss." Describe your options (multi-AZ standby, read replicas, regional failover), the consistency and recovery implications of each, and which you would pick and why.
  2. [From S7] You are standing up a second service that needs to call your api. Describe the context map between them (is it customer/supplier, conformist, anti-corruption layer?), the API compatibility rules you would adopt, and which ADR you would add to the project's ADR log to capture the decision.
  3. [From S5] Walk through what happens at the OS and network level when a pod in your cluster makes an outbound HTTPS call to a third-party API: DNS resolution, TCP connect, TLS handshake, and how the request egresses the VPC (NAT gateway, security group, route table). Identify at least two places this call can fail silently from the application's view.

Self-Grading Key

Score each section against this rubric before looking anything up a second time. You are aiming for "pass" on every section to consider the semester exam complete; sustained "needs work" on more than one section means you should not advance to Semester 10 yet.

Section | Pass (3/3 or close) | Needs work (partial / vocabulary only) | Fail (guessing)
A: Fundamentals | Accurate boundaries, correct VPC diagram, concrete IAM scenario | Shared-responsibility mostly right but VPC diagram hand-wavy | Confuses region/AZ or public/private subnet
B: IaC | State explained in own words; module critique names interface smell; plan-review workflow is safety-first | Defines state but does not describe locking; module critique is vague | Treats terraform apply like kubectl apply; no drift concept
C: Kubernetes | Control plane clear; diagnostic path is ordered and covers 5+ layers | Names components but kubectl apply flow is muddled | Cannot tell Deployment from Pod or misses kubectl describe
D: CI/CD | DORA tied to real signals; progressive-delivery choice justified; OIDC migration is step-by-step | One DORA metric missing or gameable signal not named | Uses "agile" or "DevOps culture" as an answer
E: Security + O11y | STRIDE has one threat per category; SLI has a precise definition; incident walkthrough is evidence-driven | STRIDE is 3-4 categories; SLI defined loosely | Reaches for "we have logs" as the whole answer
F: Interleaved | Correctly integrates S5/S6/S7 reasoning; tradeoffs explicit | One of the three interleaved prompts answered weakly | Treats the interleaved section as a disconnected trivia round

Mastery Rubric

Level | Evidence
Beginner pass | Can answer direct questions and complete familiar exercises with light notes.
Solid pass | Can solve new variants, explain choices, and connect the work to Semester 8 System Design and Technical Leadership.
Strong pass | Can defend tradeoffs, identify failure modes, and produce clean evidence in the portfolio artifact.
Not ready | Relies on copied solutions, cannot explain mistakes, or lacks durable artifacts.

Retake and Repair Rule

If a section is weak, rereading alone is not enough. Repair it by producing new evidence: a corrected solution, a fresh implementation, a rewritten proof, a benchmark, a diagram, a runbook, or a short teaching note.


Answer-Quality Examples

Use these examples when grading written answers or spoken explanations.

Quality | Example pattern
Weak | Names a concept but gives no example, constraint, or failure case.
Acceptable | Defines the concept and applies it to a familiar exercise.
Strong | Applies the concept to a new variant and explains why an alternative would fail.
Portfolio-ready | Connects the concept to Semester 8 System Design and Technical Leadership, current project evidence, and a future capstone decision.

Interleaving Prompt

For any missed answer, add one sentence starting with: "This depends on an earlier skill because..."

Calibration Materials

Use these learner-visible calibration materials before self-grading or requesting review: