Module Quiz
Scenario-heavy and judgment-based. Each question has a right answer, but the credit is in the reasoning. Complete this quiz after the concept and practice pages and before signing the PRR.
Current Module Questions
Question 1: SLO Sanity
Your capstone's webhook ingestion currently succeeds 99.2% of the time over the last 30 days, and traffic is about 200,000 requests per month. A team member proposes an SLO of 99.99%. Is this a reasonable SLO to commit to right now? Why or why not?
Answer: No. 99.99% over 30 days allows roughly 4 minutes of downtime and about 20 failed requests out of 200,000; current performance (99.2%) means roughly 1,600 failed requests a month, about 80 times that budget. An SLO you are systematically failing on day one will get ignored within a month. Commit to something defensible (e.g., 99.5%) with a plan to tighten once reliability investments have landed.
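The arithmetic behind that answer can be checked with a minimal sketch. The function name error_budget is hypothetical; the inputs (99.99%, 30 days, 200,000 requests, 99.2% current success) come from the question.

```python
def error_budget(slo: float, window_days: int, requests: int) -> dict:
    """Return the downtime (minutes) and failed requests an SLO allows."""
    budget_fraction = 1.0 - slo
    window_minutes = window_days * 24 * 60
    return {
        "downtime_minutes": budget_fraction * window_minutes,
        "failed_requests": budget_fraction * requests,
    }


# Proposed SLO from the question: 99.99% over 30 days at 200,000 requests.
proposed = error_budget(0.9999, 30, 200_000)

# Failures implied by the current 99.2% success rate over the same traffic.
current_failures = (1.0 - 0.992) * 200_000

print(round(proposed["downtime_minutes"], 2))  # about 4.32 minutes
print(round(proposed["failed_requests"]))      # 20 requests
print(round(current_failures))                 # 1600 requests
```

Running the same function with 0.995 shows what the more defensible target would allow.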
Question 2: Alert Discipline
An operator says: "I added a new page-level alert on disk_used > 80% so we never run out of disk." Critique this decision and propose an alternative.
Answer: disk_used > 80% is not a user-visible symptom on its own. It is a capacity signal. A page-level alert on capacity trains the operator to ignore the pager. Better: move this to a ticket-level alert, and let the SLO-burn alerts catch the user impact if disk pressure actually causes latency or errors.
Question 3: Low-Traffic Alerting
At capstone traffic of ~10 requests per minute, a 1-minute "error rate > 1%" alert fires constantly on single failures. What multi-window rewrite is appropriate?
Answer: Require both a short (e.g., 5m) window and a longer (e.g., 1h) window to exceed the burn-rate threshold before paging. This filters out single-request noise while still catching sustained burns. At very low traffic, consider dropping short-window pages entirely and keeping only the slow-burn (6h/24h) ticket.
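A minimal sketch of the two-window check, under stated assumptions: the 99.5% SLO is borrowed from Question 1, the 14.4 burn-rate threshold is a commonly cited value for a 1h/5m pair and not something this module prescribes, and the function names are hypothetical. Window sizes follow from ~10 requests per minute (50 requests in 5m, 600 in 1h).

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def should_page(short: tuple, long: tuple, slo: float, threshold: float) -> bool:
    """Page only when BOTH windows exceed the burn-rate threshold."""
    return (burn_rate(*short, slo) >= threshold
            and burn_rate(*long, slo) >= threshold)


SLO = 0.995        # assumed target, taken from the Question 1 answer
THRESHOLD = 14.4   # assumed fast-burn threshold

# Each window is (errors, total requests).
single_failure = should_page((1, 50), (1, 600), SLO, THRESHOLD)
brief_spike = should_page((4, 50), (4, 600), SLO, THRESHOLD)
sustained_burn = should_page((5, 50), (60, 600), SLO, THRESHOLD)

print(single_failure, brief_spike, sustained_burn)  # False False True
```

The brief spike exceeds the threshold in the 5m window but not in the 1h window, so it does not page; only the sustained burn does.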
Question 4: Structured Log Judgment
Given the log line:
"user alice (id 42) failed login from 10.0.0.5 because wrong password"
rewrite it as a structured log event with stable field names suitable for aggregation.
Answer:
{"ts":"...", "level":"warn", "event":"auth.login.failed",
"user_id":"42", "src_ip":"10.0.0.5", "reason":"bad_password",
"request_id":"...", "trace_id":"..."}
Key moves: dotted event name, stable keys, reason as an enum-like value, and correlation IDs.
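One way to emit such an event is sketched below. The helper name, the set of allowed reasons, and the placeholder IDs are all hypothetical; only the field names come from the answer above.

```python
import json
from datetime import datetime, timezone

# Hypothetical closed set of reasons, so "reason" stays enum-like.
ALLOWED_REASONS = {"bad_password", "unknown_user", "account_locked"}


def login_failed_event(user_id: str, src_ip: str, reason: str,
                       request_id: str, trace_id: str) -> dict:
    """Build a structured auth.login.failed event with stable keys."""
    if reason not in ALLOWED_REASONS:
        raise ValueError(f"unknown reason: {reason!r}")
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": "warn",
        "event": "auth.login.failed",
        "user_id": user_id,
        "src_ip": src_ip,
        "reason": reason,
        "request_id": request_id,
        "trace_id": trace_id,
    }


# Placeholder correlation IDs; real ones would come from the request context.
event = login_failed_event("42", "10.0.0.5", "bad_password",
                           "req-123", "trace-abc")
print(json.dumps(event))  # one JSON object per line, ready for aggregation
```

Rejecting unknown reasons at the call site keeps free-text values out of a field that dashboards will group by.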
Question 5: Dashboard Minimalism
Your current dashboard has 28 panels. Under the "three specific questions" rule (Healthy? Slow? Failing whom?), what should happen to a panel titled "Memory utilization (last 24h)"?
Answer: It does not answer any of the three questions directly. Move it to a separate "capacity" dashboard and remove it from the live-ops dashboard. If memory pressure ever actually impacts latency or errors, it will show up on the existing "Slow?" row.
Question 6: Trace vs Log
A user reports /webhook took ~900 ms and they are on provider=stripe. Logs show the request succeeded. Which signal answers the question "where did the 900 ms actually go?"
Answer: The trace. Logs tell you that the request succeeded and, at a business level, why. Only the trace's parent/child span durations reveal which hop consumed the time (DB insert? external HTTP? queue publish?). A correlated trace_id in the logs lets you jump from one to the other.
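As an illustration of reading span durations, the sketch below computes each span's self time (its duration minus its direct children's). The span names and durations are invented for the example, and the subtraction assumes the children ran sequentially; it is not the output of any particular tracing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans: list) -> dict:
    """Map span name to duration minus the sum of its direct children."""
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = (
                child_total.get(s.parent_id, 0.0) + s.duration_ms)
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0)
            for s in spans}


# Made-up trace for a 900 ms /webhook request.
spans = [
    Span("a", None, "POST /webhook", 900.0),
    Span("b", "a", "db.insert", 40.0),
    Span("c", "a", "http.provider_call", 780.0),
    Span("d", "a", "queue.publish", 30.0),
]

times = self_times(spans)
slowest = max(times, key=times.get)
print(slowest, times[slowest])  # http.provider_call 780.0
```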
Question 7: STRIDE Walk
For the flow [User] --signed JWT--> [API] --AWS SDK--> [DynamoDB], pick the STRIDE letter most likely to surface a real, unmitigated issue in a capstone and justify briefly.
Answer: Most often R -- Repudiation, because developers rarely log enough signed evidence to prove a user actually took an action, and logging is generally underweighted relative to auth / TLS / IAM work. Alternative defensible answer: I -- Information disclosure, because failed-request logging commonly leaks request bodies by mistake.
Question 8: Secrets Scenario
You discover a committed API key in a month-old commit on main. Name the first five actions, in order.
Answer:
- Revoke the key at the provider (Stripe, GitHub, etc.) immediately.
- Rotate to a new key in your secrets manager; redeploy.
- Search Git history (git log -p, trufflehog, gitleaks) to determine how many commits and branches contain the key.
- If push access to external forks is possible, assume disclosed -- history rewrite is not sufficient.
- File an incident and postmortem; add a pre-commit / CI scan to prevent recurrence.
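The recurrence-prevention step can be approximated with a small regex scan. This is only a sketch of the idea, not a substitute for the scanners named above: the pattern set is deliberately tiny, the generic pattern is an assumption about how keys tend to be assigned in code, and the sample uses AWS's documented example access key ID.

```python
import re

# Hypothetical, deliberately small pattern set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_assignment": re.compile(
        r"(?i)\b(api_key|secret|token)\b\s*[:=]\s*['\"][^'\"]{16,}['\"]"),
}


def scan(text: str) -> list:
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


sample = "\n".join([
    'timeout = 30',
    'aws_key = "AKIAIOSFODNN7EXAMPLE"',
    'API_KEY = "0123456789abcdef0123"',
])

findings = scan(sample)
print(findings)  # [(2, 'aws_access_key_id'), (3, 'generic_assignment')]
```

A pre-commit hook or CI job could fail the build whenever the findings list is non-empty.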
Question 9: Least-Privilege Diff
Your api-runtime role currently has AmazonS3FullAccess. Describe the minimum experiment that proves whether its actually-used permissions are narrower.
Answer: Enable CloudTrail for data events on the relevant buckets (or use IAM Access Analyzer / Access Advisor). Over a representative window, collect the S3 API calls the role actually makes. Replace the managed policy with a narrow inline policy listing those actions and scoped to those specific resources. Deploy to staging and confirm no AccessDenied appears during normal operation; widen minimally if it does.
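The "replace with a narrow policy" step can be sketched as follows. The observed calls and the bucket name capstone-uploads are hypothetical, and the input is assumed to be already expressed as IAM action names (mapping raw CloudTrail event names to IAM actions is left out). Only the policy document shape (Version, Statement, Effect, Action, Resource) is standard IAM JSON.

```python
import json
from collections import defaultdict


def narrow_policy(observed: list) -> dict:
    """Build an IAM policy allowing only the observed (action, resource) pairs."""
    actions_by_resource = defaultdict(set)
    for action, resource in observed:
        actions_by_resource[resource].add(action)
    statements = [
        {"Effect": "Allow", "Action": sorted(actions), "Resource": resource}
        for resource, actions in sorted(actions_by_resource.items())
    ]
    return {"Version": "2012-10-17", "Statement": statements}


# Hypothetical calls collected over a representative window.
observed_calls = [
    ("s3:GetObject", "arn:aws:s3:::capstone-uploads/*"),
    ("s3:PutObject", "arn:aws:s3:::capstone-uploads/*"),
    ("s3:GetObject", "arn:aws:s3:::capstone-uploads/*"),
    ("s3:ListBucket", "arn:aws:s3:::capstone-uploads"),
]

policy = narrow_policy(observed_calls)
print(json.dumps(policy, indent=2))
```

Two statements result: ListBucket on the bucket itself, and GetObject plus PutObject on its objects. Anything not observed is denied by default, which is what the staging run is meant to validate.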
Question 10: Failure Ranking
You have time to prepare only three failures. Your candidates are: (a) region outage, (b) bad deploy, (c) leaked API key, (d) Postgres restart, (e) downstream API slow. Which three should you prepare and why?
Answer: Likely (b) bad deploy, (e) downstream slow, and (c) leaked API key. (b) and (e) are both high-likelihood given your release cadence and external dependency. (c) is medium-likelihood but catastrophic in impact, and it has a short, well-defined runbook. (a) region outage is low-likelihood and expensive to prepare for at capstone scale -- accept and monitor. (d) Postgres restart is usually absorbed by a reasonable retry and is a subset of generic DB-failure runbooks.
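The ranking logic can be made explicit with a rough scoring sketch. The 1-3 ordinal scores and the formula (likelihood times impact, divided by preparation cost) are illustrative assumptions chosen to mirror the qualitative reasoning above, not measured values.

```python
# Illustrative ordinal scores (1 = low, 3 = high); all values are assumptions.
candidates = {
    "a: region outage":       {"likelihood": 1, "impact": 3, "prep_cost": 3},
    "b: bad deploy":          {"likelihood": 3, "impact": 2, "prep_cost": 1},
    "c: leaked API key":      {"likelihood": 2, "impact": 3, "prep_cost": 1},
    "d: Postgres restart":    {"likelihood": 2, "impact": 1, "prep_cost": 1},
    "e: downstream API slow": {"likelihood": 3, "impact": 2, "prep_cost": 1},
}


def priority(scores: dict) -> float:
    """Higher means more worth preparing for: risk per unit of prep effort."""
    return scores["likelihood"] * scores["impact"] / scores["prep_cost"]


ranked = sorted(candidates, key=lambda name: priority(candidates[name]),
                reverse=True)
top_three = sorted(ranked[:3])
print(top_three)
# ['b: bad deploy', 'c: leaked API key', 'e: downstream API slow']
```

The value of writing the scores down is that a reviewer can challenge a specific number instead of the overall gut call.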
Question 11: Retry Danger
Why is retrying a non-idempotent payment POST with a 500 response dangerous, even if you use exponential backoff?
Answer: The original request may have succeeded before the 500 reached you (network blip after the write committed). A retry then duplicates the charge. Either ensure idempotency via an idempotency key on the payment API, or do not retry. Backoff does not fix this -- it only slows the duplication.
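A minimal simulation of the idempotency-key fix is sketched below. FakePaymentAPI is a hypothetical in-memory stand-in for a provider, and the lost response is modeled as a ConnectionError raised after the write has committed; backoff is omitted because it does not affect the outcome.

```python
import uuid


class FakePaymentAPI:
    """Hypothetical provider that deduplicates on an idempotency key."""

    def __init__(self):
        self.charges = []
        self._results = {}
        self._lose_next_response = True

    def charge(self, amount_cents: int, idempotency_key: str) -> dict:
        if idempotency_key in self._results:
            # Replay: return the stored result, create no new charge.
            return self._results[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount_cents))
        self._results[idempotency_key] = {"status": 200, "charge_id": charge_id}
        if self._lose_next_response:
            # The write committed, but the caller never sees the response.
            self._lose_next_response = False
            raise ConnectionError("response lost after commit")
        return self._results[idempotency_key]


def charge_with_retry(api: FakePaymentAPI, amount_cents: int,
                      attempts: int = 3) -> dict:
    key = str(uuid.uuid4())  # one key per logical payment, reused on retry
    for attempt in range(attempts):
        try:
            return api.charge(amount_cents, key)
        except ConnectionError:
            if attempt == attempts - 1:
                raise


api = FakePaymentAPI()
result = charge_with_retry(api, 1999)
print(result["charge_id"], len(api.charges))  # ch_1 1
```

Generating a fresh key inside the retry loop instead of before it would defeat the mechanism and produce two charges.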
Question 12: Circuit Breaker Justification
A downstream service has had three 30-second outages this week. Your retry policy is 3 attempts with backoff. Explain why a circuit breaker is still worth adding.
Answer: Retries alone amplify the outage: during each outage you issue up to 3x the failing requests, which can saturate a struggling dependency and delay its recovery. A breaker, once open, fails fast for every caller instead of letting each request run its own retries, which sheds load and gives the dependency room to recover. Without a breaker, a bad 30s can stretch well beyond 30s every time.
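A minimal circuit breaker can be sketched as below. The class names, the failure threshold, and the 30-second reset are illustrative assumptions; production libraries usually add per-endpoint state, metrics, and stricter half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast")
            # Reset period elapsed: let this call through as a trial.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after trial
            raise
        self._failures = 0
        self._opened_at = None
        return result


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, clock=lambda: now[0])
attempts = []


def failing_dependency():
    attempts.append(1)
    raise IOError("dependency down")


for _ in range(5):
    try:
        breaker.call(failing_dependency)
    except (IOError, CircuitOpenError):
        pass

print(len(attempts))  # 2: the last three calls never reached the dependency
now[0] = 31.0
recovered = breaker.call(lambda: "ok")
print(recovered)  # ok
```

Retries would wrap breaker.call, so once the breaker opens, each retry returns immediately instead of adding load.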
Question 13: Backup Reality
A team says "we enabled RDS snapshots; backups are handled." Why is this answer insufficient for a PRR?
Answer: Enabling backups does not prove they can be restored. PRR requires evidence of a restore into a separate instance, timed against your RTO, with issues found and a runbook updated. "Enabled" is necessary; "drilled within the last 30 days" is sufficient.
Question 14: Runbook Structure
A runbook's mitigation section reads: "If errors are high, restart the service." Critique this and rewrite one step correctly.
Answer: Critique: no expected effect, no rollback, and no condition that distinguishes "actually useful" from "reflex." Rewrite: "If the fast-burn alert has persisted for > 10 min and traces show > 50% of failures are connection_reset, restart the service via kubectl rollout restart deploy/api. Expected effect: error rate drops below 1% within 2 min. Rollback: if errors do not drop, do not restart a second time -- escalate and investigate (possible downstream cause)."
Question 15: PRR Honesty
You are walking the 18-item PRR. Item 15 ("backup + restore drill within 30 days") cannot be marked green because no drill was performed. A peer suggests marking it green anyway "since backups are enabled." Defend the correct action.
Answer: Mark it red or yellow, not green. Red if it is a blocker (most of the time). Yellow only if you explicitly accept the risk for a limited time, with a rationale and an expected-by date (e.g., drill scheduled within 7 days). Marking untested infrastructure green destroys the trust PRRs exist to build; the next green-that-should-be-yellow is how incidents hide in plain sight.
Interleaved Review Questions
Prior Module Question 1 (S10 M03 -- Cloud Deployment & CI/CD)
Why is an immutable, signed artifact in CI a prerequisite for most of this module's security items?
Answer: Supply-chain verification (provenance check at deploy) and "tighten the CI role" (least privilege on ci-deployer) both assume the artifact is a stable, reviewable unit. Without immutable signed artifacts, you cannot attest to what was built, cannot verify it at deploy, and cannot confidently scope the CI role.
Prior Module Question 2 (S10 M02 -- Implementation & Testing)
How do integration tests from the testing pyramid support the mitigation-decision table in Cluster 4?
Answer: The table's retry/breaker/degraded-mode decisions must be tested with a failing dependency, which is an integration test, not a unit test. Without integration tests that simulate dependency failure, the table is aspiration.
Prior Module Question 3 (S9 M05 -- Cloud Security & Observability)
Which S9 M05 primitive is the direct prerequisite for "structured logs at decision boundaries" in Cluster 2?
Answer: The log aggregation pipeline and the metric-pipeline work from S9 M05. Structured logs only pay off if you can aggregate, filter, and alert on them; those capabilities are inherited from the observability work already in place.
Prior Module Question 4 (S8 M04 -- Scale, Reliability, and Performance)
Why does the "alert on the SLO, not everything" concept descend directly from S8 M04 work on symptom-based alerting?
Answer: S8 M04 introduced the idea that alerts should describe user-visible symptoms tied to SLOs, not low-level causes. This module is a capstone-scale application of that principle with burn-rate math plugged in.
Prior Module Question 5 (S6 M05 -- Distributed Systems Fundamentals)
How does distributed-systems reasoning about partial failure inform retry-and-breaker decisions?
Answer: In a distributed system, "failed request" often means "I do not know." That ambiguity is what forces idempotency, bounded retry, breakers, and graceful degradation. Without the S6 framing (timeouts, causality, partial failure), the mitigation decisions are arbitrary instead of principled.
Self-Assessment and Remediation
Mastery Level (90-100% correct):
- Ready to sign PRR. Proceed to portfolio module.
Proficient Level (75-89% correct):
- Revisit the cluster covering missed questions. Redo the matching kata. Then sign PRR.
Developing Level (60-74% correct):
- Return to the Cluster 1 and Cluster 4 concept pages with special attention to SLOs, burn rates, and failure ranking. Repeat the SLO/alert lab and the runbook kata before re-attempting.
Insufficient Level (< 60% correct):
- Do not sign PRR yet. Reread the full module, repeat all four practice pages, and retake the quiz in a week. The PRR will fail under scrutiny at the current level of understanding.