Module Quiz
Complete this quiz after finishing all concept and practice pages. Aim to answer without rereading the concept pages.
Current Module Questions
Question 1: STRIDE Letter Matching (Threat Modeling)
A webhook endpoint accepts signed payloads but does not verify the signature. An attacker replays an old payload from a legitimate sender. Which STRIDE letter best describes this threat, and what is a minimal mitigation?
Answer: Primarily S (Spoofing) with an element of T (Tampering) if the payload is modified.
Solution Walkthrough:
- The attacker is impersonating a legitimate caller -- that is spoofing.
- Without signature verification and replay protection, the system has no way to tell a real call from a replay.
- Mitigation: verify the signature on every call, include a timestamp and nonce in the signed payload, reject messages older than a short window or with a reused nonce.
If You Got This Wrong: Review Concept 1: Threat Modeling (STRIDE) for Cloud Services.
Question 2: Scenario -- Which Alert Is Wrong and Why?
On-call receives this page at 2 a.m.:
"CPU utilization exceeded 85% for 5 minutes on pod api-7f9d (replica 3 of 10). The service success rate is 99.99%."
Is this a well-designed alert? Why or why not?
Answer: Badly designed. It is a cause alert wired to a pager. The user-visible service is healthy (99.99% success rate). High CPU on one of ten replicas is an internal signal that does not map to user harm.
Solution Walkthrough:
- Symptom vs cause: symptoms belong on pagers; causes belong on dashboards.
- The success-rate number already proves users are fine, so waking someone up wastes on-call attention and trains them to ignore future pages.
- Better: track CPU as a diagnostic panel; page on success_rate < SLO or on saturation that actually predicts user harm.
If You Got This Wrong: Review Concept 14: Alerting on Symptoms, Not Causes.
Question 3: Identity as the Perimeter
A deployment pipeline currently uses a long-lived access key stored as a CI secret. Describe the identity-centric alternative in three operational steps.
Answer:
- CI authenticates to the cloud via federated identity (OIDC from the CI provider to the cloud's identity service) proving repo + workflow + branch.
- The cloud exchanges the federated token for a short-lived credential (minutes) scoped to the exact actions and resources the pipeline needs.
- No long-lived key is stored anywhere; every action is audited against the federated identity rather than a shared role.
If You Got This Wrong: Review Concept 2: Identity-Centric Security.
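The "proving repo + workflow + branch" step can be pictured as a trust-policy check on the claims of an already verified token. The sketch below assumes signature, issuer, and expiry validation happened earlier, and the claim names (repository, workflow, branch) are hypothetical; real CI providers name and format their claims differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustPolicy:
    """Exactly which pipeline is allowed to obtain credentials (hypothetical shape)."""
    repository: str
    workflow: str
    branch: str


def is_trusted(claims: dict[str, str], policy: TrustPolicy) -> bool:
    """Allow the token exchange only if every claim matches the policy exactly."""
    return (
        claims.get("repository") == policy.repository
        and claims.get("workflow") == policy.workflow
        and claims.get("branch") == policy.branch
    )


policy = TrustPolicy(repository="acme/orders", workflow="deploy", branch="main")
claims = {"repository": "acme/orders", "workflow": "deploy", "branch": "main"}
allowed = is_trusted(claims, policy)
```

A missing claim compares as None and fails the check, so a token from a fork or a feature branch is rejected rather than defaulted to trusted.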
Question 4: Envelope Encryption Flow
Explain what rotating the KEK changes and what it does not change.
Answer: Rotating the KEK changes only the wrapping layer. Each stored wrapped_DEK is unwrapped with the old KEK and re-wrapped with the new one; the old KEK must stay available for decryption until that re-wrap is complete, after which it can be retired. It does not require re-encrypting the actual ciphertext (the DEK itself is unchanged), which is why envelope encryption makes key rotation cheap.
If You Got This Wrong: Review Concept 5: Encryption and KMS Envelope Encryption.
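The rotation flow can be traced end to end with a toy example. Loud assumption: toy_cipher below is a throwaway XOR keystream built from SHA-256 so the example runs with the standard library only. It is not secure and stands in for a real authenticated cipher and a real KMS; the names toy_cipher and rewrap are hypothetical.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """TOY ONLY: XOR data with a SHA-256 counter keystream. Encrypt == decrypt."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def rewrap(old_kek: bytes, new_kek: bytes, wrapped_dek: bytes) -> bytes:
    """Unwrap the DEK with the old KEK, then wrap it with the new KEK."""
    dek = toy_cipher(old_kek, wrapped_dek)
    return toy_cipher(new_kek, dek)


# Initial state: data encrypted under the DEK, DEK wrapped under KEK v1.
dek = secrets.token_bytes(32)
kek_v1 = secrets.token_bytes(32)
record = toy_cipher(dek, b"order 1001: 3 widgets")
wrapped_dek = toy_cipher(kek_v1, dek)

# Rotation touches only the small wrapped DEK, never the record itself.
kek_v2 = secrets.token_bytes(32)
wrapped_dek_v2 = rewrap(kek_v1, kek_v2, wrapped_dek)

# Reading after rotation: unwrap with KEK v2, then decrypt the unchanged record.
recovered = toy_cipher(toy_cipher(kek_v2, wrapped_dek_v2), record)
```

The cost of rotation scales with the number of wrapped DEKs (32 bytes each here), not with the volume of data they protect.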
Question 5: Dynamic Secrets
A service needs a database credential. What does a dynamic secret give you that a rotated static secret does not?
Answer: A dynamic secret is generated per request (or per session) with a short lease, a unique DB user, and automatic revocation. If the pod is compromised, the attacker gets a short-lived credential scoped to one app's role. A rotated static secret is still a shared, long-lived value between rotations, which has a much larger blast radius.
Question 6: Data Classification
Give three fields in a typical e-commerce system and classify each on the Public / Internal / Confidential / Restricted ladder, with one-sentence justification.
Answer: (sample answer; other reasonable answers accepted)
- Product SKU -- Public (safe to cache widely; no user or secret info).
- Hashed user ID used in analytics -- Internal (pseudonymous; leakage is not catastrophic but not for public distribution).
- Credit card number -- Restricted (regulated under PCI DSS; avoid storing it at all -- tokenize via the payment processor).
Question 7: Network Control Selection
You want to ensure that traffic from your VPC to object storage never leaves your VPC. Which control do you use?
Answer: A VPC endpoint (AWS VPC Endpoint, GCP Private Google Access / Private Service Connect, Azure Private Endpoint). Security groups and NACLs filter reachability but do not control the route; endpoints create the private route that keeps the traffic inside your VPC.
Question 8: Image Hardening
Why does switching to a distroless base image reduce post-exploitation options specifically, rather than just the CVE count?
Answer: Distroless images remove the shell, package manager, and general-purpose binaries. Even if an attacker achieves code execution inside the container, they lack the interactive shells and tools used for lateral movement, reconnaissance, and payload fetching. CVE reduction is a side effect; the real win is denying the attacker post-exploitation tools.
Question 9: Scenario -- Which Alert Is Wrong and Why?
A team has this alert:
"One of five replicas of orders-svc is unhealthy for more than 2 minutes -- page on-call."
The service is designed to tolerate up to two replica failures with no customer impact. Is the alert well-designed?
Answer: Badly designed. The system is engineered for replica failure; one replica down is below any user-visible threshold. The alert pages on a cause, not a symptom. It should be demoted to a diagnostic signal. Paging should fire only when replicas_available < min_viable or when the user-visible success rate breaches SLO.
If You Got This Wrong: Review Concept 14.
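The paging rule from the answer above fits in one expression. In this sketch the function name should_page and the sample numbers are assumptions; min_viable = 3 follows from five replicas tolerating two failures.

```python
def should_page(success_rate: float, slo: float,
                replicas_available: int, min_viable: int) -> bool:
    """Page only on user-visible harm or on capacity below the designed floor."""
    return success_rate < slo or replicas_available < min_viable


# One of five replicas down, users unaffected: diagnostic signal, no page.
one_down = should_page(success_rate=0.9999, slo=0.999,
                       replicas_available=4, min_viable=3)

# Three of five down: below the designed tolerance, page.
three_down = should_page(success_rate=0.9999, slo=0.999,
                         replicas_available=2, min_viable=3)
```

CPU utilization is deliberately not an input: it stays on the dashboard as a diagnostic panel.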
Question 10: Cardinality Hazard
Why is http_requests_total{user_id} almost always a bad idea?
Answer: user_id is high-cardinality and often unbounded. Each unique user creates a new time series; with millions of users the metric becomes millions of series. Storage and query cost explode, the time-series database can OOM, and no useful query typically aggregates by raw user ID. Use logs or traces to answer per-user questions; keep metrics bounded.
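A quick worst-case calculation shows the scale of the problem. The label names and cardinalities below are made-up numbers for illustration; the product is an upper bound, since not every label combination occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series: the product of all label cardinalities."""
    return prod(label_cardinalities.values())


bounded = series_count({"method": 5, "status_class": 5, "route": 40})
with_user_id = series_count(
    {"method": 5, "status_class": 5, "route": 40, "user_id": 2_000_000}
)
print(bounded)       # 1000
print(with_user_id)  # 2000000000
```

Adding one unbounded label multiplies every existing series by the number of users, which is why per-user questions belong in logs or traces.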
Question 11: Sampling Strategy
In a service where errors are 0.1% of traffic, is 1% head sampling a good idea?
Answer: No. At 1% head sampling you keep only about 1% of error traces, which is roughly 0.001% of all traffic (0.1% x 1%) -- 99% of errors are never recorded. Prefer tail sampling at the Collector that keeps 100% of errors and slow requests plus a percentage of normal traffic, giving cheap storage while preserving incident-relevant spans.
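The arithmetic and the tail-sampling decision can be sketched as follows. The function names, the 1000 ms slow threshold, and the 1% baseline rate are assumptions for illustration; a real Collector expresses this as sampling policies in its configuration rather than as application code.

```python
import random


def expected_error_traces_kept(requests: int, error_rate: float,
                               head_rate: float) -> float:
    """Expected number of error traces that survive head sampling."""
    return requests * error_rate * head_rate


def tail_keep(is_error: bool, duration_ms: float, slow_ms: float = 1000.0,
              baseline_rate: float = 0.01, rng=random.random) -> bool:
    """Decide after the trace completes: keep all errors and slow requests,
    plus a small random share of normal traffic."""
    if is_error or duration_ms >= slow_ms:
        return True
    return rng() < baseline_rate


# 1,000,000 requests at a 0.1% error rate is 1,000 errors;
# 1% head sampling keeps about 10 of them.
kept = expected_error_traces_kept(1_000_000, 0.001, 0.01)
```

Head sampling decides before the outcome is known, so it cannot favor errors; tail sampling decides after, which is what lets it keep every one.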
Question 12: Structured Logging
Give one query you can answer with structured JSON logs that is hard with unstructured logs.
Answer: Example: count(where event = "payment_declined" and reason_code = "insufficient_funds" and env = "prod") in last 1h, grouped by region -- trivial against typed keys, very painful against free-text log lines that encode the same data in a human sentence.
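Against JSON lines, that query is a filter on typed keys plus a counter. This sketch assumes one JSON object per line with the field names used in the answer; the sample records are invented, and the one-hour time window is omitted to keep it short.

```python
import json
from collections import Counter


def declined_by_region(lines: list[str]) -> Counter:
    """Count prod payment declines for insufficient funds, grouped by region."""
    counts: Counter = Counter()
    for line in lines:
        rec = json.loads(line)
        if (rec.get("event") == "payment_declined"
                and rec.get("reason_code") == "insufficient_funds"
                and rec.get("env") == "prod"):
            counts[rec.get("region", "unknown")] += 1
    return counts


sample = [
    '{"event": "payment_declined", "reason_code": "insufficient_funds", "env": "prod", "region": "eu"}',
    '{"event": "payment_declined", "reason_code": "insufficient_funds", "env": "prod", "region": "eu"}',
    '{"event": "payment_declined", "reason_code": "card_expired", "env": "prod", "region": "us"}',
    '{"event": "payment_declined", "reason_code": "insufficient_funds", "env": "staging", "region": "us"}',
]
result = declined_by_region(sample)
```

Doing the same over free-text lines would need a regular expression per phrasing of the sentence, and it would break silently when someone rewords the message.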
Question 13: Scenario -- Silent Runner
A nightly export job for orders was scheduled. Users downstream start reporting stale data. CPU, memory, and disk on the job cluster are all nominal, no pages fired overnight. What class of alert was missing, and what would it measure?
Answer: A silent-runner / freshness alert. It should measure absence of progress: for example, time_since_last_export_batch > 70 minutes (given a 60-minute SLA). Absence-of-progress alerts do not depend on a failure mode to fire loudly; they depend on a positive heartbeat that the work actually happened.
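A freshness check of this kind compares the last recorded success against a maximum age. The sketch below assumes the job writes a timestamp on every successful batch; the name is_stale and the sample times are hypothetical, and the 70-minute threshold is the one from the answer.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_success: datetime, now: datetime, max_age: timedelta) -> bool:
    """Fire when no successful batch has been recorded within max_age."""
    return now - last_success > max_age


now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
max_age = timedelta(minutes=70)

# Last batch 30 minutes ago: healthy.
fresh = is_stale(now - timedelta(minutes=30), now, max_age)

# Last batch 3 hours ago: alert, even though CPU, memory, and disk look nominal.
stale = is_stale(now - timedelta(hours=3), now, max_age)
```

The check needs no knowledge of how the job failed; a job that never started, hung, or exited cleanly without writing anything all look the same to it.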
Question 14: Threat-Modeling Question
Apply STRIDE to a signed-webhook endpoint (as in Question 1 but more broadly). Give at least one finding for each of S, I, and D, and one mitigation per finding.
Answer (sample):
- S (Spoofing): unverified signatures -> verify signature + timestamp + nonce; reject old/replayed messages.
- I (Information disclosure): 500 responses include raw internal stack traces -> return generic error responses; log detail server-side only.
- D (Denial of service): unlimited retries from senders -> per-sender rate limits, size caps, idempotency keys so retries are safe.
If You Got This Wrong: Review Concept 1: Threat Modeling (STRIDE).
Question 15: Runbook Minimum Structure
Name the five sections a usable runbook must have, in order, and one sentence describing each.
Answer:
- Trigger -- which alert or symptom brings you here.
- Immediate verification -- 60-second checks to confirm the alert is real.
- Impact -- what users see right now.
- Diagnostic steps -- the decision tree for identifying the failure mode.
- Mitigations (reversible first) and rollback -- the ordered list of actions that have worked before.
Interleaved Review Questions
(from earlier modules in Semester 9 and cross-cutting tracks)
Prior Module Question 1 (M01 -- Cloud Platform Fundamentals)
Why does every cloud workload need a stable identity distinct from the network position it runs in?
Answer: Because identity is the access-control anchor cloud APIs use; relying on network position alone means any workload on the same network gets the same trust, which does not survive compromise.
Prior Module Question 2 (M02 -- IaC)
What does declaring infrastructure in code give you that click-ops does not, specifically for security review?
Answer: Reviewable diffs, versioned history, and the ability to run policy checks (OPA, cloud-provider policies, static analysis) on proposed changes before they land.
Prior Module Question 3 (M03 -- Container Orchestration)
Why do admission controllers sit naturally beside image signing and provenance?
Answer: Because admission is the last control point before a workload runs; verifying signatures and SLSA provenance at admission enforces supply-chain decisions at the moment of scheduling rather than relying on upstream discipline alone.
Prior Module Question 4 (M04 -- CI/CD)
Give two CI/CD habits that directly reduce supply-chain risk.
Answer: (1) Short-lived federated credentials instead of long-lived keys in CI, (2) signing and provenance attestations on every build artifact with cluster-side verification.
Prior Module Question 5 (Semester 8 M04 -- Scale, Reliability, Performance)
How does an SLO connect to alert design in this module?
Answer: The SLO defines what "healthy for users" means quantitatively; symptom alerts fire when the SLI trend predicts a breach. Without an SLO, symptom alerts have no defensible threshold.
Self-Assessment and Remediation
Mastery Level (90-100% correct):
- Ready to advance with confidence. Keep the runbook templates and STRIDE checklist in spaced repetition.
Proficient Level (75-89% correct):
- Review the concept pages for the missed items. Redo the relevant katas from practice page 4.
Developing Level (60-74% correct):
- Redo the threat-modeling lab and the observability design clinic end-to-end. Reread Concepts 1, 5, 10, 12, and 14 before retrying.
Insufficient Level (<60% correct):
- Return to the concept sequence and rebuild the symptom-vs-cause and DEK/KEK mental models before retrying. Pair with a peer to walk through one real system as a threat model plus observability spec.