
Module Quiz

Complete this quiz after finishing all concept and practice pages.

Current Module Questions

Question 1: USE vs RED

You are handed a new Postgres-backed microservice with zero monitoring. Name one metric you would put on a dashboard from the USE method and one from the RED method, and explain which question each answers.

Answer: USE: put "CPU saturation" (e.g., run-queue length relative to the number of cores) on the dashboard. It answers "is any resource becoming the bottleneck?" RED: put "request duration p99" on the dashboard. It answers "what latency are users experiencing?" USE finds the resource that is about to tip over; RED finds user-visible pain. You need both: a resource can saturate for a while before the percentiles move, and rising percentiles sometimes come from downstream dependencies rather than this service's own resources.

Question 2: Percentile Reasoning I

Service X has a latency distribution of [20, 25, 30, 35, 40, 45, 50, 55, 60, 65] ms (sorted). Service Y has [20, 22, 25, 28, 30, 32, 38, 42, 50, 468] ms.

a. Compute the mean latency for each. Are they similar?

b. Compute p50 and p99 for each (use the 10th sample as p99 for this approximation).

c. If a user page fans out to 10 independent requests, which service will more often produce a slow page?

Answer:

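a. Mean of X = 425 / 10 = 42.5 ms. Mean of Y = 755 / 10 = 75.5 ms. They are not similar. Y's mean is about 78% higher, and the whole gap comes from the single 468 ms sample (the other nine average about 32 ms).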

b. X: p50 ≈ 42.5 ms, p99 ≈ 65 ms. Y: p50 ≈ 31 ms, p99 ≈ 468 ms.

c. Y will much more often produce a slow page. This is the tail-at-scale effect. A page that fans out to 10 parallel requests waits for the slowest one. The probability that at least one of the 10 lands at the p99 is 1 − 0.99^10 ≈ 9.6%. For Y, such a page takes 468 ms; for X, it takes 65 ms. Y's p50 is better, but its tail dominates the user experience. This is why averages lie: Y's mean is higher only because of the tail, and the tail, not the mean, is what the user feels.
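The arithmetic in a to c can be checked with a short Python sketch. It uses the question's own approximation, where the 10th (slowest) sample stands in for p99. It also uses the same independence assumption as the fan-out estimate. The helper names are illustrative.

```python
from statistics import mean, median

x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [20, 22, 25, 28, 30, 32, 38, 42, 50, 468]

def summarize(samples):
    """Return (mean, p50, p99) using the quiz's approximation: p99 = slowest of 10 samples."""
    ordered = sorted(samples)
    return mean(ordered), median(ordered), ordered[-1]

def p_slow_page(per_request_p, fan_out):
    """Probability that at least one of `fan_out` independent requests is slow."""
    return 1 - (1 - per_request_p) ** fan_out

for name, samples in (("X", x), ("Y", y)):
    m, p50, p99 = summarize(samples)
    print(f"{name}: mean={m:.1f} ms, p50={p50:.1f} ms, p99~{p99} ms")

print(f"P(page of 10 hits a p99 request) = {p_slow_page(0.01, 10):.1%}")
```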

Question 3: Percentile Reasoning II

Your monitoring team wants to compute "cluster-wide p99" by taking the average of each instance's p99 and reporting that number. Is this valid? If not, what should they do instead?

Answer: No. Percentiles do not average. Each instance's p99 is a quantile of a different distribution, usually with a different request count, so averaging them produces a number that corresponds to no real quantile of anything.

The correct approach is to record every instance's request-latency distribution as a histogram with shared bucket boundaries, then compute the percentile from the combined histogram. Tools like Prometheus histogram_quantile or HdrHistogram's add-and-query do exactly this: sum the bucket counts across instances, then compute the quantile on the summed distribution.

The rule: never report the mean, median, max, or percentile of percentiles as the cluster percentile. At best, the max of per-instance p99s is an upper bound on it.
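Here is a minimal Python sketch of the histogram approach. It assumes every instance uses the same hypothetical bucket boundaries. It reports the upper bound of the bucket that contains the requested rank; Prometheus's histogram_quantile also interpolates linearly within that bucket, which this sketch omits. The two instances are made-up data: A is busy and fast, B is quiet and slow.

```python
from collections import Counter

# Hypothetical shared bucket upper bounds in ms; the last bucket catches everything else.
BOUNDS = [10, 50, 100, 250, 500, float("inf")]

def to_histogram(samples_ms):
    """Count each sample into the first bucket whose upper bound is >= the sample."""
    counts = Counter()
    for x in samples_ms:
        counts[next(b for b in BOUNDS if x <= b)] += 1
    return counts

def merge(histograms):
    """Merging histograms is just summing bucket counts."""
    total = Counter()
    for h in histograms:
        total.update(h)  # Counter.update adds counts rather than overwriting
    return total

def quantile_bucket(hist, q_percent):
    """Upper bound of the bucket holding the ceil(q% * n)-th smallest sample."""
    n = sum(hist.values())
    target = -(-q_percent * n // 100)  # integer ceiling, avoids float rounding
    seen = 0
    for b in BOUNDS:
        seen += hist[b]
        if seen >= target:
            return b

instance_a = to_histogram([10] * 990 + [100] * 10)  # 1,000 fast requests
instance_b = to_histogram([200] * 50 + [400] * 50)  # 100 slow requests

p99_a = quantile_bucket(instance_a, 99)
p99_b = quantile_bucket(instance_b, 99)
print("mean of per-instance p99s:", (p99_a + p99_b) / 2)  # not a real quantile
print("p99 of merged histogram:", quantile_bucket(merge([instance_a, instance_b]), 99))
```

The averaged figure (255 ms) badly understates the merged p99 (the 500 ms bucket). About 9% of all traffic comes from the slow instance, so the cluster-wide p99 must sit among B's requests.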

Question 4: Amdahl vs USL

A parallel data-processing job has a 5% serial fraction (Amdahl's s). It also exhibits contention with USL parameters α = 0.05, β = 0. At N = 100 workers, which law predicts higher throughput, and what is the practical takeaway?

Answer: Amdahl's law predicts speedup S(N) = 1 / (s + (1−s)/N) = 1 / (0.05 + 0.95/100) = 1 / 0.0595 ≈ 16.8x.

USL predicts C(N) = N / (1 + α(N−1) + βN(N−1)). With β = 0, this simplifies algebraically to Amdahl's formula with s = α. It therefore also predicts 100 / (1 + 0.05 × 99) = 100 / 5.95 ≈ 16.8x. Neither law predicts higher throughput; both approach the same ceiling of 1/α = 20x.

Takeaway: even without coherence costs, 100 workers give you only about 17x, so roughly 83% of the hardware's nominal capacity produces no speedup. If β > 0 (the realistic case), throughput peaks at N* = √((1−α)/β) and declines beyond it, so adding more nodes makes things worse. Always estimate α and β from measurement before promising linear scaling.
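A short Python sketch of both models shows that with β = 0 they give the same number at N = 100. The USL form used here is the standard N / (1 + α(N−1) + βN(N−1)). The β = 0.001 value in the last lines is a hypothetical coherence cost, chosen only to show the retrograde region past the peak.

```python
import math

def amdahl_speedup(n, s):
    """Amdahl: speedup with serial fraction s on n workers."""
    return 1 / (s + (1 - s) / n)

def usl_speedup(n, alpha, beta):
    """Universal Scalability Law: alpha = contention, beta = coherence cost."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

def usl_peak(alpha, beta):
    """Worker count that maximizes USL throughput (only defined for beta > 0)."""
    return math.sqrt((1 - alpha) / beta)

print(f"Amdahl, s=0.05, N=100:   {amdahl_speedup(100, 0.05):.1f}x")
print(f"USL, a=0.05, b=0, N=100: {usl_speedup(100, 0.05, 0.0):.1f}x")

beta = 0.001  # hypothetical coherence cost
peak = usl_peak(0.05, beta)
print(f"USL with b={beta}: peak near N={peak:.0f}, "
      f"{usl_speedup(round(peak), 0.05, beta):.1f}x there vs "
      f"{usl_speedup(100, 0.05, beta):.1f}x at N=100")
```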

Question 5: Stateless Design

Your web tier uses sticky sessions because "it's faster." List two concrete failure modes this design has and propose a migration path.

Answer:

Failure modes:

  1. Node loss = session loss for every user pinned to that node. On a deploy or crash, those users are logged out or lose their carts.
  2. Load imbalance: long-lived sessions pin disproportionate load on one node, and the load balancer cannot rebalance mid-session.

Migration: move session state to an external store (Redis, Memcached, or a signed JWT for small state). Keep the application tier truly stateless, then drop stickiness and load balance by least-connections. Cost: with an external store, one network hop per request (typically < 1 ms within the same VPC). Benefit: zero-downtime deploys, free horizontal rebalancing, and no "we lost user X's state" class of incident.
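For the signed-token option, here is a minimal Python sketch of the idea behind a signed session cookie. State is serialized and tagged with an HMAC under a secret that every node shares. Whichever node receives the next request verifies the tag. This is a hand-rolled illustration, not the JWT format, and SECRET is a placeholder. A real deployment would add expiry and key rotation, plus encryption if the state is sensitive: the payload here is only encoded, not hidden.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET = b"placeholder-secret-shared-by-all-nodes"  # hypothetical; load from a secret store in practice

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")

def _tag(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode("ascii"), hashlib.sha256).digest())

def issue_session(state: dict) -> str:
    """Serialize small session state and append an HMAC so any node can verify it."""
    payload = _b64(json.dumps(state, sort_keys=True).encode("utf-8"))
    return f"{payload}.{_tag(payload)}"

def read_session(token: str) -> Optional[dict]:
    """Return the state if the tag verifies, else None (treat the user as logged out)."""
    payload, sep, tag = token.partition(".")
    try:
        if sep and hmac.compare_digest(tag, _tag(payload)):
            return json.loads(base64.urlsafe_b64decode(payload))
    except (UnicodeEncodeError, TypeError):
        pass  # non-ASCII junk in the token: reject it
    return None
```

Because verification needs only the shared secret, the load balancer can route each request to any node. The trade-off versus Redis is that the state travels with every request and cannot be revoked server-side without extra machinery.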

Question 6: Caching

A team caches a product row for 30 minutes using cache-aside. Prices change and the cache is not invalidated. What is the maximum staleness a user can see, and what update pattern would tighten this without killing hit rate?

Answer: Maximum staleness is 30 minutes (the TTL). An entry cached just before the price change keeps serving the old price until it expires. Options to tighten it:

  • Invalidate on write. When a price update lands, the writing service explicitly evicts or updates the cache entry. Staleness drops to the propagation delay of the invalidate (milliseconds). This is the simplest fix.
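  • Write-through. Route writes through the cache layer, which updates the entry and synchronously writes to the DB. Staleness is near zero, but write availability now depends on cache availability. Keeping cache and DB consistent on partial failure also takes extra care, since the two updates are not atomic.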
  • Shorter TTL (e.g., 60 seconds). Bounds staleness at 60 seconds and is cheap to implement, but lowers the hit rate and sends more reads to the DB.

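The right answer in most systems is cache-aside with explicit invalidation on write, plus a modest TTL as a safety net. An invalidation can be lost, or raced by a concurrent read that repopulates the old value, and the TTL bounds how long that mistake survives.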
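The recommended pattern (cache-aside, invalidation on write, TTL safety net) can be sketched in a few lines of Python. A dict stands in for the database. The clock is injectable so TTL expiry can be shown without sleeping. The class and method names are illustrative, not any particular library's API.

```python
import time
from typing import Callable, Dict, Tuple

class CacheAside:
    """Cache-aside reads with optional invalidate-on-write and a TTL safety net."""

    def __init__(self, db: Dict[str, float], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.db = db
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries: Dict[str, Tuple[float, float]] = {}  # key -> (value, expires_at)

    def read(self, key: str) -> float:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]                  # hit: may be stale, but never older than ttl_s
        value = self.db[key]                 # miss or expired: read from the DB
        self._entries[key] = (value, now + self.ttl_s)
        return value

    def write(self, key: str, value: float, invalidate: bool = True) -> None:
        self.db[key] = value
        if invalidate:
            self._entries.pop(key, None)     # next read repopulates from the DB


now = [0.0]
cache = CacheAside({"sku-1": 10.0}, ttl_s=1800, clock=lambda: now[0])
cache.read("sku-1")                           # populates the cache
cache.write("sku-1", 12.0, invalidate=False)  # the bug in the question
stale = cache.read("sku-1")                   # still the old price
now[0] = 1800.0                               # 30 minutes later
fresh = cache.read("sku-1")                   # TTL expired: new price
cache.write("sku-1", 15.0)                    # invalidate on write
immediate = cache.read("sku-1")               # new price right away
```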

Question 7: SLI/SLO/Error Budget Scenario

Your service's SLO is 99.9% availability over 30 days. Fifteen days into the current 30-day window, you have observed:

  • total requests: 10,000,000
  • failed requests (5xx or > 1s latency): 11,000

a. What is the observed SLI?

b. How much error budget has been consumed so far (as a percentage of the monthly budget)?

c. A team wants to ship a risky feature tomorrow. What does the error-budget policy suggest? Justify.

Answer:

a. SLI = (10,000,000 − 11,000) / 10,000,000 = 99.89%.

b. Assume a steady request rate. The full 30-day window will then see about 20,000,000 requests, so the monthly budget is 0.1% × 20,000,000 = 20,000 failed requests. The pro-rated share for the first 15 days is 10,000. We have consumed 11,000: 110% of the 15-day share, or 55% of the full-month budget, with half the window remaining.

c. The team is already burning faster than the budget allows. It has consumed 110% of the pro-rated share (a burn rate of 1.1), and at this pace the monthly budget runs out around day 27. The error-budget policy says to block the risky feature until the burn rate returns below 1. The alternative is a measured risk, taken only if the feature's expected reliability impact is net-positive (rare). The point of the error budget is this negotiation: when budget is healthy, velocity wins; when it is consumed, reliability work wins. The budget makes the trade-off numerical instead of political.
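Here is the budget arithmetic as a small Python sketch. Like the answer, it assumes a steady request rate across the 30-day window. Burn rate here means the fraction of the full-window budget consumed, divided by the fraction of the window elapsed.

```python
def error_budget_report(slo, total, failed, days_elapsed, window_days=30):
    sli = (total - failed) / total
    projected_total = total * window_days / days_elapsed  # steady-rate assumption
    budget = (1 - slo) * projected_total                  # allowed failures for the window
    consumed = failed / budget                            # fraction of full-window budget used
    burn_rate = consumed / (days_elapsed / window_days)   # 1.0 = exactly on budget
    exhausted_on_day = days_elapsed / consumed            # day the budget runs out at this pace
    return sli, budget, consumed, burn_rate, exhausted_on_day

sli, budget, consumed, burn, day = error_budget_report(0.999, 10_000_000, 11_000, 15)
print(f"SLI={sli:.2%}  budget={budget:,.0f}  consumed={consumed:.0%}  "
      f"burn rate={burn:.2f}  budget exhausted around day {day:.0f}")
```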

Question 8: Failure Modes

A single slow backend causes every caller's thread pool to saturate, which causes every upstream of those callers to also saturate. Which failure mode is this, what single pattern would have most contained it, and where in the call graph does the pattern go?

Answer: Cascading failure. The pattern: timeouts + circuit breakers at every remote-call boundary. Place them on the client side of every dependency call (not just at the ingress). A circuit breaker opens after a threshold of failures, short-circuits further calls for a cool-down period, and thereby isolates the slow backend from the caller's thread pool. Without this, one slow peer silently converts every caller into a slow caller, and so on upstream.
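Below is a minimal, single-threaded Python sketch of the client-side breaker described above. It counts consecutive failures and fails fast while open. After the cool-down, it lets one trial call through (the "half-open" state). The thresholds are hypothetical. The wrapped call is assumed to enforce its own timeout, so a slow backend shows up as a raised exception. A production breaker would also need thread safety, and usually tracks a failure-rate window instead of a simple counter.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

class CircuitOpenError(Exception):
    """Raised immediately, without calling the backend, while the breaker is open."""

class CircuitBreaker:
    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None and self.clock() - self.opened_at < self.cooldown_s:
            raise CircuitOpenError("failing fast; backend recently unhealthy")
        # Closed, or cool-down elapsed (half-open): let this call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None                  # a success closes the breaker
        return result
```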

Question 9: Chaos Engineering

Your team's leadership says "why would we break things in production when we already test in staging?" Give a two-sentence argument for a production chaos experiment, and describe the blast-radius controls that make it safe.

Answer: Staging is a different system - different traffic shapes, different data distribution, different failure correlations - so passing in staging says nothing definitive about production. Blast-radius controls make production experiments safe: (a) limit the experiment to a small fraction of traffic (1-5%) or one region; (b) monitor key SLIs with automatic abort thresholds; (c) define the hypothesis and steady-state metric in writing; (d) a big red button (kill switch) that stops the experiment in under 60 seconds. Without these, do not run in production; with these, you are buying empirical evidence of resilience.
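Two of those controls, a small deterministic traffic slice and an automatic abort threshold, can be sketched in Python as follows. Hashing a request or user ID gives a stable cohort without shared state. The 1% fraction and the 0.5-percentage-point error-rate margin are hypothetical values. The kill switch is modeled as a plain flag that an operator or the abort check can flip.

```python
import hashlib

kill_switch = {"engaged": False}  # hypothetical flag; in practice a feature-flag or config service

def in_blast_radius(request_id: str, fraction: float = 0.01) -> bool:
    """Deterministically place about `fraction` of IDs in the experiment cohort."""
    if kill_switch["engaged"]:
        return False
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return position < fraction

def check_abort(control_error_rate: float, experiment_error_rate: float,
                max_delta: float = 0.005) -> bool:
    """Engage the kill switch if the experiment cohort's SLI degrades past the margin."""
    if experiment_error_rate - control_error_rate > max_delta:
        kill_switch["engaged"] = True
    return kill_switch["engaged"]
```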

Question 10: Little's Law

An HTTP server processes an average of 500 req/s with a mean response time of 200 ms. What is the average number of requests in-flight at any moment? Your thread pool is sized at 50. Is that enough?

Answer: L = λ × W = 500 × 0.200 = 100 requests in flight on average. Assume one thread per in-flight request (a blocking server). Then a pool of 50 is insufficient. At 200 ms per request, 50 threads can complete at most 50 / 0.2 = 250 req/s, half the arrival rate, so the queue grows without bound and latency climbs until requests time out. You need at least 100 worker threads, plus headroom (~1.5x = 150), to handle mean load without queuing; peaks need more. Alternatively, reduce W (lower latency per request) or reduce λ per instance (scale out).
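The same capacity check fits in a few lines of Python, assuming each in-flight request occupies one worker thread. Besides L, it computes the throughput ceiling of a 50-thread pool (threads / W). The 1.5x headroom factor is the answer's rule of thumb, not a derived value.

```python
def littles_law_in_flight(arrival_rate_per_s: float, mean_latency_ms: float) -> float:
    """L = lambda * W, with latency converted from ms to seconds."""
    return arrival_rate_per_s * mean_latency_ms / 1000

def max_throughput_per_s(threads: int, mean_latency_ms: float) -> float:
    """Upper bound on completions per second if each request holds one thread."""
    return threads * 1000 / mean_latency_ms

in_flight = littles_law_in_flight(500, 200)  # average requests in flight
ceiling = max_throughput_per_s(50, 200)      # best case for a 50-thread pool
recommended = int(in_flight * 1.5)           # threads with ~1.5x headroom
print(f"in flight={in_flight:.0f}, 50-thread ceiling={ceiling:.0f} req/s, "
      f"recommended pool={recommended}")
```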

Question 11: Load Shedding

A service is receiving 10x its capacity in requests. You can either let all requests slow down to 10x latency, or fast-reject 90% and serve 10% fast. Which do you pick in general and why?

Answer: Fast-reject 90%. Under 10x overload, letting every request slow to 10x delivers almost zero useful responses (most clients have timed out, many retry, creating more load - a retry storm). Fast-rejecting 90% with 429 or 503 and serving 10% fast means: (a) 10% of users have a good experience, (b) 90% fail fast and can fall back or retry with backoff, (c) upstreams can shed in response. The total useful work is higher. The exception is when some requests are more valuable than others (paying customers, critical endpoints); use admission control to preserve those and shed the rest.
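Here is a minimal Python sketch of that admission-control exception: a concurrency limit that fast-rejects excess work and keeps a few slots reserved for critical requests. The limit and reservation sizes are hypothetical. A real server would update the counter under a lock or with an atomic, and would return 429 or 503 when try_admit fails.

```python
class AdmissionController:
    """Concurrency-limit load shedder with slots reserved for critical traffic."""

    def __init__(self, limit: int, reserved_for_critical: int):
        self.limit = limit
        self.reserved = reserved_for_critical
        self.in_flight = 0

    def try_admit(self, critical: bool = False) -> bool:
        cap = self.limit if critical else self.limit - self.reserved
        if self.in_flight >= cap:
            return False    # shed: the caller should respond 429/503 immediately
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1  # call when an admitted request finishes


shedder = AdmissionController(limit=10, reserved_for_critical=2)
normal = [shedder.try_admit() for _ in range(100)]  # 10x overload burst
critical = [shedder.try_admit(critical=True) for _ in range(5)]
print(f"normal admitted: {sum(normal)}/100, critical admitted: {sum(critical)}/5")
```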

Question 12: Observability Pillars

You get a bug report: "checkout is slow for some users but not others." You have access to metrics, logs, and traces. In what order do you use them and why?

Answer: Metrics first to confirm and scope the issue: is the service-wide p99 elevated, or only for a subset of users? Metrics tell you whether this is widespread or targeted, and when it started. Traces next: pick a slow request trace and find the span that dominates the latency. That tells you where in the service graph the slowness lives. Logs last on the suspect span or service to answer why (error messages, parameter values, internal state). Using logs first is the single most common time-waster; without narrowing by metrics and traces, you are searching random log volume by hand.

Question 13: Incident Lifecycle

You are paged at 3am for a latency SLO breach. You discover a deploy 5 minutes before the alert. What is the correct first action, and why is it not "root-cause the bug"?

Answer: Roll back the deploy. The first action during active user impact is mitigation, not understanding. Rollback is the fastest mitigation with the smallest blast radius for a deploy-correlated alert. Root-causing can take hours; mitigation ends user pain in minutes. Only after the impact is contained do you investigate what the bug was, reproduce, and fix forward. Confusing mitigation with resolution is the single most common reason incidents run long.

Question 14: Blameless Postmortems

An engineer accidentally dropped a production index during a migration, causing 40 minutes of outage. The easy finding is "Engineer X ran the wrong command." Write the correct root-cause framing.

Answer: The correct framing asks: what system allowed a single wrong command to cause 40 minutes of outage? Contributing factors likely include:

(a) production credentials available at the shell (no change-management gate),
(b) no DROP INDEX confirmation or dry-run,
(c) no online index replacement - the drop was destructive,
(d) no backup or quick-restore path for the index,
(e) no peer review on the migration script.

Action items target each of these: gate production credentials behind change management, add --dry-run and typed confirmation, build online index replacement, document a rollback path for the index, and require peer review for destructive DDL. The person is not the cause; the absence of these safeguards is. A blameful report shames one engineer; a blameless report produces five improvements the organization keeps forever.

Question 15: Composition

A new feature is being designed. It reads from a database, computes a recommendation, and returns it. Combine ideas from three different clusters of this module to make it production-ready: one scaling idea, one reliability idea, and one observability idea.

Answer:

  • Scaling: cache the recommendation with a 60s TTL in a reverse-proxy cache (or at the CDN, if responses are not per-user). Top-N recommendations tend to concentrate on a hot set, so hit rates are high. At a 90% hit rate, this removes ~90% of read load from the DB.
  • Reliability: wrap the recommendation computation in a timeout + circuit breaker (e.g., 200ms timeout, break after 5 consecutive failures). On break, fall back to a cheap default (most-popular list) so a downstream slowdown does not take the endpoint down.
  • Observability: emit a RED dashboard (rate, errors, duration) for the endpoint, propagate trace context on every call, and track a cache_hit_rate metric. Promote endpoint latency to an SLO (e.g., p99 < 300 ms, met 99.9% of the time over a rolling 28 days) with a burn-rate alert, so the first signal of trouble arrives before users notice.

Interleaved Review Questions

Prior Module Question 1 (S6 Module 5: Distributed Systems Fundamentals)

Why is a timeout the only mechanism a node has to distinguish "slow" from "dead," and what does that imply for back-pressure design under overload?

Answer: In an asynchronous network, no node can distinguish a slow peer from a dead peer by observation; only a timeout makes that call, and the choice is probabilistic. Implication: every remote call must have a timeout, and under overload those timeouts will fire even when the peer is "fine but slow." Back-pressure design exploits this. Under healthy conditions timeouts are rare; when they rise, upstreams shed load, so the system degrades gracefully. A design without timeouts (or with timeouts longer than is meaningful to the user) converts peer slowdowns into upstream saturation and cascading failure.
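A small asyncio sketch illustrates the point above. From the caller's side, a peer that answers too slowly and a peer that never answers produce the same observation: a timeout. The peer coroutines and the 100 ms deadline are hypothetical.

```python
import asyncio

async def slow_peer(delay_s: float) -> str:
    await asyncio.sleep(delay_s)
    return "reply"

async def dead_peer() -> str:
    await asyncio.Event().wait()  # never set: the reply never arrives
    return "reply"

async def call_with_timeout(peer_call, timeout_s: float) -> str:
    """Every remote call gets a deadline; on expiry the caller only knows 'no reply yet'."""
    try:
        return await asyncio.wait_for(peer_call, timeout=timeout_s)
    except asyncio.TimeoutError:
        return "timeout"

async def main() -> list:
    return [
        await call_with_timeout(slow_peer(0.01), 0.1),  # alive and fast enough
        await call_with_timeout(slow_peer(0.5), 0.1),   # alive but slow
        await call_with_timeout(dead_peer(), 0.1),      # dead
    ]

results = asyncio.run(main())
print(results)
```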

Prior Module Question 2 (S6 Module 3: Replication and Partitioning)

How does cross-region replication interact with SLO design for write availability?

Answer: Synchronous cross-region replication makes writes wait for a quorum across WAN links, which adds tens to hundreds of ms and makes writes correlate with the WAN's tail. Asynchronous replication preserves write availability and latency in a single region but means some acknowledged writes will be lost if that region fails. The SLO for write latency and the SLO for write durability are in tension; you cannot have 99.999% multi-region durability and sub-10ms write latency. Pick the pair that matches the business requirement; surface the trade-off in the SLO doc.

Prior Module Question 3 (S3 Module 2: Probability)

Why does the binomial-tail intuition directly apply to tail-at-scale latency, and what does it predict about fan-out width?

Answer: Suppose each of N independent parallel requests has probability p of exceeding the p99 threshold. The probability that at least one does is 1 − (1−p)^N. With p = 0.01 and N = 100, that is 63%. Doubling fan-out from 50 to 100 raises tail exposure from 39% to 63%. The prediction: tail-latency exposure grows with fan-out width. The growth is roughly linear (≈ N·p) while N·p is small, then saturates toward certainty. Mitigations are hedged requests, tighter p99s at each hop, or reducing fan-out.
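A few lines of Python reproduce these numbers and show where the linear N·p approximation stops working. The sketch still assumes independent requests.

```python
def p_any_slow(p: float, fan_out: int) -> float:
    """P(at least one of `fan_out` independent requests exceeds its p99) = 1 - (1-p)^N."""
    return 1 - (1 - p) ** fan_out

for n in (1, 10, 50, 100, 200):
    exact = p_any_slow(0.01, n)
    linear = min(1.0, 0.01 * n)  # the naive "N x p" estimate, capped at certainty
    print(f"N={n:>3}: exact={exact:6.1%}  linear={linear:6.1%}")
```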

Prior Module Question 4 (S5 Module 4: Networking)

Why does HTTP/2 or HTTP/3 help tail latency, and where does it not?

Answer: HTTP/1.1 suffers head-of-line blocking at the connection level; a single slow request delays others on the same connection. HTTP/2 multiplexes streams within a connection, which removes that request-level blocking. However, all streams still share one TCP connection, so a single lost packet stalls every stream until it is retransmitted. HTTP/3 uses QUIC (UDP-based), which recovers from loss per stream, so packet loss on one stream does not stall others. These help client-perceived latency, especially under loss. They do not help server-side tail - if the server itself is slow on a request, no transport fix makes it faster.

Prior Module Question 5 (S2 Module 3: Data Structures)

Why is the latency of a hash-table lookup often described as "amortized O(1)" but showing up as a tail spike in production?

Answer: Strictly, it is inserts that are amortized O(1): the occasional rehash cost is spread over many operations. The rehash itself is O(n) and lands on a single insert, creating a sharp tail spike. In production, a hash table backing an app cache or a language runtime's dict or map causes occasional request-level latency spikes during resize. This is one reason p99 latency can rise even when everything looks healthy. Fixes:

- pre-size the table;
- use structures with incremental rehashing (e.g., Redis's dict);
- use data structures with predictable worst-case cost, such as B-trees.
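A toy Python model shows the effect. It ignores hashing and collisions and only counts work per insert, for a table that doubles its capacity once it would exceed 75% full and rehashes every existing entry at once. The initial capacity and load factor are hypothetical. The point is the shape: cheap on average, occasionally very expensive.

```python
def insert_costs(n: int, initial_capacity: int = 8, max_load: float = 0.75) -> list:
    """Per-insert work units for a table that rehashes all entries when it grows."""
    capacity, size, costs = initial_capacity, 0, []
    for _ in range(n):
        cost = 1                          # placing the new entry
        if size + 1 > capacity * max_load:
            cost += size                  # resize: re-insert every existing entry
            capacity *= 2
        size += 1
        costs.append(cost)
    return costs

costs = insert_costs(1000)
print(f"mean work per insert: {sum(costs) / len(costs):.2f}")
print(f"worst single insert:  {max(costs)}")
print(f"inserts that triggered a resize: {sum(1 for c in costs if c > 1)}")
```

Averaged over 1,000 inserts, the cost is about 2.5 units. The worst single insert, however, does 769 units, roughly 300x the average, and that is the request that shows up in the tail.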

Self-Assessment and Remediation

Mastery Level (90-100% correct):

  • Ready for the capstone and any system-design interview round on reliability.
  • Your instinct should now be: when you see a production issue, map it to a specific cluster's vocabulary within a minute.

Proficient Level (75-89% correct):

  • Review missed concepts. Most gaps are in the percentile-reasoning and error-budget questions (2, 3, 7) - those reward fluency, not memorization.
  • Redo Practice 3 (Reliability and SLO Clinic) and one Kata from Practice 4.

Developing Level (60-74% correct):

  • Rework Cluster 1 (Performance Reasoning) and Cluster 3 (Reliability Engineering) end to end.
  • Redo Practice 1 (Performance Profiling Lab) and Practice 3 (Reliability and SLO Clinic).
  • Re-read the Google SRE Book chapters on SLIs/SLOs and on Managing Incidents; they will re-anchor your intuition.

Insufficient Level (<60% correct):

  • The module has not landed. Restart from the module index.md diagnostic questions. Identify which three concepts you cannot state in one sentence, and read the concept page plus its first "Read This Only If Stuck" link for each. Do not proceed to the capstone until you can pass this quiz above 75%.