Performance Profiling Lab
Retrieval Prompts
- State the USE method's three axes and which resource types they apply to.
- State the RED method's three metrics and which service types they apply to.
- State the Four Golden Signals and how they relate to USE and RED.
- Define p50, p95, p99, and p999 in one sentence each.
- State Amdahl's Law in its formula form and name the term that bounds maximum speedup.
- State the Universal Scalability Law's two penalty terms and what each represents.
Compare and Distinguish
Separate these pairs cleanly in writing:
- latency vs throughput
- utilization vs saturation
- average latency vs median latency
- p99 vs max
- Amdahl's Law vs Universal Scalability Law
- "faster under load" vs "more capacity under load"
Common Mistake Check
For each statement, identify the error:
- "Our average response time is 50ms, so users are happy."
- "We added 10 more CPUs and throughput only went up 2x - the load balancer must be broken."
- "CPU is at 60% so we have 40% headroom."
- "p95 of p99 across our ten servers was 200ms."
- "At 95% CPU utilization we're making maximum use of the machine."
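Several of the statements above treat utilization as if it mapped linearly to headroom. A minimal sketch of why that can mislead, assuming a single-server M/M/1 queue (a modelling assumption, not something this lab specifies) and a hypothetical 10 ms service time:

```python
def mm1_response_time(service_time_ms, utilization):
    """Mean time in system for an M/M/1 queue: R = S / (1 - rho).

    Assumes Poisson arrivals and exponential service times, which real
    systems only approximate.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)


# Hypothetical 10 ms service time; only the shape of the curve matters here.
for rho in (0.30, 0.60, 0.80, 0.95):
    print(f"utilization {rho:.0%}: mean response {mm1_response_time(10, rho):.0f} ms")
```

Under this model the response time grows as 1 / (1 - utilization), so the last few percent of utilization cost far more latency than the first few.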
Percentile Reasoning Drill
You have two services, both serving 10,000 requests/minute.
Service A latency distribution (milliseconds, 10 sampled buckets representing the distribution):
[20, 25, 28, 30, 32, 35, 40, 50, 60, 80]
Service B latency distribution:
[20, 22, 24, 26, 28, 30, 34, 40, 50, 626]
- Compute the mean latency for each.
- Compute p50, p90, and p99 for each (use the sorted sample directly).
- One team proposes "the two services have essentially the same performance because their averages are close." Write a 3-sentence rebuttal grounded in the numbers.
- For a user opening a page that fans out to 20 calls against the service, estimate the probability that at least one of those 20 calls lands at or beyond Service B's p99 latency, so that the slowest call sets the page latency. (Hint: tail-at-scale.)
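To check your arithmetic for the drill above, here is a small sketch. It assumes the nearest-rank percentile definition (other definitions interpolate and give different values on a 10-point sample) and assumes the 20 fan-out calls are independent. The helper names are illustrative. Only Service A is worked here; the same calls apply to Service B.

```python
import math

service_a = [20, 25, 28, 30, 32, 35, 40, 50, 60, 80]


def mean(samples):
    return sum(samples) / len(samples)


def nearest_rank(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p * n / 100) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def prob_any_at_or_beyond_p99(fanout):
    """Chance that at least one of `fanout` independent calls is a p99-or-worse call."""
    return 1 - 0.99 ** fanout


print("mean:", mean(service_a))
for p in (50, 90, 99):
    print(f"p{p}:", nearest_rank(service_a, p))
print(f"P(slowest of 20 is p99 or worse): {prob_any_at_or_beyond_p99(20):.3f}")
```

With only 10 samples, p99 under nearest-rank is simply the largest value, which is one reason small samples say little about the tail.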
USE/RED Dashboard Design
For a single Kubernetes Node running a Postgres database, design:
- The USE view (resources × U, S, E) -- list at least 5 resources and the specific metric you would graph for each.
- The RED view for the Postgres query workload -- rate, errors, duration.
- Which panels would you promote to SLO alerts? Which are investigation-only?
- Cite the one concrete signal that would have told you Postgres is saturated before query latency degrades.
Amdahl vs USL Calibration
A workload has a serial fraction s = 0.05 (Amdahl), or α = 0.05 and β = 0.0005 in USL, with reference throughput C(1) = 100 req/s.
- Compute Amdahl's predicted speedup at N = 10 and N = 100.
- Compute USL throughput at N = 10, 50, 100, 200. Identify the peak N.
- Past the peak, USL predicts throughput decreases with more nodes. Give one real mechanism that could cause that in a distributed service.
- Which law is more useful for predicting peak capacity, and which is more useful for sanity-checking a parallel algorithm's ceiling?
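The numeric parts of this exercise can be checked with a short sketch. It uses the standard forms S(N) = 1 / (s + (1 - s)/N) for Amdahl and C(N) = C(1) · N / (1 + α(N - 1) + βN(N - 1)) for USL. The function names are illustrative, and the peak is found by brute force over integer N from 1 to 200.

```python
def amdahl_speedup(n, serial_fraction):
    """Amdahl's Law: speedup on n processors with the given serial fraction."""
    return 1 / (serial_fraction + (1 - serial_fraction) / n)


def usl_throughput(n, alpha, beta, c1):
    """Universal Scalability Law: throughput at n nodes, scaled by C(1) = c1."""
    return c1 * n / (1 + alpha * (n - 1) + beta * n * (n - 1))


# Parameters from the exercise statement.
S, ALPHA, BETA, C1 = 0.05, 0.05, 0.0005, 100

for n in (10, 100):
    print(f"Amdahl speedup at N={n}: {amdahl_speedup(n, S):.2f}x")

for n in (10, 50, 100, 200):
    print(f"USL throughput at N={n}: {usl_throughput(n, ALPHA, BETA, C1):.1f} req/s")

# Brute-force search for the integer N with the highest predicted throughput.
peak_n = max(range(1, 201), key=lambda n: usl_throughput(n, ALPHA, BETA, C1))
print("USL peak N:", peak_n)
```

As a cross-check, the continuous USL peak sits at N* = sqrt((1 - α) / β); the brute-force result should be one of the integers next to it.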
Flame Graph Sketch
You are told "latency doubled over the last week." You have access to production CPU flame graphs from a week ago and today.
- Describe what you would look for in the diff (new columns, widened columns, shifted flames).
- Name one change category each for: (a) a new code path, (b) a dependency slowdown, (c) a lock contention issue.
- What instrumentation would you add before the next deploy to shorten your next such investigation?
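One way to make "widened columns" concrete is to diff two profiles in folded-stack form (one line per stack, frames joined by `;`, followed by a sample count). The sketch below is a rough illustration, not a replacement for a real differential flame graph tool. The frame names and sample counts are hypothetical, and it compares each frame's share of samples rather than absolute time.

```python
from collections import Counter


def frame_shares(folded_text):
    """Map each frame to its inclusive share of all samples in a folded profile."""
    totals = Counter()
    grand_total = 0
    for line in folded_text.strip().splitlines():
        stack, _, count = line.rpartition(" ")
        samples = int(count)
        grand_total += samples
        for frame in set(stack.split(";")):  # count a frame once per stack
            totals[frame] += samples
    return {frame: n / grand_total for frame, n in totals.items()}


def widened_frames(before, after, threshold=0.05):
    """Frames whose share of samples grew by at least `threshold`, largest first."""
    old, new = frame_shares(before), frame_shares(after)
    deltas = {f: new.get(f, 0.0) - old.get(f, 0.0) for f in set(old) | set(new)}
    grown = [(f, d) for f, d in deltas.items() if d >= threshold]
    return sorted(grown, key=lambda item: -item[1])


# Hypothetical profiles: a new spinning-lock frame appears under `query`.
last_week = "main;handle;parse 60\nmain;handle;query 40"
today = "main;handle;parse 50\nmain;handle;query 30\nmain;handle;query;mutex_spin 20"

for frame, delta in widened_frames(last_week, today):
    print(f"{frame}: +{delta:.0%} of samples")
```

A frame that is new in today's profile shows up with its full share as the delta, which covers the "new columns" case. Note that a CPU profile only sees on-CPU time, so blocked (non-spinning) lock waits would need an off-CPU profile instead.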
Evidence Check
This practice page is complete only if you can:
- Compute percentiles from a small sample and explain why averaging them across shards is invalid.
- Draw USE and RED dashboards for a real service from memory.
- Apply Amdahl and USL numerically to a proposed scale-out plan and identify the regime change.
- Tell the difference between "more CPU" and "faster CPU" for a given bottleneck.
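For the first item above, a small sketch can show why per-shard percentiles do not combine by averaging. The two shards and their latencies are hypothetical, and the percentile uses the nearest-rank definition as an assumption.

```python
import math


def nearest_rank(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p * n / 100) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical shards with 100 requests each (latencies in ms).
shard_fast = [10] * 99 + [20]
shard_slow = [10] * 90 + [500] * 10

per_shard_p99 = [nearest_rank(shard, 99) for shard in (shard_fast, shard_slow)]
averaged_p99 = sum(per_shard_p99) / len(per_shard_p99)

# The p99 users actually experience comes from the pooled request population.
pooled_p99 = nearest_rank(shard_fast + shard_slow, 99)

print("per-shard p99:", per_shard_p99)
print("average of per-shard p99:", averaged_p99)
print("pooled p99:", pooled_p99)
```

The averaged figure matches no request population at all. Recovering a valid fleet-wide percentile needs the raw samples or mergeable histograms from each shard, not the per-shard percentile values.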