Identify Bottlenecks and Single Points of Failure
What This Concept Is
A bottleneck is the component or resource that limits system throughput or latency; until it is relieved, money spent anywhere else buys no additional capacity. A single point of failure (SPOF) is a component whose failure takes the system (or a critical path) offline. They are not the same thing:
- A slow database read is a bottleneck; it may still have replicas and HA.
- A unique coordinator service with one instance is a SPOF even if it is fast.
The Cluster 4 stress tests produce candidate bottlenecks and SPOFs. This concept is the explicit act of listing them, ranking them, and, for each one, stating whether you are fixing it now or accepting it.
A useful framing drawn from Fundamentals of Software Architecture: bottlenecks and SPOFs are the risk surface of your architecture characteristics. The "reliability" and "scalability" numbers you committed to in Cluster 1 become auditable only when their failure modes are enumerated. A design that claims 99.99% availability but contains a never-tested, single-instance dependency is not a 99.99% design -- it is a 99.9% design with a confidence interval. Putting the SPOF list in the doc is how you turn characteristics from claims into commitments.
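The arithmetic behind that claim can be sketched in a few lines. This is a simplified model, not a measurement: it assumes every component on the path is a hard dependency and that failures are independent, which is optimistic for real systems. The availability figures are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability figure."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a request path that needs every component up.

    Assumes independent failures (a simplifying assumption).
    """
    result = 1.0
    for a in components:
        result *= a
    return result


# Hypothetical numbers: a 99.99% service that hard-depends on a
# single-instance component which is only 99.9% available.
path = serial_availability(0.9999, 0.999)
print(f"path availability: {path:.4%}")
print(f"claimed budget: {downtime_minutes_per_year(0.9999):.1f} min/year")
print(f"actual budget:  {downtime_minutes_per_year(path):.1f} min/year")
```

Under these assumptions the path can never be more available than its weakest hard dependency, which is why the single-instance component sets the real number.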
There is a discipline to the ranking. Bottlenecks rank by expected marginal impact per dollar of remediation. SPOFs rank by blast radius × probability × recovery time. The two rubrics are not interchangeable, and the SPOF score is not the whole decision: once remediation cost is weighed in, a cheap-to-fix moderate SPOF can be scheduled ahead of an expensive-to-fix severe one. "Accept with reason" is a legitimate outcome at every tier, provided it is written down.
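A minimal sketch of the two rubrics follows. The class names, field names, units, and every number are illustrative assumptions; the only thing taken from the text is the shape of each score.

```python
from dataclasses import dataclass


@dataclass
class Bottleneck:
    name: str
    impact: float        # expected marginal gain, in any consistent unit
    fix_cost_usd: float  # estimated remediation cost

    @property
    def score(self) -> float:
        # Expected marginal impact per dollar of remediation.
        return self.impact / self.fix_cost_usd


@dataclass
class Spof:
    name: str
    blast_radius: float       # fraction of traffic affected, 0..1
    failures_per_year: float  # stands in for probability
    recovery_hours: float

    @property
    def score(self) -> float:
        # Blast radius x probability x recovery time.
        return self.blast_radius * self.failures_per_year * self.recovery_hours


def ranked(items):
    """Highest score first."""
    return sorted(items, key=lambda item: item.score, reverse=True)


# Hypothetical estimates, for illustration only.
bottlenecks = [
    Bottleneck("ranking service CPU", impact=30, fix_cost_usd=60_000),
    Bottleneck("fan-out worker throughput", impact=50, fix_cost_usd=20_000),
]
spofs = [
    Spof("user-auth service", blast_radius=1.0, failures_per_year=0.5, recovery_hours=1.0),
    Spof("shared Redis cluster", blast_radius=0.6, failures_per_year=4.0, recovery_hours=0.5),
]

for item in ranked(bottlenecks) + ranked(spofs):
    print(f"{item.name}: {item.score:.4f}")
```

With these made-up inputs the shared Redis cluster outranks the auth service despite its smaller blast radius, because it is assumed to fail far more often. The scores are only as good as the estimates fed in; the value is in forcing the estimates to be written down.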
Why It Matters Here
Reviewers will find the bottleneck and SPOF you missed. You want to find them first and tell them.
- Unidentified SPOFs kill availability claims. "99.9%" is a fantasy if a single Redis node shared across services dies once a week.
- Unidentified bottlenecks waste money on horizontal scaling that cannot help.
- An explicit list of accepted bottlenecks and SPOFs is a sign of seniority; it says "we looked, these are the ones worth fixing now, these are the ones we're accepting for a reason".
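The Redis example above can be checked with a rough calculation. The recovery time is not given in the text, so the 15 minutes used here is an assumption; the function names are illustrative.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080


def availability(failures_per_week: float, recovery_minutes: float) -> float:
    """Fraction of time up, given failure frequency and mean recovery time."""
    downtime = failures_per_week * recovery_minutes
    return 1.0 - downtime / MINUTES_PER_WEEK


def max_recovery_minutes(target: float, failures_per_week: float) -> float:
    """Longest mean recovery time that still meets the availability target."""
    return (1.0 - target) * MINUTES_PER_WEEK / failures_per_week


# One failure per week (from the example), 15-minute recovery (assumed).
print(f"availability: {availability(1, 15):.4%}")
print(f"recovery budget at 99.9%: {max_recovery_minutes(0.999, 1):.2f} min")
```

A component that fails weekly has roughly ten minutes per failure before it alone consumes a 99.9% budget, leaving nothing for every other failure mode in the system.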
This is also where this module connects directly to S8M4 (scale/reliability/performance).
Concrete Example
Feed design, stress-test output (abridged):
Candidate bottlenecks:
- Fan-out worker throughput when a large account posts (50 M follower rows × write cost). Remediation: celebrity detection threshold; batch write; read-fanout for top N%.
- Timeline store hot partition for users who follow many active accounts (partition size growth). Remediation: time-bucket partitions; cold-tier older entries.
- Ranking service CPU under peak load (ML scoring cost per post). Remediation: pre-score popular posts; cache scores per (user, post) for hot pairs; use a cheaper model for cold reads.
- Global LB TLS termination under spike load. Remediation: anycast with per-region termination; session resumption; larger instance families during anticipated spikes.
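The first remediation in the list (a celebrity threshold with read-time fan-out for the largest accounts) can be sketched as a toy in-memory model. The `Feed` class, its method names, and the threshold value are all hypothetical, and a real system would merge and rank the two sources rather than concatenate them.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 100_000  # hypothetical follower cutoff


class Feed:
    def __init__(self, threshold: int = CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)         # author -> follower ids
        self.following = defaultdict(set)         # user -> authors followed
        self.timelines = defaultdict(list)        # user -> pushed post ids
        self.celebrity_posts = defaultdict(list)  # author -> post ids

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def publish(self, author: str, post_id: str) -> int:
        """Store a post; return the number of timeline rows written."""
        fans = self.followers[author]
        if len(fans) >= self.threshold:
            # Large account: one write now, followers pull it at read time.
            self.celebrity_posts[author].append(post_id)
            return 0
        for user in fans:
            # Ordinary account: push to every follower's timeline.
            self.timelines[user].append(post_id)
        return len(fans)

    def read(self, user: str) -> list:
        """Pushed entries plus posts pulled from followed large accounts."""
        pulled = [
            post
            for author in self.following[user]
            for post in self.celebrity_posts.get(author, [])
        ]
        return self.timelines[user] + pulled
```

The trade is explicit: a post from a large account costs zero timeline writes at publish time, and every reader who follows that account pays a small extra lookup instead.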
Candidate SPOFs:
- The central user-auth service. Failure blocks all requests. Remediation: multi-region active-active; token caching so momentary outages do not cascade.
- The ID allocator for posts. If centralized, death stops all writes globally. Remediation: distributed range allocation (each service instance pre-allocates 10 K IDs).
- A shared Redis cluster used by both reads and async workers. Failure affects multiple independent subsystems. Remediation: separate clusters by blast domain.
- The Kafka cluster for fan-out. If it is down, derived views stop updating. Accept-for-now; in-place replacement plan documented; consumer catch-up is bounded.
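The range-allocation remediation for the ID allocator might look like the sketch below. `RangeAuthority` is a stand-in for whatever durable service hands out blocks; both class names are hypothetical, and the sketch ignores persistence of the authority's counter.

```python
import threading


class RangeAuthority:
    """Stand-in for the central allocator; contacted once per block."""

    def __init__(self, block_size: int = 10_000):
        self.block_size = block_size
        self._next_start = 0
        self._lock = threading.Lock()
        self.calls = 0  # how often instances had to reach the authority

    def lease(self):
        """Hand out the next half-open range [start, end)."""
        with self._lock:
            start = self._next_start
            self._next_start += self.block_size
            self.calls += 1
            return start, start + self.block_size


class LocalIdAllocator:
    """Per-instance allocator that serves IDs from a leased block."""

    def __init__(self, authority: RangeAuthority):
        self._authority = authority
        self._next = 0
        self._end = 0  # empty range forces a lease on first use
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            if self._next >= self._end:
                # Block exhausted: the only moment the authority is needed.
                self._next, self._end = self._authority.lease()
            value = self._next
            self._next += 1
            return value
```

If the authority is unavailable, each instance keeps writing until its current block runs out, which turns a hard global stop into a bounded grace period. The costs are that IDs are no longer globally ordered by creation time and that a restarted instance leaves a gap for the unused part of its block.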
For a URL shortener, typical list:
- Bottleneck: cache hit rate under long-tail load (many unique redirects). Remediation: larger cache; tiered cache (edge CDN + regional Redis); allow brief increase in origin read load.
- Bottleneck: origin DB at miss-heavy traffic. Remediation: read replicas; per-shard partitioning on short_code.
- SPOF: single cache cluster. Remediation: per-region cache cluster; origin sized to survive one region's cache-cold.
- SPOF: ID allocator for short codes. Remediation: distributed allocation with collision retry or per-instance ID ranges.
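For the short-code allocator, the collision-retry option can be sketched as follows. The function names, code length, and attempt limit are assumptions, and a plain dict stands in for the real store; in production the check and the insert must be one atomic insert-if-absent (for example a unique constraint), which a dict does not give you across processes.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters


def random_code(length: int = 7) -> str:
    """Random base-62 code; no coordination with other instances needed."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def allocate_short_code(store, url, length=7, max_attempts=5, gen=random_code):
    """Try random codes until one is free, up to max_attempts.

    `store` is a dict here as a stand-in; the check-then-insert below
    is NOT atomic and only illustrates the retry loop.
    """
    for _ in range(max_attempts):
        code = gen(length)
        if code not in store:
            store[code] = url
            return code
    raise RuntimeError(f"no free short code after {max_attempts} attempts")
```

Random allocation removes the central allocator entirely, at the price of a retry path whose cost grows as the code space fills up. Per-instance ID ranges avoid retries but keep a (rarely contacted) central component.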
Concrete Example 2: Payments Service
Stress-test output for a payments service handling card authorizations and merchant settlement:
Candidate bottlenecks:
- Synchronous call to the card network on the auth hot path. Each auth blocks a request thread for 200-800 ms. Remediation: strict per-network timeout (1 s), retries fired in parallel rather than serially, and a circuit breaker that trips when the network is slow.
- Audit-log write amplification: every auth fans out to compliance log + fraud log + risk log. Remediation: single append-only event; derived logs built downstream via CDC.
- Ledger DB writes: serializable isolation for balance updates bottlenecks at ~2 K tx/s per shard. Remediation: shard by account_id; move heavy reporting reads to replicas; batch non-critical writes.
- Fraud-model inference cost on the hot path (~20 ms each). Remediation: pre-score during quiet periods; cache by (customer, merchant, amount-band); fall back to a cheap model under load.
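The circuit breaker named in the card-network remediation can be sketched as below. This is a minimal single-threaded version with hypothetical names and thresholds; it assumes the per-network timeout is enforced inside the wrapped call, which raises on timeout.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Fail fast instead of tying up a request thread.
                raise CircuitOpenError("card network marked unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The point for the bottleneck list is that an open breaker converts 200-800 ms of blocked thread time into an immediate error, so a slow network degrades one class of auths instead of exhausting the request pool. A production version would need locking and per-network instances.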
Candidate SPOFs:
- Card-network gateway (3rd-party): whole class of auths blocked if it fails. Remediation: multi-provider failover with per-provider circuit breakers; accept-with-reason for the tail of card types served by only one provider, written in the doc.
- Single-region Kafka cluster for settlement events. Failure loses no data (replicated) but blocks settlement. Remediation: cross-region mirror for disaster recovery.
- Central encryption key service (HSM-backed). Entire auth path depends on it. Remediation: short-lived DEK cache per service; a key-rotation procedure that exercises the failover quarterly.
- Centralized idempotency-key store: every auth writes here. Remediation: partition by key prefix; local LRU cache with bounded staleness.
Each entry is written with the fix-now vs accept-with-reason decision attached. The accept-with-reason lines are the ones reviewers probe hardest; having the written reasoning is what converts the probe from "gotcha" into "ok, that's fine".
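As one illustration of a remediation from the SPOF list, the short-lived DEK cache could be sketched like this. `DekCache`, `fetch_key`, and the TTL are hypothetical; the sketch omits key rotation and any protection of key material in memory, both of which a real implementation would need.

```python
import time


class DekCache:
    """Per-service cache of data-encryption keys with a short TTL."""

    def __init__(self, fetch_key, ttl_s=300.0, clock=time.monotonic):
        self._fetch_key = fetch_key  # hypothetical call to the key service
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key_id -> (key_bytes, expires_at)

    def get(self, key_id):
        now = self._clock()
        entry = self._entries.get(key_id)
        if entry is not None and now < entry[1]:
            # Fresh entry: no call to the key service.
            return entry[0]
        # Missing or expired: this raises if the key service is down.
        key = self._fetch_key(key_id)
        self._entries[key_id] = (key, now + self._ttl_s)
        return key
```

The TTL is the written trade-off: it bounds how long the auth path survives a key-service outage and, equally, how long a revoked key can remain in use. That sentence is the kind of reasoning an accept-with-reason line should carry.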
Common Confusion / Misconceptions
"If we have horizontal scale, there is no bottleneck." There is always a bottleneck; it moves. Horizontal scale moves the bottleneck from CPU to coordination, from coordination to network, from network to the central allocator, from the central allocator to the storage IOPS. Name the current one.
"If we have replicas, it is not a SPOF." A replica set is still a SPOF at the cluster level if every replica fails from the same cause (software bug, shared dependency, cascading load). The test is "can this whole thing fail at once?"
"SPOFs are only hardware." Services with single-instance dependencies (a centralized config store, a central ID service, a single team's approval-required migration tool) are SPOFs even if they are software.
"We'll fix SPOFs later." Some, yes. But every accepted SPOF should be in the doc, with its business risk written down. "We accepted this SPOF because it has been up for 4 years and the business can tolerate 10 minutes of downtime per failure" is a sentence a reviewer can engage with.
"Bottleneck and SPOF are the same list." They are correlated but different. The slowest component is often not the most dangerous failure. A blazing-fast dispatcher with no replica is a worse SPOF than a slow but redundant analytics query path. Rank them separately.
How To Use It
Per design, produce two lists:
Bottleneck list (ranked by current + projected impact per dollar of remediation):
For each: the component, the constraint, the remediation, the cost.
SPOF list (ranked by blast radius × probability × recovery time):
For each: the component, the failure effect, the remediation or the accept-with-reason.
Put both at the end of the design doc. A reviewer reading for 90 seconds should be able to find both lists.
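One way to keep both lists honest is a small check that no entry reaches the doc without its decision and its written justification. The sketch below is an assumption about how you might structure that; `RiskEntry` and `problems` are hypothetical names, and the decision labels are the three used in this module.

```python
from dataclasses import dataclass

DECISIONS = {"fix-now", "fix-in-phase-2", "accept-with-reason"}


@dataclass
class RiskEntry:
    kind: str         # "bottleneck" or "spof"
    component: str
    effect: str       # the constraint, or the failure effect
    decision: str
    detail: str = ""  # remediation and cost, or the written reason


def problems(entries):
    """Return one message per entry that is not ready for review."""
    issues = []
    for entry in entries:
        if entry.decision not in DECISIONS:
            issues.append(f"{entry.component}: unknown decision {entry.decision!r}")
        elif not entry.detail.strip():
            what = "reason" if entry.decision == "accept-with-reason" else "remediation"
            issues.append(f"{entry.component}: missing {what}")
    return issues


entries = [
    RiskEntry("spof", "Kafka cluster for fan-out", "derived views stop updating",
              "accept-with-reason", "replacement plan documented; catch-up is bounded"),
    RiskEntry("spof", "ID allocator", "all writes stop globally", "accept-with-reason"),
]
for message in problems(entries):
    print(message)
```

Here the ID allocator entry is flagged because it is accepted without a written reason, which is exactly the line a reviewer would probe.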
Transfer / Where This Shows Up Later
- Cluster 5 concept 14 (trade-offs) converts every "accept-with-reason" line here into a formal trade-off statement with rejected alternatives and cost.
- Cluster 5 concept 15 (design doc) mandates a dedicated "Bottlenecks and SPOFs" section -- this is its content.
- S8M4 (scale, reliability, performance) turns the bottleneck list into concrete load-test scenarios and capacity plans.
- S8M5 (technical leadership) turns the SPOF list into risk-register entries that get tracked across quarters.
- S9 (cloud + DevOps) implements remediations (multi-AZ, multi-region, managed services with built-in HA) and wires SLO-driven alerts to the SPOF list so that regressions are caught before users feel them.
- S10 capstone/interviews: "what's the SPOF?" is the single most common closing question. Having the ranked list already in the doc is the signal a senior-level candidate produces reliably.
Check Yourself
- Give one example of a bottleneck that is not a SPOF, and one example of a SPOF that is not a bottleneck.
- Why is a very-low-latency component (e.g., an L7 LB) often the most dangerous SPOF, even though it rarely fails?
- What would you put in writing to defend an accepted SPOF?
- Name a "soft SPOF" in a typical web architecture (a component that is replicated but still fails as a unit) and describe the failure mode.
Mini Drill or Application
For each of your designs from previous clusters:
- List the top 3 bottlenecks and top 3 SPOFs.
- For each, choose one of: fix-now, fix-in-phase-2, accept-with-reason.
- For every "accept-with-reason" entry, write the sentence you would put in the design doc and defend in review.
Compare your list to a peer's for the same design if possible. Overlap suggests your judgment is calibrated; divergence suggests the two of you are still pattern-matching on different things.
Read This Only If Stuck
- System Design Primer: Availability patterns -- fail-over classes and their blast radius.
- System Design Primer: Load balancer -- the LB as canonical SPOF candidate and its HA topologies.
- System Design Primer: Database federation and sharding -- shard-level failure vs global failure.
- System Design Primer: Performance vs scalability -- the vocabulary for ranking bottlenecks.
- System Design Primer: Real-world architectures -- how real systems document their SPOFs.
- Fundamentals: Measuring architecture characteristics -- turning "availability" into a measurable constraint that SPOFs directly violate.
- Fundamentals: Architecture characteristics ratings (reliability, fault tolerance) -- scoring different styles on their SPOF profiles.
- AWS Well-Architected Framework -- Reliability Pillar -- the industry-standard checklist for SPOF discovery.
- Amazon Builders' Library -- Static stability using Availability Zones -- how to design so an AZ-level SPOF does not cascade.
- Google SRE Workbook -- Managing risk -- how to rank SPOFs/bottlenecks against an error budget.
- High Scalability -- case studies where specific SPOFs caused the dominant outage.