
Load Shedding, Rate Limiting, and Admission Control

What This Concept Is

When arrivals exceed capacity, you have three families of tools.

  • Rate limiting: cap the rate at which a caller may submit requests. Usually applied per-client (per API key, per user, per IP) using a token bucket or leaky bucket. Purpose: fairness and protection from abuse.
  • Load shedding: under system-level overload, drop some fraction of incoming requests rather than serve all of them slowly. Purpose: keep the service healthy for the requests you do accept.
  • Admission control: decide, at the entry point, whether to accept a request at all based on queue depth, expected processing cost, or priority. Purpose: protect internal SLOs by rejecting work whose latency cost exceeds its value.

The three overlap. A rate-limited 429 is a form of shed load. An admission-control policy may implement rate limiting as a sub-case. The distinction is intent: rate limiting protects you from one abusive client; load shedding protects the system from aggregate overload; admission control expresses a priority policy.

Why It Matters Here

The math of the previous concept says: past ~80% utilization, queueing delay climbs steeply, so adding more work makes everything slower. When load is rising, your choices are:

  1. Accept more work and deliver it slowly (everyone suffers, including requests that have already been accepted).
  2. Reject some work immediately so accepted work completes promptly.

Option 2 is almost always correct. The Google SRE Workbook makes it explicit: a slow response is often worse than a fast rejection, because while it waits the client keeps holding its own resources (connections, threads, memory) and cannot retry against a healthy peer.

Without these tools, overload cascades: clients retry, queues grow, timeouts fire, circuit breakers trip, the outage widens. With them, overload is bounded to "some users got 429s" - recoverable and observable.

Concrete Example

Token bucket rate limiter. A bucket holds up to B tokens and refills at R tokens per second. Each request takes 1 token. If the bucket is empty, the request is rejected (or waits).

  • B = 100, R = 10/s: a user can burst 100 requests, then sustain 10/s. This matches most public APIs (short bursts OK, long-run rate bounded).
  • B = 1, R = 10/s: strict 10/s with no bursts; accepted requests end up spaced at least 100 ms apart.
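
The token bucket described above can be sketched in a few lines. This is a minimal single-threaded sketch, not a production limiter: the class name TokenBucket, the allow method, and the injectable clock parameter are illustrative choices, and a real per-client limiter would keep one bucket per API key and guard it with a lock or an atomic store.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens (B) and refills at `refill_rate` tokens/second (R)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost=1.0):
        """Spend `cost` tokens and return True if available; otherwise return False."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill lazily from elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With capacity=100 and refill_rate=10, the first 100 calls to allow() in a tight loop succeed and later calls succeed at roughly 10 per second. With capacity=1, only one call can succeed per 100 ms.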

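Concurrency limit (Netflix's adaptive concurrency). Instead of rate, cap the number of in-flight requests. Little's Law says L = lambda * W, where L is the number of requests in the system, lambda is throughput, and W is the average time a request spends in the system. If you cap L, then for a given throughput lambda you bound the latency: W = L / lambda. When the limit is reached, reject or queue briefly. This adapts better than a fixed RPS cap to variable request costs, because expensive requests hold their slots longer and automatically lower the admitted rate.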
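
A fixed (non-adaptive) version of a concurrency limit might look like the sketch below. It only shows the in-flight cap; adjusting the limit from observed latency, as adaptive limiters do, is left out. The names ConcurrencyLimiter, try_acquire, release, and handle are hypothetical, and the (status, result) return shape is an assumption for illustration.

```python
import threading


class ConcurrencyLimiter:
    """Caps the number of requests in flight; rejects instead of queueing."""

    def __init__(self, limit):
        self.limit = limit
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Take a slot if one is free. Never blocks."""
        with self._lock:
            if self.in_flight >= self.limit:
                return False
            self.in_flight += 1
            return True

    def release(self):
        with self._lock:
            self.in_flight -= 1


def handle(limiter, work):
    """Run `work` if a slot is free; otherwise reject immediately with 429."""
    if not limiter.try_acquire():
        return 429, None
    try:
        return 200, work()
    finally:
        limiter.release()  # always free the slot, even if `work` raises
```

As a worked instance of Little's Law: with a limit of 50 in flight and a sustained throughput of 500 requests/s, the average time in system is bounded at W = 50 / 500 = 0.1 s.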

Load shedding with priority. Classify requests into tiers: user-critical (logins, payments), user-important (product page), internal (metrics scraping, low-priority jobs). Under overload, shed low-priority first. Example: at 80% CPU, start rejecting the internal tier; at 90%, also reject the user-important tier; never shed user-critical. This is how Netflix's prioritized load-shedding works - retention is better preserved by protecting the user path.
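
The tiered example above reduces to a small lookup. The sketch below uses the 80% and 90% CPU thresholds from the text; the names Tier, SHED_THRESHOLDS, and should_shed are illustrative, and how CPU utilization is measured is assumed to be handled elsewhere.

```python
from enum import IntEnum


class Tier(IntEnum):
    USER_CRITICAL = 0   # logins, payments
    USER_IMPORTANT = 1  # product page
    INTERNAL = 2        # metrics scraping, low-priority jobs


# CPU utilization (0.0 to 1.0) at which each tier starts being shed.
# USER_CRITICAL has no entry on purpose: it is never shed.
SHED_THRESHOLDS = {
    Tier.INTERNAL: 0.80,
    Tier.USER_IMPORTANT: 0.90,
}


def should_shed(tier, cpu_utilization):
    """Return True if a request in `tier` should be rejected at this CPU level."""
    threshold = SHED_THRESHOLDS.get(tier)
    return threshold is not None and cpu_utilization >= threshold
```

At 85% CPU this sheds only the internal tier; at 95% it sheds internal and user-important requests and still admits user-critical ones.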

Admission control with expected cost. The server estimates the cost of each request (e.g., "search with 5 filters costs ~200ms") and rejects if the estimated cost would violate a per-request SLO. Kubernetes, database query planners, and some CDN edges implement variants of this.
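
A cost-based admission check could be sketched as follows. The linear cost model and its constants are invented for illustration (they are chosen so that 5 filters comes out at the ~200 ms mentioned above), and adding the current queue wait to the estimate is an assumption about how the SLO comparison is made. The function names are hypothetical.

```python
def estimate_cost_ms(num_filters, base_ms=50.0, per_filter_ms=30.0):
    """Hypothetical linear cost model: a fixed base plus a cost per search filter."""
    return base_ms + per_filter_ms * num_filters


def admit(num_filters, queue_wait_ms, slo_ms):
    """Admit only if expected queue wait plus estimated processing fits the SLO."""
    return queue_wait_ms + estimate_cost_ms(num_filters) <= slo_ms
```

Under these assumptions, a 5-filter search is admitted against a 500 ms SLO when the queue wait is 100 ms (300 ms total) and rejected when the queue wait is 400 ms (600 ms total).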

Common Confusion / Misconception

"Rate limiting protects our service." Only partly. A per-client rate limit stops one client from overwhelming you; it does not stop 1000 clients from collectively arriving at 1000 * limit. For aggregate overload, you need system-level load shedding.

"Dropping requests is a bad user experience." Slow requests are worse. A 429 with a Retry-After header lets the client back off, queue locally, or render a graceful degradation. A request that hangs for 30 seconds and then fails costs the user the same wait and costs you the resources, with nothing to show for either.

"We'll increase the thread pool to absorb spikes." The thread pool is just an in-process queue. A bigger pool admits more in-flight work at the same throughput, so under overload each request spends longer in the system (Little's Law). Bounded pool + shed excess beats unbounded pool + silent latency.

How To Use It

For every externally-exposed surface:

  1. Rate-limit per-caller. API key, user, IP. Publish the limit. Return 429 with Retry-After when exceeded. This is the cheapest protection against abuse.
  2. Shed load system-wide. Pick a signal (CPU, queue depth, latency, error rate) and a threshold. Above the threshold, reject a fraction of new requests before they reach the expensive path. Prefer concurrency-based limits (adaptive) over fixed RPS.
  3. Prioritize. Not all requests are equal. Your health check and an internal cron job should not contend with a user login.
  4. Return fast. A dropped request should consume far fewer resources than a served one. If the drop path does heavy logging, JSON serialization, or DB lookups, it is not protecting you.
  5. Observe the shed. Instrument how many requests you shed, by tier, by reason. A silent shed is indistinguishable from a silent bug.
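
Steps 4 and 5 can be combined in a cheap rejection path that counts every shed. This is a framework-agnostic sketch: the (status, headers, body) tuple, the reject function, and the in-process shed_counts counter are assumptions for illustration, and a real service would export the counts to its metrics system instead.

```python
from collections import Counter

# Sheds counted by (tier, reason); stands in for a real metrics client.
shed_counts = Counter()


def reject(tier, reason, retry_after_s=1):
    """Build a minimal 429 response and record the shed. No logging, no DB, no JSON."""
    shed_counts[(tier, reason)] += 1
    headers = {"Retry-After": str(retry_after_s)}  # delay in seconds
    return 429, headers, b""
```

The rejection does one counter increment and builds one small dict, so a shed request stays far cheaper than a served one while still being visible by tier and reason.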

Check Yourself

  1. Why is a 429 better than a successful-but-slow response under overload?
  2. Name one advantage of a concurrency limit over an RPS limit.
  3. What is the difference between rate limiting and load shedding?

Mini Drill or Application

Pick a service. Write its overload playbook: per-client rate limit value, system-wide shed threshold and signal, priority tiers, expected 429 rate at headroom exhaustion, alert when shed rate exceeds baseline. Run through "what happens at 2x load" mentally.

Read This Only If Stuck