Caching Strategies at Every Layer

What This Concept Is

A cache stores the result of an expensive computation or fetch so subsequent identical requests can skip the work. Caches exist at every layer of a modern stack:

Client cache: browser cache, OS-level HTTP cache, mobile app cache. Free compute; cannot be invalidated from the server.
CDN cache: edge PoPs that serve static (and sometimes dynamic) content close to the user. Millisecond-scale hits; global inconsistency windows.
Reverse-proxy / web-server cache: Nginx, Varnish, HAProxy serving recently fetched responses. Single-digit-millisecond hits.
Application cache: in-process memoization and shared in-memory stores (Memcached, Redis). Sub-millisecond in-process, low-millisecond over network.
Database cache: the DB's own buffer pool and, in engines that have one, query cache (MySQL removed its query cache in 8.0). Hidden from the application; still one of the largest performance levers.

On top of the "where" question is a "how" question: four update patterns govern the consistency of the cache.

Cache-aside (lazy loading): the application reads from cache; on miss, it reads from the DB and populates the cache. Writes go to the DB only. AWS calls this "the most prevalent form of caching" and recommends it as "the foundation of any good caching strategy" - it is the default unless you have a reason to pick something else.
Write-through: the application writes to the cache; the cache synchronously writes to the DB. Reads are always fresh.
Write-behind (write-back): the application writes to the cache, which asynchronously flushes to the DB.
Refresh-ahead: the cache proactively refreshes likely-to-be-read entries before TTL expires.

Each layer and each pattern has a distinct failure mode. Cache invalidation and naming are Phil Karlton's two hard problems in computer science for a reason.

Why It Matters Here

Caching is the single largest performance lever in most systems. A hit rate of 95% against a 1ms cache and a 50ms database produces effective latency of 0.95*1 + 0.05*50 = 3.45ms - roughly a 15x improvement. At 99% hit rate against a 100ms dependency, effective latency is ~2ms - a 50x improvement. Caching turns a DB-bound service into a cache-bound service, which is almost always cheaper.
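
The arithmetic above is a weighted average, and a few lines of code make it easy to try other numbers. This is a minimal sketch; the function name effective_latency_ms is made up for illustration, and it uses the same simplification as the text (a miss costs only the origin latency, not the cache lookup plus the origin).

```python
def effective_latency_ms(hit_rate: float, cache_ms: float, origin_ms: float) -> float:
    """Mean latency when a fraction `hit_rate` of requests is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_ms + (1.0 - hit_rate) * origin_ms


# The two scenarios from the text: (hit rate, origin latency in ms), 1ms cache.
for hit_rate, origin_ms in [(0.95, 50.0), (0.99, 100.0)]:
    mean = effective_latency_ms(hit_rate, 1.0, origin_ms)
    print(f"hit={hit_rate:.0%} origin={origin_ms:.0f}ms "
          f"mean={mean:.2f}ms speedup={origin_ms / mean:.1f}x")
```

Note that this is a mean. It says nothing about the tail, which is still set by the miss path.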

But caching also introduces:

Staleness: the user sees data that was correct five minutes ago.
Thundering herds (also called dog-piling): a popular key expires and 10,000 concurrent requests hit the DB to refill it. AWS describes the same failure under a new-cache-node scenario: "the new cache node's memory is empty... your database might suddenly be swamped with a series of identical queries."
Write amplification: write-through doubles writes; write-behind risks loss on cache failure.
Hidden coupling: the cache becomes a required dependency; losing it is an outage, not a degradation.
Eviction surprise: LRU/LFU policies evict under memory pressure; your 99% hit rate can collapse in a minute when a new key pattern sweeps through memory.
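
Eviction surprise is easy to reproduce in miniature. The sketch below is a toy LRU cache built on OrderedDict, not any real cache server; the class and helper names are invented for the example. A one-off sweep of never-repeated keys pushes out the entire hot set, and the hit rate goes from 100% to 0%.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache with a fixed number of slots."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used


def hit_rate(cache, keys):
    """Replay `keys` with read-through fill and return the fraction of hits."""
    hits = 0
    for key in keys:
        if cache.get(key) is not None:
            hits += 1
        else:
            cache.put(key, f"value-for-{key}")
    return hits / len(keys)


cache = LRUCache(capacity=100)
hot = [f"hot:{i}" for i in range(100)]
hit_rate(cache, hot)                      # warm the cache
steady = hit_rate(cache, hot * 10)        # hot set fits: every read hits
hit_rate(cache, [f"scan:{i}" for i in range(100)])  # one-off sweep of new keys
after_scan = hit_rate(cache, hot)         # hot set was evicted: every read misses
print(steady, after_scan)
```

Frequency-aware policies (LFU variants) are one common defense against this kind of scan; whether they help depends on your access pattern.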

You cannot write a scaling plan that ignores caching. You also cannot write one that treats the cache as magic.

Concrete Example

A product page is read 10,000 times per second and updated twice per day.

Cache-aside (Memcached). First request misses; fetches from DB (50ms); stores in cache with 1-hour TTL. The next ~36M requests (10,000 per second for 3,600 seconds) hit cache (<1ms). When the product is updated, the app must either (a) delete the cache key, so the next read repopulates it, or (b) let the TTL expire (up to one hour of stale reads). This is the System Design Primer's cache-aside pseudocode and AWS's lazy-caching pattern almost exactly. Risk: on cache node failure, all 10k req/s hammer the DB - a thundering herd.
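
A minimal sketch of that read path and option (a), delete on update. Everything here is a stand-in: TTLCache is a tiny in-process substitute for Memcached, db is a dict playing the database, and get_product / update_product are hypothetical function names.

```python
import time


class TTLCache:
    """In-process stand-in for a cache server: get/set with TTL, delete."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._entries[key] = (value, self._clock() + ttl_seconds)

    def delete(self, key):
        self._entries.pop(key, None)


db = {42: {"name": "kettle", "price": 30}}  # hypothetical "database"
db_reads = 0
cache = TTLCache()


def get_product(product_id):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    global db_reads
    key = f"product:{product_id}"
    product = cache.get(key)
    if product is None:
        db_reads += 1
        product = dict(db[product_id])
        cache.set(key, product, ttl_seconds=3600)
    return product


def update_product(product_id, **fields):
    """Write to the DB only, then delete the key so the next read refills it."""
    db[product_id].update(fields)
    cache.delete(f"product:{product_id}")


get_product(42)                    # miss: reads the DB
get_product(42)                    # hit
update_product(42, price=25)       # invalidates the key
print(get_product(42)["price"], db_reads)
```

Deleting the key rather than overwriting it keeps the write path simple; the cost is one extra miss after each update.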

Write-through. The app writes the updated product to the cache, which synchronously writes to the DB. Reads from cache are always fresh. Risk: every write is now slower (synchronous double write); a product that is written but never read still occupies cache memory. AWS notes "it can result in lots of cache churn if certain records are updated repeatedly" and "the cache can be filled with unnecessary objects that aren't actually being accessed" - wasted memory.
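
A minimal write-through sketch, with plain dicts standing in for both the cache and the DB. WriteThroughCache is an invented class name. One design choice to flag: this sketch persists to the DB before updating the cache, so a failed DB write cannot leave the cache ahead of the source of truth; other orderings exist.

```python
class WriteThroughCache:
    """Every write goes to the backing store and the cache in the same call."""

    def __init__(self, db):
        self._db = db      # dict standing in for the database
        self._cache = {}

    def write(self, key, value):
        self._db[key] = value      # synchronous write to the store
        self._cache[key] = value   # cache is updated in the same call

    def read(self, key):
        if key in self._cache:
            return self._cache[key]
        value = self._db[key]      # miss (e.g. written before the cache existed)
        self._cache[key] = value
        return value

    def cached_keys(self):
        return set(self._cache)


db = {}
products = WriteThroughCache(db)
products.write("product:1", {"price": 30})
products.write("product:2", {"price": 99})  # written, never read below
products.read("product:1")
print(sorted(products.cached_keys()))
```

Both keys end up cached even though only one was ever read, which is the wasted-memory risk described above.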

Write-behind. The app writes to the cache; a background flush persists to the DB. Writes feel instant; reads from the cache are never stale. Risk: if the cache crashes before the flush, the update is lost. Requires a durable cache or a durable write log.
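
The loss window is easier to see in code. In this sketch (invented names, dicts as stand-ins, and an explicit flush() call in place of a real background worker), a write is acknowledged as soon as it is in the cache, and the DB only catches up at flush time.

```python
class WriteBehindCache:
    """Writes land in the cache first; flush() persists them later."""

    def __init__(self, db):
        self._db = db        # dict standing in for the database
        self._cache = {}
        self._dirty = {}     # key -> latest value not yet persisted

    def write(self, key, value):
        self._cache[key] = value
        self._dirty[key] = value   # acknowledged before the DB sees it

    def read(self, key):
        if key in self._cache:
            return self._cache[key]
        return self._db.get(key)

    def flush(self):
        """Persist all pending writes; returns how many keys were flushed."""
        flushed = len(self._dirty)
        self._db.update(self._dirty)
        self._dirty.clear()
        return flushed

    def pending(self):
        return len(self._dirty)


db = {"product:42:price": 30}
cache = WriteBehindCache(db)
cache.write("product:42:price", 25)
print(cache.read("product:42:price"), db["product:42:price"])  # cache is ahead of the DB

# Simulate a cache crash before any flush: a fresh, empty cache replaces it.
cache = WriteBehindCache(db)
print(cache.read("product:42:price"))  # the unflushed update is gone
```

The number returned by pending() is, in effect, how many writes you would lose if the cache died right now.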

Refresh-ahead. Minutes before TTL expires, the cache refreshes the entry on its own. Risk: wastes work refreshing entries nobody will read again; amplifies load on the DB for inaccurate predictions.

Russian-doll caching (Rails pattern). Nested records are cached independently with keys that include version numbers; the parent's cache key is a collection of child keys. Invalidating a child invalidates only its ancestors. Popularized by the Basecamp team for long-lived user-generated content.
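
The nesting can be sketched outside Rails. This is an illustration of the idea, not Rails' implementation: the Post and Comment classes, the integer version field, and the key format are all assumptions made for the example. Editing a child bumps its version and "touches" the parent, so only the changed child and its ancestors get new keys; sibling fragments are reused.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    id: int
    body: str
    version: int = 1


@dataclass
class Post:
    id: int
    title: str
    comments: list = field(default_factory=list)
    version: int = 1


fragments = {}  # fragment cache: key -> rendered text
renders = []    # keys that actually had to be rendered


def cached(key, render):
    if key not in fragments:
        renders.append(key)
        fragments[key] = render()
    return fragments[key]


def render_comment(c):
    return cached(f"comment/{c.id}-v{c.version}", lambda: f"<li>{c.body}</li>")


def render_post(p):
    return cached(
        f"post/{p.id}-v{p.version}",
        lambda: f"<h1>{p.title}</h1><ul>"
        + "".join(render_comment(c) for c in p.comments)
        + "</ul>",
    )


def edit_comment(post, comment, body):
    comment.body = body
    comment.version += 1  # the child gets a new key
    post.version += 1     # "touch" the parent so its key changes too


post = Post(1, "Kettles", [Comment(1, "nice"), Comment(2, "meh"), Comment(3, "ok")])
render_post(post)                       # cold: renders the post and all 3 comments
edit_comment(post, post.comments[1], "great")
renders.clear()
html = render_post(post)                # only the post and comment 2 re-render
print(renders)
```

Old keys are never deleted in this scheme; they simply stop being requested and are left for the cache's eviction policy to reclaim.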

Numbers from a real scenario. A product service at 10k RPS with p99 DB latency 50ms and p99 cache latency 1ms:

Uncached p99: 50ms.
At 95% hit rate: effective mean latency 0.95*1 + 0.05*50 = 3.45ms; effective p99 often dominated by the 5% miss tail, so p99 may still be near the DB's p99 (50ms). Hit rate matters for mean; miss-path capacity matters for tail.
At 99% hit rate with a 200ms DB p99 on misses: effective mean 0.99*1 + 0.01*200 = 2.99ms, but the 1% of users still see 200ms. The path you cannot escape is the miss path; cache plans must include its headroom.
Storage: 10M keys * ~500 bytes each = ~5GB in Redis; trivial. The expensive cache is the busy one, not the large one.
Thundering-herd math: if one hot key expires, 10k RPS flow to the DB simultaneously until refill; a single-flight lock caps this to 1, costing ~50ms to 1 request and serving stale (or briefly blocked) responses to the rest.
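
A single-flight lock can be sketched with the standard threading module. This is one possible shape under stated assumptions, not a library API: SingleFlight, _Call, and load_from_db are invented names, the 0.2s sleep stands in for a slow DB query, and the sketch only deduplicates concurrent loads (a real caller would also write the result into the cache).

```python
import threading
import time


class _Call:
    """Bookkeeping for one in-flight load of a key."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """One caller per key runs the loader; concurrent callers wait and share its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, loader):
        with self._lock:
            call = self._inflight.get(key)
            is_leader = call is None
            if is_leader:
                call = _Call()
                self._inflight[key] = call
        if not is_leader:
            call.done.wait()            # follower: block until the leader finishes
        else:
            try:
                call.result = loader()  # leader: the only trip to the origin
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        if call.error is not None:
            raise call.error
        return call.result


db_calls = 0
db_calls_lock = threading.Lock()


def load_from_db():
    global db_calls
    with db_calls_lock:
        db_calls += 1
    time.sleep(0.2)  # pretend the query is slow
    return {"id": 42, "name": "kettle"}


flight = SingleFlight()
start = threading.Barrier(50)  # release all workers at once, like a key expiring
results = []


def worker():
    start.wait()
    results.append(flight.do("product:42", load_from_db))


threads = [threading.Thread(target=worker) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(db_calls, len(results))
```

In this run 50 concurrent readers produce one DB call. Followers here block until the leader returns; serving them the previous (stale) value instead is the other option the text mentions.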

Common Confusion / Misconception

"Redis is a cache." Redis can be a cache. It is also sometimes used as a session store, a queue, a leaderboard, and a primary data store with persistence. Which role it plays changes how you treat it during an outage. If Redis holds a write-behind cache with unflushed writes, a Redis outage is data loss, not cache invalidation.

"We need strong consistency, so no cache." Sometimes true. More often, you want bounded staleness: "within one minute of the DB." Cache-aside with a short TTL gives you that, cheaply. AWS explicitly recommends this for rapidly-changing data: "rather than adding write-through caching or complex expiration logic, just set a short TTL of a few seconds."

"We have a cache, so scaling is solved." Caches mask load; they do not eliminate it. Unless every layer degrades gracefully when cache hit rate drops (e.g., after a deploy that empties the cache), you still have a scaling problem - you just hid it. "Prewarm the cache" before enabling a new node is a standard AWS recommendation for exactly this reason.

"The same TTL everywhere is fine." No. If every key in a cache expires within a minute, you get a synchronized thundering herd. AWS's fix is one line of code: ttl = 3600 + (rand() * 120) - add jitter so expirations spread across the window.

"Cache hit rate is our KPI." Hit rate is a cost metric, not a correctness metric. A cache with 99.9% hit rate that serves 5-minute-old data when the SLO requires 30-second freshness is failing. Measure staleness alongside hit rate.

How To Use It

For every hot read path:

Pick the highest layer where the cache is correct. A static asset goes on the CDN; user-specific data usually cannot. Do not miss the free tier (HTTP Cache-Control and ETag headers) before adding infrastructure.
Pick the update pattern that matches the read/write ratio and staleness tolerance. Heavy read, low write, some staleness OK -> cache-aside. Write-heavy, read-often, must be fresh -> write-through. Write-heavy, loss-tolerant -> write-behind. AWS: use lazy caching as the foundation; apply write-through "as a targeted optimization."
Set TTLs explicitly and with jitter. A cache without a TTL is a slow leak toward wrong answers. A cache with identical TTLs is a synchronized thundering herd waiting to happen.
Plan the invalidation strategy. Event-driven invalidation is best; TTL-only is cheapest; manual invalidation scales poorly.
Protect against thundering herds: single-flight locks (only one request per key goes to origin), jittered TTLs, serve-stale-while-revalidate, prewarming on new node introduction.
Design the cache-miss path so you do not die when it runs: rate-limit DB fetches, pre-warm on deploy, use circuit breakers between the cache-miss path and the DB.
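Know your eviction policy. Open-source Redis defaults to noeviction (writes fail with an error once maxmemory is reached); managed services may ship a different default such as volatile-lru (LRU among keys with a TTL). Alternatives include allkeys-lru, allkeys-lfu, volatile-ttl. Match the policy to your access pattern.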
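
The "free tier" in the first step is mostly a matter of sending the right headers. As a rough sketch, the function below (conditional_response is an invented name, and it is framework-agnostic on purpose) derives an ETag from the body, sets Cache-Control, and answers 304 Not Modified when the client already holds the current version. It is simplified: a real If-None-Match header can carry several tags and weak validators, which this exact-match comparison ignores.

```python
import hashlib


def conditional_response(body: bytes, request_headers: dict, max_age: int = 300):
    """Return (status, headers, payload) for a cacheable GET response."""
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {
        "ETag": etag,
        "Cache-Control": f"public, max-age={max_age}",
    }
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""  # client copy is current: send no body
    return 200, headers, body


page = b"<html>product 42</html>"
status, headers, payload = conditional_response(page, {})
print(status, headers["Cache-Control"])

# The client revalidates later with the ETag it was given.
status2, _, payload2 = conditional_response(page, {"If-None-Match": headers["ETag"]})
print(status2, len(payload2))
```

Within max-age the browser or CDN does not contact the server at all; after that, a 304 still saves the transfer of the body.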

Check Yourself

For what kind of data is write-through the correct pattern, and why does it fail for data that is almost never read?
What is a thundering herd? Name two mitigations.
Why is "we'll just add a cache" insufficient as a scaling plan?
Why does adding rand() to TTLs matter, and what failure does it prevent?
What is the Russian-doll caching pattern and what problem does it solve that flat cache-aside does not?
A service has a 99% hit rate and a cold-cache event drops it to 0% instantly. Given a 50ms DB p99, estimate effective p99 during the cold window and whether the DB can sustain the miss load.
When is a CDN the wrong cache layer even for static assets?

Mini Drill or Application

Pick one hot read path and one hot write path in a service you know. For each, write: "Cache layer: _", "Update pattern: _", "TTL: _", "Invalidation trigger: _", "Behavior on cache outage: _", "Behavior on cold cache after deploy: _". Any blank cells are your next design problem. Then compute: if the cache is cold and the DB is the fallback, at current peak req/s, how long until the DB saturates? That is your window to pre-warm.

Transfer / Where This Shows Up Later

Caching shows up in every system you will ever scale, and it is the source of an alarming fraction of incidents. Expect to revisit this concept repeatedly.

This module, concept 05 (statelessness): the external session store is a cache; every rule here applies to it.
This module, concept 08 (failure modes): cache stampedes are a canonical cascading failure; single-flight locks and jittered TTLs are the textbook mitigations.
This module, concept 11 (load shedding): a cache-miss path without bounded concurrency is an admission-control failure; the DB is the shed point by default unless you build one in.
This module, concept 13 (observability): hit rate, staleness distribution, and eviction rate are three separate metrics; shipping only hit rate is a correctness risk.
S8 M3 (event-driven): event-driven invalidation (emit "product.updated"; consumer deletes the cache key) is the cleanest way to run cache-aside at scale.
S9 M1 (cloud platform): managed caches (ElastiCache, Memorystore, DAX) have their own failure modes - memory-pressure eviction, node-failover blip, TLS-termination overhead - that you own.
S10 M4 (operational readiness): "what happens when the cache is cold" is a capstone-review question. The answer is always a runbook, never a hope.

Read This Only If Stuck

Local chunks (book anchors)

System Design Primer: Cache Overview and Levels -- the short taxonomy of cache layers; read first if you are unsure where to put one.
System Design Primer: Cache Update Patterns -- the four patterns in pseudocode; memorize them.
System Design Primer: Content Delivery Network -- the highest cache layer; almost always free performance if your content allows it.
System Design Primer: Reverse Proxy -- the next layer down; Varnish/Nginx cache is a massive untapped lever in many stacks.
System Design Primer: Consistency Patterns -- every cache pattern is a deliberate choice of staleness; this chapter names the choices.
FoSA: Replicated versus Distributed Caching -- the two topologies and their USL implications; essential reading before you pick a cache cluster shape.
FoSA: Architecture Characteristics Ratings - Space-Based -- a worked architectural trade-off for the cache-heavy style.

External canonical references

AWS, Caching Best Practices and Performance at Scale with Amazon ElastiCache (whitepaper) -- production patterns for TTL jitter, prewarming, and node churn.
AWS Builders' Library, Caching challenges and strategies -- Amazon's internal doctrine; read the sections on thundering herds and cache-miss stampedes in full.
Facebook/Meta, Scaling Memcache at Facebook (NSDI 2013) -- the industry reference for very large, very fast cache tiers; includes the lease mechanism that solves stampedes at origin.
Basecamp (DHH), The performance impact of Russian doll caching -- the nested-keys pattern for invalidation-heavy content.
Marc Brooker, Caches: the good, the bad, and the ugly -- an AWS principal's tour of cache anti-patterns and metastable failure modes.
Martin Kleppmann, Designing Data-Intensive Applications, ch. 1 + ch. 7 (latency concerns and transactions) -- context for why caching is not "free consistency."
Discord, How Discord scaled Elixir to 5M concurrent users -- the "manifold" cache pattern for message fan-out at scale.
