The Memory Hierarchy: Registers, Cache, DRAM, and Disk
What This Concept Is
A single flat "RAM" does not exist in real machines. Instead, storage is layered, with each level faster and smaller than the one below. A representative hierarchy on a 2025-era x86_64 desktop looks like this:
| Level | Typical size | Typical latency | Typical bandwidth |
|---|---|---|---|
| Registers | ~32 × 64-bit | 0 cycles | -- |
| L1 data cache | 32-64 KiB | 4-5 cycles (~1 ns) | 100+ GB/s |
| L2 cache | 256 KiB-1 MiB | ~12 cycles (~3 ns) | ~60 GB/s |
| L3 cache | 8-64 MiB | ~40 cycles (~10 ns) | ~40 GB/s |
| DRAM | 8-128 GiB | ~80-120 ns | 30-100 GB/s |
| NVMe SSD | ~TB | ~20-100 µs | 3-10 GB/s |
| Rotating disk / network | TB-PB | ms | much less |
Each level is a cache for the slower level below it. The hierarchy works because of locality:
- temporal locality -- a byte you touched recently is likely to be touched again
- spatial locality -- a byte next to one you touched is likely to be touched soon
Why It Matters Here
Without the hierarchy, a load from DRAM would always cost ~100 ns -- about 300-400 instructions worth of wasted time on a 3 GHz core. The hierarchy amortizes that cost by keeping working sets close. If your working set fits in L1, every access is effectively free; if it spills to DRAM every time, your throughput collapses to memory bandwidth.
Nearly every performance surprise in systems programming traces back to "the working set stopped fitting." Knowing the rough sizes above is the first tool for explaining surprise slowdowns.
Concrete Example
Consider summing an array of N ints:
long sum(const int *a, int N) {
long s = 0;
for (int i = 0; i < N; ++i) s += a[i];
return s;
}
A 4 KB array (1024 ints) fits entirely in L1. Every iteration issues one load that hits at ~1 ns; the inner loop is bounded by ALU/branch throughput, not memory. A 1 GB array does not fit anywhere but DRAM. Every cache line (64 bytes, i.e. 16 ints) is fetched exactly once, at ~100 ns per miss. That is ~6 ns per element, limited by the memory bus.
The ratio between these two regimes can easily be 6x -- same code, same big-O. The hierarchy is why.
You can see the "staircase" effect by plotting time-per-element against array size: flat in L1, step up at L2, step up at L3, step up at DRAM.
Common Confusion / Misconception
"DRAM is fast because accessing it is just one memory operation." At the ISA level yes, but at the physical level no. A DRAM row-activate + column-read takes ~50-80 ns on its own, and the CPU is waiting the whole time. Treat DRAM as a peripheral the cache protects you from, not as a fast extension of registers.
Also: "Modern SSDs are so fast it does not matter anymore." Measured in absolute latency they are 1000x slower than DRAM. What changed is throughput and queue depth, not individual access latency.
How To Use It
- When you get a surprising slowdown, ask: did the working set just grow past a cache boundary? Pull up your platform's cache sizes (
lscpu,sysctl hw.l1dcachesize, or CPU-Z). - When designing a data structure, aim the hot path at L1. Cold data belongs in L3 or DRAM, not interleaved with hot fields.
- When asked "is this memory-bound or compute-bound?" compute the arithmetic intensity (flops per byte read) and compare against the machine's bandwidth-to-flops ratio. That is the essence of the roofline model.
- When in doubt, measure.
perf stat -e L1-dcache-load-misses,LLC-load-misses,cycles,instructions ./proggives a fast diagnostic.
Check Yourself
- List the five main hierarchy levels from fastest to slowest and give order-of-magnitude latencies.
- Why does the hierarchy work at all? Name the two localities.
- Why is a cache miss that goes all the way to DRAM so expensive in cycles?
- What does "the working set fits in L2" mean operationally?
Mini Drill or Application
Run the classic size-sweep microbenchmark:
// Pseudocode -- fill in a real driver in C
for (size_t N = 1 KiB; N <= 256 MiB; N *= 2) {
int *a = malloc(N);
time how long it takes to sum all elements R times
print N, time / (N/4 * R)
}
Plot time-per-element against N. Identify the plateaus corresponding to L1, L2, L3, and DRAM. Annotate each plateau with the CPU's actual cache size from lscpu.
Then change the stride: instead of sequential access, read a[0], a[stride], a[2*stride], .... For stride = 64 bytes (one cache line) you should see bandwidth drop because prefetchers detect the pattern but each line is used only once. For a larger random stride, DRAM-access latency -- not bandwidth -- becomes the bound.
Common Latency Numbers Worth Memorizing
Rough order-of-magnitude values every programmer should carry in their head:
- L1 hit: ~1 ns (≈3-5 cycles at 3 GHz)
- L2 hit: ~3 ns
- L3 hit: ~10 ns
- DRAM: ~100 ns (≈300 cycles -- "30 cents on the dollar" of a millionth of a second)
- NVMe: ~50 µs
- Cross-datacenter network: ~1 ms
These ratios -- not the exact numbers -- are the mental model worth keeping.
Read This Only If Stuck
- Computer Organization and Design: 5.1 Introduction
- Computer Organization and Design: 5.1 Introduction (Part 2)
- Computer Organization and Design: 5.1 Introduction (Part 3)
- Computer Organization and Design: 5.10 Real Stuff -- Nehalem/Opteron Memory Hierarchies
- External: Ulrich Drepper, "What Every Programmer Should Know About Memory" (LWN)