The TLB and Its Caching Role

What This Concept Is

The translation lookaside buffer (TLB) is a small, very fast cache inside the CPU that stores recent virtual-to-physical translations. On every memory access:

The CPU takes the VPN and probes the TLB.
On a hit, it takes the cached PFN, combines it with the page offset, and accesses memory through the normal cache hierarchy (L1 first, DRAM only on a cache miss).
On a miss, it performs the page-table walk (in hardware on x86/ARM, in software on some MIPS/RISC-V configurations), installs the translation in the TLB, and retries.
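The three steps above can be sketched as a toy model. This is an illustration, not how any real CPU is built: the class name ToyTLB, the fully associative LRU policy, the 4 KiB page size, and the use of a plain dict in place of a multi-level page-table walk are all assumptions made for the example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KiB pages


class ToyTLB:
    """Tiny fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # dict VPN -> PFN; stands in for the page-table walk
        self.entries = OrderedDict()   # cached VPN -> PFN, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Return the physical address for vaddr, updating hit/miss counters."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            # Hit: reuse the cached PFN and mark the entry as recently used.
            self.hits += 1
            self.entries.move_to_end(vpn)
        else:
            # Miss: "walk" the page table, then install the translation.
            self.misses += 1
            pfn = self.page_table[vpn]  # a KeyError here plays the role of a page fault
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
            self.entries[vpn] = pfn
        return self.entries[vpn] * PAGE_SIZE + offset
```

With a page table such as {0: 7, 1: 3}, the first translate() call for an address in page 0 counts as a miss and later calls for the same page count as hits, until the entry is evicted.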

A typical modern CPU has separate instruction and data TLBs per core, each with tens to hundreds of entries, plus a larger unified L2 TLB. Huge-page translations sometimes get their own dedicated slots.

Why It Matters Here

Paging would be unaffordable without the TLB. A 4-level walk costs up to 4 DRAM reads; at ~100 ns each (hundreds of cycles on a multi-GHz core), that adds up to ~400 ns for every load or store that misses. The TLB makes translation nearly free for the vast majority of accesses, so paging's amortized cost is set mainly by the miss rate rather than by the walk length.

Several later concepts only make sense once you believe the TLB is what makes paging cheap:

huge pages reduce TLB pressure by covering 2 MiB or 1 GiB per entry
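TLB shootdowns happen when a mapping is removed or changed (munmap, mprotect, page migration): other cores that may hold the stale translation are interrupted to invalidate it, which stalls multiple cores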
ASID / PCID (address-space identifiers / process-context identifiers) let the TLB keep translations across context switches so the TLB does not need to flush
NUMA and allocator choices interact with TLB behavior at large RSS

If you ever see a profile where performance depends on stride length in weird nonlinear ways, you are probably watching TLB behavior.

Concrete Example

A cost sketch. Suppose L1 cache is 1 ns, DRAM is 100 ns, and a 4-level page-table walk touches 4 cache-cold entries at 100 ns each.

Event	Cost
TLB hit, L1 hit	~1 ns
TLB hit, data fetched from DRAM	~100 ns
TLB miss, walk entries all in L2/L3 cache	~10-40 ns extra
TLB miss, walk entries all cache-cold	~400 ns extra
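The rows of this table can be folded into one average figure. A minimal sketch, assuming the same illustrative numbers as above (4 walk levels, 100 ns per cache-cold entry); the function name and its parameters are made up for this example.

```python
def avg_translation_overhead_ns(miss_rate, walk_levels=4, ns_per_level=100.0):
    """Average extra nanoseconds per memory access spent on page-table walks.

    miss_rate    -- fraction of accesses that miss the TLB (0.0 to 1.0)
    walk_levels  -- page-table levels touched per walk (assumed 4)
    ns_per_level -- cost of one walk step (assumed 100 ns, i.e. cache-cold)
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return miss_rate * walk_levels * ns_per_level
```

Under these assumptions a 1% miss rate adds about 4 ns per access on average, which is already several times the 1 ns L1 hit time. Passing a smaller ns_per_level models walks whose entries sit in L2/L3 cache.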

A workload touching 64 KiB with 4 KiB pages fits in 16 TLB entries; a workload touching 64 MiB needs 16,384 entries, which will not fit in any L1 TLB. The second workload's throughput can be dominated by TLB miss walks.
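The arithmetic behind these entry counts is a ceiling division. A small sketch; the helper names are hypothetical, and it assumes every page of the working set needs its own TLB entry.

```python
KIB = 1024
MIB = 1024 * KIB


def entries_needed(working_set_bytes, page_bytes=4 * KIB):
    """TLB entries needed to map a working set, one entry per page touched."""
    return -(-working_set_bytes // page_bytes)  # ceiling division on integers


def tlb_reach_bytes(entries, page_bytes=4 * KIB):
    """Bytes of address space a TLB with this many entries can map at once."""
    return entries * page_bytes
```

entries_needed(64 * KIB) and entries_needed(64 * MIB) reproduce the 16 and 16,384 figures above; passing page_bytes=2 * MIB shows how far the second number drops with huge pages.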

A measurement example. Walk an array of N doubles with stride S, varying S. Throughput is flat at small S (cache and TLB win), then drops at the point where each touch hits a new cache line (cache miss), then drops again at the point where each touch hits a new page (TLB miss). Two distinct knees in the curve.
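Where the two knees should appear can be estimated without timing anything, by counting how many distinct cache lines and pages a strided walk touches. The sketch below is an address-counting model only, not a benchmark; it assumes 8-byte doubles, 64-byte cache lines, and 4 KiB pages, and the function name is invented for this example.

```python
def new_units_per_access(stride_elems, elem_bytes=8, line_bytes=64,
                         page_bytes=4096, accesses=4096):
    """Return (distinct cache lines, distinct pages) touched per access.

    A value of 1.0 means every access lands on a new line (or page),
    so every access is a potential cache (or TLB) miss.
    """
    lines = set()
    pages = set()
    for i in range(accesses):
        addr = i * stride_elems * elem_bytes  # byte offset of the i-th touch
        lines.add(addr // line_bytes)
        pages.add(addr // page_bytes)
    return len(lines) / accesses, len(pages) / accesses
```

Under these assumptions the line ratio reaches 1.0 at a stride of 8 elements (64 / 8) and the page ratio reaches 1.0 at a stride of 512 elements (4096 / 8), which is roughly where the two knees should sit on such a machine.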

A real TLB entry holds: VPN, PFN, valid bit, protection bits, global bit (shared across address spaces, typically for kernel mappings), ASID / PCID (to tag the translation by process), and often a "large page" bit.
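As a rough picture of those fields, here is a sketch of a TLB entry as a data structure. The field names, the split of protection bits into writable/user flags, and the matching rule are simplifying assumptions for illustration; real hardware layouts differ by architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TLBEntry:
    vpn: int                  # virtual page number (the lookup key)
    pfn: int                  # physical frame number (the cached result)
    valid: bool = True
    writable: bool = False    # protection bits, simplified to two flags
    user: bool = True
    is_global: bool = False   # shared across address spaces (e.g. kernel mappings)
    asid: int = 0             # address-space tag (ASID on ARM, PCID on x86)
    large_page: bool = False

    def matches(self, vpn, current_asid):
        """True if this entry can translate vpn for the running address space."""
        if not self.valid or self.vpn != vpn:
            return False
        # Global entries ignore the tag; all others must match the current ASID.
        return self.is_global or self.asid == current_asid
```

The matches() rule is what lets entries from several processes coexist: an entry tagged with another ASID simply never matches, so it does not need to be flushed when the running process changes.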

Common Confusion / Misconception

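"TLB miss = page fault." No. A TLB miss is a cache miss on the translation: the page-table entry exists and is valid, and it just has to be fetched. On hardware-walked architectures (x86, ARM) the OS is not involved at all; on software-managed TLBs a short kernel refill handler runs. A page fault means the translation in the page table itself is missing or the access violates its protection bits, so the OS fault handler is invoked. TLB misses are common and comparatively cheap; page faults are much rarer and far more expensive.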

"A context switch flushes the TLB." Not necessarily. Modern x86 (PCID) and ARM (ASID) tag TLB entries so the TLB can hold translations for several address spaces at once. Switching processes then usually changes the tag being matched instead of flushing. Flushes can still occur, for example when tags are recycled, or on kernel-user transitions with kernel page-table isolation on CPUs without PCID.

"Bigger TLB always better." A bigger TLB means a slower TLB (higher access latency and higher power). Designs compromise with a small L1 TLB plus a bigger L2 TLB.

How To Use It

When analyzing a memory-bound workload, ask:

Roughly how many unique pages does the hot inner loop touch?
Compare against TLB sizes (on Linux, cpuid -1 reports TLB geometry on x86, and /proc/cpuinfo shows a TLB size line on some CPUs; vendor documentation gives the authoritative figures). If the working set exceeds TLB reach, TLB misses can dominate.
Would huge pages change TLB reach by 512x (2 MiB pages)? If so, test them.
Is the workload sensitive to stride in a nonlinear way? That is a TLB signature.

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./prog gives direct counters (which of these events are available varies by CPU and kernel).

Check Yourself

What is the difference between a TLB miss and a page fault in terms of who handles it and how much it costs?
Why does a large-stride traversal of a big array show worse throughput than a small-stride traversal, even when byte counts are equal?
Why do PCID/ASID tags remove most of the TLB cost of a context switch?
What is "TLB reach," and why do huge pages typically improve it by a large factor?

Mini Drill or Application

Compute TLB reach for a 64-entry data TLB at 4 KiB pages, then at 2 MiB pages. Ratio?
Your program's hot loop touches 16 MiB of data sequentially in chunks of 64 bytes. Predict where the stride knees lie: L1 cache, L2 cache, page boundary.
On x86-64, the L2 TLB is typically around 1,024-2,048 entries. What working-set size does that cover with 4 KiB pages? With 2 MiB pages?
Why is a TLB shootdown usually slower than a local TLB flush?
Design a small program to measure TLB miss cost: explain how stride affects what you measure. (You will build this in Practice 1.)

Read This Only If Stuck