Virtual Memory: Pages, Page Tables, and the TLB
What This Concept Is
Virtual memory is the illusion that every process has its own flat address space. In reality the CPU translates each virtual address to a physical address on every access via a page table the OS maintains. The unit of translation is a page, typically 4 KiB (large/huge pages give 2 MiB, 1 GiB).
For a typical x86_64 process:
- virtual address = 48 bits (sign-extended to 64)
- physical address = up to 52 bits
- translation walks a four-level radix tree (PML4 -> PDPT -> PD -> PT) producing a page-table entry (PTE)
- the PTE contains the physical page number plus permission bits (read/write/execute, user/supervisor, present, dirty, accessed)
Because walking the page table on every load/store would be ruinous, translations are cached in the translation lookaside buffer (TLB), a small associative memory (typically 64-256 entries for L1 dTLB, 1000+ for L2 TLB). A TLB hit adds roughly zero cycles to a memory access; a TLB miss triggers a page-table walk that may take dozens to hundreds of cycles, especially if the walk itself misses the caches.
On a real miss (no PTE yet, or the page is on disk), the CPU raises a page fault exception, and the OS decides what to do: allocate a new frame, read the page from swap, kill the process, and so on.
Why It Matters Here
Virtual memory connects the topics in this module together:
- It turns pointer chasing in a big heap into a mix of cache misses and TLB misses.
- It explains why huge pages speed up large-working-set workloads disproportionately: fewer TLB entries cover the same bytes.
- It is the layer that lets a single physical machine safely run multiple processes, which sets up everything in Module 4.
- It is the main source of "why did this program slow down after 20 seconds?" surprises -- the page cache got evicted, or the TLB started missing as the heap grew.
Concrete Example
Translate 0x00007ffe_12345678 on a 4-level, 4 KiB page table:
virtual = 0000 0000 0000 0000 0111 1111 1111 1110 0001 0010 0011 0100 0101 0110 0111 1000
break into 9 + 9 + 9 + 9 + 12 bits (PML4 | PDPT | PD | PT | offset)
PML4 index = 0x0FF
PDPT index = 0x1F1
PD index = 0x023
PT index = 0x045
offset = 0x678
Translation walk:
PML4[0x0FF].ptr -> PDPT base
PDPT[0x1F1].ptr -> PD base
PD [0x023].ptr -> PT base
PT [0x045].pfn -> physical frame 0x12345
physical address = (0x12345 << 12) | 0x678 = 0x12345678
If the TLB had a cached entry for virtual >> 12 = 0x07ffe12345, we skip all four reads. If not, we do four reads (themselves possibly in the L2 or L3 cache) before the intended load can proceed.
Common Confusion / Misconception
"Virtual memory is just swap." Swap is a minor feature on modern systems with large DRAM. The core job of VM is isolation, permissions, and the ability to lay out a flexible address space (e.g. mmap, shared libraries at arbitrary addresses, guard pages, read-only .rodata).
Another trap: assuming a cache hit is free regardless of TLB state. You can hit in L1 and still pay for a TLB miss, because the hardware probes L1 with the virtual address but must still produce the physical tag to confirm -- if the TLB does not have the translation, the core stalls even though the data is right there.
How To Use It
- For large working sets (>>2 MiB), request huge pages (
MAP_HUGETLB,madvise(MADV_HUGEPAGE)). One 2 MiB huge page covers as much as 512 regular pages with a single TLB entry. - When a program slows down over time, suspect TLB shootdowns (on page unmaps) and page-cache churn as well as classic cache churn.
- Use
perf stat -e dTLB-load-misses,iTLB-load-missesto see if translations are the bottleneck. - Design allocators to be page-aware: arena allocators that reuse pages are kinder to the TLB than bump allocators that spread across many pages.
Check Yourself
- Why does page-based translation walk a multi-level radix tree rather than a single flat table?
- What is the TLB, and what does a TLB miss cost you?
- Why do huge pages reduce TLB pressure?
- What kinds of permissions can a PTE encode, and when does each matter?
Mini Drill or Application
- On Linux, read
/proc/self/mapsfor a running program. Identify the code, data, heap, stack, and shared-library regions. Note which arer-x, which arer--, and which arerw-. - Write a small program that allocates a 2 GiB array, touches every page, and measures
perf stat -e dTLB-load-misses. Repeat withmmap(..., MAP_HUGETLB, ...)and compare. - Given a page-table PTE encoding, decode a specific PTE's physical page number and permission bits. (Use the Intel manual's PTE format.)
- Read
/proc/self/statusand observeVmPeak,VmRSS, andVmSwap. Explain the difference between virtual size and resident set size in the context ofmmaping a file larger than RAM.
Going Deeper
- Copy-on-write (COW):
forkdoes not copy the address space; it marks pages read-only and copies only on the first write. A short-lived child after a large parent'sforkis essentially free. - Demand paging: newly allocated anonymous pages are zero-filled on first touch. Your
malloc(1 GiB)does not cost 1 GiB of physical memory until you write to it. - Page-table shootdowns: when a CPU changes a PTE (e.g., during
munmap), it must invalidate TLBs on every core that might have cached the translation. This is an IPI-driven operation and can dominate multi-threaded workloads with high unmap rates.
Read This Only If Stuck
- Computer Organization and Design: 5.4 Virtual Memory
- Computer Organization and Design: 5.4 Virtual Memory (Part 3)
- Computer Organization and Design: 5.4 Virtual Memory (Part 5)
- Computer Organization and Design: 5.4 Virtual Memory (Part 8)
- Computer Organization and Design: 5.4 Virtual Memory (Part 11)
- External: Ulrich Drepper, "What Every Programmer Should Know About Memory" (LWN)