Virtual Memory: Pages, Page Tables, and the TLB

What This Concept Is

Virtual memory is the illusion that every process has its own flat address space. In reality the CPU translates each virtual address to a physical address on every access via a page table the OS maintains. The unit of translation is a page, typically 4 KiB (large/huge pages give 2 MiB, 1 GiB).

For a typical x86_64 process:

virtual address = 48 bits (sign-extended to 64)
physical address = up to 52 bits
translation walks a four-level radix tree (PML4 -> PDPT -> PD -> PT) producing a page-table entry (PTE)
the PTE contains the physical page number plus permission bits (read/write/execute, user/supervisor, present, dirty, accessed)

Because walking the page table on every load/store would be ruinous, translations are cached in the translation lookaside buffer (TLB), a small associative memory (typically 64-256 entries for L1 dTLB, 1000+ for L2 TLB). A TLB hit adds roughly zero cycles to a memory access; a TLB miss triggers a page-table walk that may take dozens to hundreds of cycles, especially if the walk itself misses the caches.

On a real miss (no PTE yet, or the page is on disk), the CPU raises a page fault exception, and the OS decides what to do: allocate a new frame, read the page from swap, kill the process, and so on.

Why It Matters Here

Virtual memory connects the topics in this module together:

It turns pointer chasing in a big heap into a mix of cache misses and TLB misses.
It explains why huge pages speed up large-working-set workloads disproportionately: fewer TLB entries cover the same bytes.
It is the layer that lets a single physical machine safely run multiple processes, which sets up everything in Module 4.
It is the main source of "why did this program slow down after 20 seconds?" surprises -- the page cache got evicted, or the TLB started missing as the heap grew.

Concrete Example

Translate 0x00007ffe_12345678 on a 4-level, 4 KiB page table:

virtual = 0000 0000 0000 0000 0111 1111 1111 1110 0001 0010 0011 0100 0101 0110 0111 1000

break into 9 + 9 + 9 + 9 + 12 bits (PML4 | PDPT | PD | PT | offset)
PML4 index  = 0x0FF
PDPT index  = 0x1F1
PD   index  = 0x023
PT   index  = 0x045
offset      = 0x678

Translation walk:

PML4[0x0FF].ptr  -> PDPT base
PDPT[0x1F1].ptr  -> PD   base
PD  [0x023].ptr  -> PT   base
PT  [0x045].pfn  -> physical frame 0x12345
physical address = (0x12345 << 12) | 0x678 = 0x12345678

If the TLB had a cached entry for virtual >> 12 = 0x07ffe12345, we skip all four reads. If not, we do four reads (themselves possibly in the L2 or L3 cache) before the intended load can proceed.

Common Confusion / Misconception

"Virtual memory is just swap." Swap is a minor feature on modern systems with large DRAM. The core job of VM is isolation, permissions, and the ability to lay out a flexible address space (e.g. mmap, shared libraries at arbitrary addresses, guard pages, read-only .rodata).

Another trap: assuming a cache hit is free regardless of TLB state. You can hit in L1 and still pay for a TLB miss, because the hardware probes L1 with the virtual address but must still produce the physical tag to confirm -- if the TLB does not have the translation, the core stalls even though the data is right there.

How To Use It

For large working sets (>>2 MiB), request huge pages (MAP_HUGETLB, madvise(MADV_HUGEPAGE)). One 2 MiB huge page covers as much as 512 regular pages with a single TLB entry.
When a program slows down over time, suspect TLB shootdowns (on page unmaps) and page-cache churn as well as classic cache churn.
Use perf stat -e dTLB-load-misses,iTLB-load-misses to see if translations are the bottleneck.
Design allocators to be page-aware: arena allocators that reuse pages are kinder to the TLB than bump allocators that spread across many pages.

Check Yourself

Why does page-based translation walk a multi-level radix tree rather than a single flat table?
What is the TLB, and what does a TLB miss cost you?
Why do huge pages reduce TLB pressure?
What kinds of permissions can a PTE encode, and when does each matter?

Mini Drill or Application

On Linux, read /proc/self/maps for a running program. Identify the code, data, heap, stack, and shared-library regions. Note which are r-x, which are r--, and which are rw-.
Write a small program that allocates a 2 GiB array, touches every page, and measures perf stat -e dTLB-load-misses. Repeat with mmap(..., MAP_HUGETLB, ...) and compare.
Given a page-table PTE encoding, decode a specific PTE's physical page number and permission bits. (Use the Intel manual's PTE format.)
Read /proc/self/status and observe VmPeak, VmRSS, and VmSwap. Explain the difference between virtual size and resident set size in the context of mmaping a file larger than RAM.

Going Deeper

Copy-on-write (COW): fork does not copy the address space; it marks pages read-only and copies only on the first write. A short-lived child after a large parent's fork is essentially free.
Demand paging: newly allocated anonymous pages are zero-filled on first touch. Your malloc(1 GiB) does not cost 1 GiB of physical memory until you write to it.
Page-table shootdowns: when a CPU changes a PTE (e.g., during munmap), it must invalidate TLBs on every core that might have cached the translation. This is an IPI-driven operation and can dominate multi-threaded workloads with high unmap rates.

What This Concept Is​

Why It Matters Here​

Concrete Example​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Going Deeper​

Read This Only If Stuck​