Skip to main content

Large Pages, NUMA, and Modern Wrinkles

What This Concept Is

Three features that modern CPUs and OSes add on top of the basic paging model, and that regularly appear in real performance work:

  • Large / huge pages. In addition to the standard 4 KiB page, x86-64 supports 2 MiB and 1 GiB pages; ARMv8 supports 16 KiB, 64 KiB, 2 MiB, 32 MiB, 512 MiB, 1 GiB depending on configuration. A single huge-page TLB entry covers much more address space, cutting TLB miss rate dramatically for workloads with big working sets. Two flavors on Linux: explicit hugetlbfs (reserved, administrator-controlled) and Transparent Huge Pages (THP) (kernel tries to promote runs of 4 KiB pages to 2 MiB automatically).
  • NUMA (Non-Uniform Memory Access). On multi-socket servers, each socket has its own memory controller and "local" DRAM; accessing another socket's DRAM is slower (typically 1.5-2x latency). The kernel tracks NUMA nodes and tries to keep processes' memory local to the CPU that runs them, via policies (numactl, set_mempolicy, auto-balancing).
  • Other modern wrinkles. Memory compression (zswap, zram), CXL-attached memory tiers, persistent memory (pmem, DAX), page migration, KSM (kernel same-page merging), idle-page tracking, userfaultfd, per-memcg reclaim in cgroup v2, PCID/ASID for context-switch-friendly TLBs.

These change cost models. A concept from 2005 that said "one page is 4 KiB and all RAM is equidistant" is wrong in 2025.

Why It Matters Here

These features move the performance envelope:

  • A large-TLB-reach workload (in-memory database, ML inference on a big model, JIT with a huge code heap) can run 10-30% faster with 2 MiB pages at no other cost.
  • A NUMA-oblivious workload on a dual-socket server can lose 20-50% throughput purely to cross-socket traffic. The fix is often just numactl --cpunodebind=0 --membind=0 ./prog or enabling NUMA balancing.
  • Enabling THP aggressively can hurt workloads that are fork-heavy or have many small short-lived allocations (CoW on 2 MiB pages copies 2 MiB per byte written, khugepaged compaction can stall).
  • zram as swap-to-RAM can delay OOM and improve responsiveness under pressure on memory-tight systems.

Being able to reason about these lets you explain measurements that appear to contradict simple paging models.

Concrete Example

TLB reach under huge pages. A CPU with a 1,536-entry L2 TLB:

  • 4 KiB pages: reach = 1,536 * 4 KiB = 6 MiB
  • 2 MiB pages: reach = 1,536 * 2 MiB = 3 GiB
  • 1 GiB pages: reach = 1,536 * 1 GiB = 1.5 TiB

A workload with a 1 GiB hot set on 4 KiB pages touches ~262k distinct pages; no TLB can hold that. On 2 MiB pages it touches 512 pages; fits easily. TLB miss rate collapses.

THP in action. cat /sys/kernel/mm/transparent_hugepage/enabled shows the current policy (always, madvise, or never). Under always, Linux opportunistically promotes runs of 4 KiB pages to 2 MiB when it can find contiguous frames. /proc/meminfo line AnonHugePages tracks how much anonymous memory is THP-backed.

NUMA. numactl --hardware shows nodes, memory per node, and node distances. A typical dual-socket box:

node 0 cpus: 0-31 memory: 128 GiB
node 1 cpus: 32-63 memory: 128 GiB
node distances:
0 1
0: 10 21
1: 21 10

Distance 10 = local, 21 = roughly 2x latency cost (relative, not absolute). A process pinned to CPUs on node 0 should ideally allocate from node 0's memory.

NUMA mistake. A database server allocates all its shared buffers on the thread that runs first (typical: node 0). Later threads on node 1 do almost all their accesses across the interconnect. Throughput drops. Fix: interleave with numactl --interleave=all or restructure allocation.

Common Confusion / Misconception

"Huge pages are always a win." THP in particular can hurt: promotion triggers memory compaction (which scans memory looking for contiguous free runs) and can cause stalls. CoW on a 2 MiB page amplifies fork divergence cost 512x relative to 4 KiB pages. Redis famously recommends disabling THP.

"NUMA doesn't matter on my machine." If you are on a single-socket machine, it largely doesn't. On two or more sockets, it almost always does. Modern AMD EPYC and Intel Xeon with chiplet designs even have NUMA-like effects within a single socket.

"CXL / persistent memory change nothing for programmers." They introduce new tiers with their own latency and durability profiles. Treat them as extra NUMA-like nodes and expect the kernel to migrate pages between tiers over time.

How To Use It

For a performance investigation:

  1. First check 4 KiB TLB miss rate (perf stat -e dTLB-load-misses). If it is significant relative to load count, huge pages are worth testing.
  2. For NUMA: numastat -m, numastat -c $pid, cat /proc/$pid/numa_maps. Check whether memory is local to the CPUs running the threads.
  3. For THP investigation: cat /proc/meminfo | grep AnonHuge, grep -i 'hugepage\|thp' /proc/vmstat.
  4. For compaction stalls: grep compact /proc/vmstat.

Almost every optimization here is "measure, adjust a knob (huge pages, NUMA policy, compaction), measure again."

Check Yourself

  1. Define TLB reach. Compute it for a TLB of 512 entries at 4 KiB and at 2 MiB.
  2. Explain one scenario where enabling THP globally makes performance worse.
  3. What does numactl --interleave=all do, and when is it appropriate?
  4. Why does CoW become more expensive when combined with THP?
  5. Name one feature from the "modern wrinkles" list and explain in one paragraph what cost-model change it introduces.

Mini Drill or Application

  1. Measure dTLB-load-misses for a program that sums a 1 GiB array sequentially, first with 4 KiB pages and then with THP enabled. (Use madvise(..., MADV_HUGEPAGE).) Report the ratio.
  2. On a multi-socket machine, run a memory-bandwidth microbench (stream) pinned to one socket but allocating from another via numactl --cpunodebind=0 --membind=1. Compare to fully local.
  3. Read /proc/$pid/numa_maps for a running process. Identify at least one mapping where pages are spread across multiple NUMA nodes.
  4. Why does Redis recommend disabling THP for the main server process but often tolerate it in child snapshot processes?
  5. What does echo 1 > /proc/sys/vm/compact_memory do, and when would an operator use it?

Read This Only If Stuck