Module 2: Memory Management & Virtual Memory: Case Studies
These case studies make virtual memory operational: page tables, TLBs, faults, copy-on-write, huge pages, allocators, and thrashing.
Case Study 1: Page Fault Storm After Deploy
Scenario: A service starts slowly after a deploy and major page faults spike. The new container image changed memory access patterns, and cold starts now touch a much larger working set.
Source anchor: The Linux kernel documentation on Page Tables explains hierarchical virtual-to-physical address translation.
Module concepts: page table, demand paging, major fault, working set.
Wrong Approach
"A page fault is always a crash."
Better Approach
Classify faults:
minor fault:
page is already in RAM but not yet mapped in this process's page table; no disk I/O
major fault:
page must be read from disk or another backing store before it can be mapped
fix:
reduce cold working set, warm critical paths, adjust image/data layout
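The classification above can be measured from inside a process. The following is a minimal sketch, assuming a Unix-like system where Python's `resource` module is available; the helper names (`fault_counts`, `measure`, `classify`) and the 5% threshold are illustrative assumptions, not part of any kernel interface.

```python
import resource


def fault_counts():
    """Return (minor, major) page-fault counts for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def measure(workload):
    """Run workload() and return the (minor, major) faults it added."""
    before = fault_counts()
    workload()
    after = fault_counts()
    return after[0] - before[0], after[1] - before[1]


def classify(minor_delta, major_delta, major_ratio_threshold=0.05):
    """Label a measurement window. The threshold is an arbitrary assumption."""
    total = minor_delta + major_delta
    if total == 0:
        return "no faults"
    if major_delta / total >= major_ratio_threshold:
        return "major-fault heavy: suspect cold working set or backing-store I/O"
    return "mostly minor faults: usually normal demand paging"


if __name__ == "__main__":
    def touch_memory():
        # Touch one byte per 4 KiB so each page of a fresh buffer is faulted in.
        buf = bytearray(64 * 1024 * 1024)
        for offset in range(0, len(buf), 4096):
            buf[offset] = 1

    minor, major = measure(touch_memory)
    print(f"minor={minor} major={major} -> {classify(minor, major)}")
```

Wrapping a startup path in `measure` before and after a deploy gives the minor/major counts that the fault diagnosis asks for. Touching fresh anonymous memory, as in the demo, normally produces minor faults only; major faults appear when pages must come from disk.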
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| accept larger cold start | no engineering work | slower deploy recovery |
| warm critical paths | faster readiness | extra startup work |
| reduce working set | fewer major faults | possible feature/layout changes |
Failure Mode
The deploy changes which pages are touched during startup, so the process triggers a burst of major faults while demand-paging a much larger cold working set.
Project / Capstone Connection
Use this when diagnosing slow startup in containerized services, ML inference apps, or student deployments that load large assets on boot.
Required Artifact
Write a fault diagnosis: minor/major counts, working set, cold path, and mitigation.
Case Study 2: TLB Pressure And Huge Pages
Scenario: An in-memory analytics process scans a 60 GB column store. CPU usage is high but memory bandwidth is not saturated. TLB misses dominate.
Source anchor: The Linux kernel documentation on Transparent Hugepage Support explains how larger pages reduce translation overhead.
Module concepts: TLB, page size, huge pages, locality.
Wrong Approach
"More RAM fixes memory performance."
Better Approach
Look at translation cost:
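4 KiB pages:
each TLB entry covers only 4 KiB, so a large scan needs many translations
huge pages:
each TLB entry covers far more memory (commonly 2 MiB on x86-64), so fewer entries cover the same data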
risk:
fragmentation and latency spikes from compaction
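The translation cost can be estimated with simple arithmetic before touching any kernel setting. The sketch below is a back-of-envelope model: it treats the 60 GB column store as 60 GiB, assumes 2 MiB huge pages (a common x86-64 size), and uses a hypothetical TLB with 1536 entries. Real TLB sizes and page sizes vary by CPU.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes, page_bytes):
    """Number of pages required to map a region (rounded up)."""
    return -(-region_bytes // page_bytes)


def tlb_reach(entries, page_bytes):
    """Bytes of address space the TLB can cover without a miss."""
    return entries * page_bytes


if __name__ == "__main__":
    dataset = 60 * GIB      # assumption: 60 GB read as 60 GiB
    tlb_entries = 1536      # hypothetical entry count, not a measured value

    for label, page in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB)):
        pages = pages_needed(dataset, page)
        reach = tlb_reach(tlb_entries, page)
        print(f"{label}: {pages} pages to map, TLB reach {reach // MIB} MiB")
```

Under these assumptions the scan needs 15,728,640 translations with 4 KiB pages but only 30,720 with 2 MiB pages, and TLB reach grows from 6 MiB to 3 GiB. The expected gain in the decision note should still come from a measured TLB-miss metric, not from this estimate alone.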
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| stay on 4 KiB pages | predictable behavior | more TLB pressure |
| enable huge pages | fewer translations | fragmentation risk |
| selective huge-page use | targeted gain | operational complexity |
Failure Mode
The workload fits in memory, but address translation becomes the bottleneck because the TLB cannot cover enough of the active scan efficiently.
Project / Capstone Connection
This is relevant for analytics, search, or simulation capstones that scan large in-memory structures and need to distinguish bandwidth limits from translation limits.
Required Artifact
Write a huge-page decision note with workload, TLB metric, expected gain, and rollback.
Case Study 3: Copy-On-Write Surprise After Fork
Scenario: A server forks workers after loading a large model. Memory looks shared at first, then each worker mutates global caches and RSS grows sharply.
Source anchor: The Linux fork(2) manual page documents child creation. Copy-on-write is the virtual-memory mechanism that lets parent and child share pages until one of them writes.
Module concepts: fork, copy-on-write, RSS, shared pages.
Wrong Approach
"Forked workers share all memory forever."
Better Approach
Keep shared pages read-only where possible:
load model before fork
avoid post-fork writes to shared structures
move mutable caches per worker intentionally
measure PSS alongside RSS (RSS counts shared pages in full for every worker; PSS divides them among the sharers)
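One way to take that measurement on Linux is to read `/proc/<pid>/smaps_rollup`, which reports `Rss`, `Pss`, and shared/private totals in kB. The sketch below is a minimal parser under that assumption; the function names are illustrative, and the file may be missing on older kernels or non-Linux systems.

```python
def parse_smaps_rollup(text):
    """Parse 'Key:   123 kB' lines into a {key: kilobytes} dict."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # The header line and any non-kB lines are skipped.
        if len(parts) == 3 and parts[0].endswith(":") and parts[2] == "kB":
            fields[parts[0][:-1]] = int(parts[1])
    return fields


def sharing_summary(fields):
    """Summarize how much resident memory is shared versus private."""
    private = fields.get("Private_Clean", 0) + fields.get("Private_Dirty", 0)
    shared = fields.get("Shared_Clean", 0) + fields.get("Shared_Dirty", 0)
    return {
        "rss_kb": fields.get("Rss", 0),
        "pss_kb": fields.get("Pss", 0),
        "private_kb": private,
        "shared_kb": shared,
    }


def read_rollup(pid="self"):
    """Read and parse the rollup file for a process (Linux only)."""
    with open(f"/proc/{pid}/smaps_rollup") as handle:
        return parse_smaps_rollup(handle.read())


if __name__ == "__main__":
    try:
        print(sharing_summary(read_rollup()))
    except OSError as exc:
        print(f"smaps_rollup not available here: {exc}")
```

Sampling each worker right after fork and again under load shows the effect directly: as workers write to shared pages, `private_kb` and `pss_kb` rise while `shared_kb` falls.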
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| fork after preload | strong initial sharing | sensitive to later writes |
| mutate shared caches post-fork | simpler code reuse | RSS growth |
| isolate mutable state per worker | predictable memory | more duplication |
Failure Mode
Workers begin writing to previously shared pages. Each written page is copied into the writing process, so every worker accumulates private copies and total physical memory use grows with the number of workers.
Project / Capstone Connection
Use this when scaling prefork servers, inference workers, or data-processing daemons that load large common state before forking.
Required Artifact
Draw a COW page diagram before and after one worker writes.
Case Study 4: Allocator Fragmentation In A Long-Running Service
Scenario: A service allocates many varied-size objects. Heap usage grows even after request load drops.
Source anchor: malloc behavior is implementation-specific; use mallopt(3) and allocator docs to ground the investigation.
Module concepts: allocator, fragmentation, arenas, RSS, heap profiling.
Wrong Approach
"If objects are freed, memory returns to the OS immediately."
Better Approach
Profile allocation shape:
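size classes:
which request sizes dominate
allocation lifetime:
whether short-lived and long-lived objects interleave
thread arenas:
how many arenas exist and how unevenly they are used
fragmentation:
free bytes the allocator holds but cannot reuse or return
RSS vs live heap:
gap between resident memory and bytes still allocated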
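Why RSS can stay high while live heap drops is easiest to see in a toy model. The sketch below is not a real allocator: it assumes a simple bump allocator that places fixed-size objects back to back, and that a page can be returned to the OS only when no live object remains on it. All names and sizes are illustrative.

```python
PAGE = 4096


def pages_in_use(live_offsets, obj_size, page=PAGE):
    """Count distinct pages that still hold at least one live object."""
    pages = set()
    for offset in live_offsets:
        first = offset // page
        last = (offset + obj_size - 1) // page
        pages.update(range(first, last + 1))
    return len(pages)


def simulate(n_objects=4096, obj_size=256, keep_every=16):
    """Allocate n_objects contiguously, then free all but every Nth one."""
    offsets = [i * obj_size for i in range(n_objects)]
    survivors = offsets[::keep_every]
    return {
        "live_bytes_before": n_objects * obj_size,
        "live_bytes_after": len(survivors) * obj_size,
        "pages_before": pages_in_use(offsets, obj_size),
        "pages_after": pages_in_use(survivors, obj_size),
    }


if __name__ == "__main__":
    for key, value in simulate().items():
        print(f"{key}: {value}")
```

With the default parameters, live bytes fall from 1 MiB to 64 KiB, yet all 256 pages remain pinned because one long-lived object survives on each page. This is the pattern an allocation profile should look for when lifetimes are mixed.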
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| ignore fragmentation | no immediate effort | persistent RSS growth |
| tune allocator behavior | lower waste | platform-specific tuning |
| reshape allocation patterns | durable fix | code changes |
Failure Mode
Freed objects leave holes across arenas and size classes, so live heap drops while RSS stays high and reuse becomes inefficient.
Project / Capstone Connection
This fits long-running APIs, brokers, or game servers in capstones where memory growth appears long after the triggering request burst.
Required Artifact
Write an allocator investigation note with allocation profile and mitigation.
Case Study 5: Thrashing From Oversubscribed Memory
Scenario: Five services fit in memory individually but thrash when colocated. CPU utilization drops, disk I/O rises, and latency explodes.
Source anchor: The Linux memory-management and cgroup v2 memory controller documentation explain memory pressure and limits.
Module concepts: working set, thrashing, swapping, cgroup memory, OOM.
Wrong Approach
"CPU is low, so the service is not busy."
Better Approach
Treat memory as the bottleneck:
working set total:
exceeds RAM
symptoms:
major faults, swap I/O, reclaim, OOM kills
fix:
reduce colocated workload or set memory limits
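The working-set check above is simple arithmetic and can be sketched before any placement decision. The example below is a minimal sketch: the service names, working-set sizes, node size, and reserve are hypothetical numbers, and real working sets should come from measurement.

```python
def memory_plan(working_sets_mib, ram_mib, reserve_mib=0):
    """Compare the summed working sets against usable RAM on one node."""
    total = sum(working_sets_mib.values())
    usable = ram_mib - reserve_mib
    return {
        "total_mib": total,
        "usable_mib": usable,
        "headroom_mib": usable - total,
        "oversubscribed": total > usable,
    }


def fits_alone(working_sets_mib, ram_mib, reserve_mib=0):
    """Names of services whose working set fits when run by itself."""
    usable = ram_mib - reserve_mib
    return [name for name, size in working_sets_mib.items() if size <= usable]


if __name__ == "__main__":
    # Hypothetical node: 12 GiB RAM, 1 GiB reserved for the OS and page cache.
    services = {f"svc{i}": 3072 for i in range(1, 6)}
    print(fits_alone(services, ram_mib=12288, reserve_mib=1024))
    print(memory_plan(services, ram_mib=12288, reserve_mib=1024))
```

In this hypothetical case every service fits alone, but together they need 15,360 MiB against 11,264 MiB usable, a shortfall of 4,096 MiB. That gap is the input to the placement decision: remove services from the node, or set limits and accept earlier OOM kills.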
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| colocate all services | high utilization on paper | thrashing risk |
| set memory limits | isolation | earlier OOM or eviction |
| spread services across nodes | stable latency | infrastructure cost |
Failure Mode
The combined working sets exceed available RAM, so reclaim and swap dominate execution and useful CPU work collapses.
Project / Capstone Connection
Apply this when packing multiple student services onto a shared VM or Kubernetes node and deciding whether memory isolation or placement must change.
Required Artifact
Create a memory pressure report with working set, faults, swap/reclaim, cgroup limit, and placement decision.
Source Map
| Source | Use it for |
|---|---|
| Linux Page Tables | virtual-to-physical translation |
| Transparent Hugepage Support | huge pages and TLB pressure |
| fork(2) | fork and COW context |
| mallopt(3) | allocator tuning surface |
| Linux cgroup v2 | memory control and pressure |
Completion Standard
- At least three artifacts are completed.
- At least one artifact walks address translation or COW.
- At least one artifact diagnoses faults or memory pressure.