Allocator Comparison Clinic
Where abstract allocator descriptions become measured differences in RSS, throughput, and tail latency.
Retrieval Prompts
- State the primary design difference between ptmalloc, jemalloc, and tcmalloc.
- State what LD_PRELOAD does and why it is useful for allocator testing.
- State what MALLOC_ARENA_MAX controls in glibc.
- State the difference between virtual size, RSS, and actually-in-use bytes.
- Explain why a service's RSS can grow while its working set stays flat.
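As a companion to the virtual-size and RSS prompts, here is a minimal sketch of reading both numbers on Linux. It assumes the standard VmSize, VmRSS, and VmHWM fields of /proc/<pid>/status; the function names are illustrative. Live-object bytes do not appear in /proc at all and have to come from the program's own accounting or from allocator statistics.

```python
import re

# Illustrative excerpt in the shape of /proc/<pid>/status (values are made up).
SAMPLE = "VmPeak:\t  200000 kB\nVmSize:\t  180000 kB\nVmHWM:\t   52000 kB\nVmRSS:\t   48000 kB\n"

def parse_status_kib(status_text):
    """Return VmSize (virtual), VmRSS (resident), VmHWM (peak resident) in KiB."""
    fields = {}
    for line in status_text.splitlines():
        match = re.match(r"(VmSize|VmRSS|VmHWM):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields

def read_self_status():
    """Linux-only: read the fields for the calling process."""
    with open("/proc/self/status") as f:
        return parse_status_kib(f.read())

if __name__ == "__main__":
    print(parse_status_kib(SAMPLE))
```

Calling read_self_status() before and after a large allocation that is never touched is one way to watch VmSize move while VmRSS stays put.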
Compare and Distinguish
Separate these:
- internal fragmentation versus external fragmentation
- allocator-level caching versus the kernel page cache
- RSS versus "live-object bytes"
- free returning memory to the allocator versus to the kernel
- thread-local caches versus per-arena locks
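For the first pair, a small worked calculation can make internal fragmentation concrete. The sketch below assumes a made-up table of size classes; it is not the table of glibc, jemalloc, or tcmalloc, and the function names are illustrative.

```python
import bisect

# Hypothetical size classes in bytes; real allocators use their own tables.
SIZE_CLASSES = [16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256]

def rounded_size(request):
    """Smallest size class that can hold the request."""
    index = bisect.bisect_left(SIZE_CLASSES, request)
    if index == len(SIZE_CLASSES):
        raise ValueError("request exceeds the largest size class in this sketch")
    return SIZE_CLASSES[index]

def internal_waste(requests):
    """Return (wasted_bytes, reserved_bytes) for a list of request sizes."""
    requested = sum(requests)
    reserved = sum(rounded_size(r) for r in requests)
    return reserved - requested, reserved

if __name__ == "__main__":
    print(internal_waste([32, 64, 128]))   # sizes that land exactly on a class
    print(internal_waste([33, 65, 129]))   # one byte past each class boundary
```

Under these assumed classes, requests of exactly 32, 64, and 128 bytes waste nothing, while 33, 65, and 129 bytes together waste 61 bytes to rounding. That waste lives inside allocated blocks, which is what separates it from external fragmentation.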
Common Mistake Check
Identify the error:
- "I called free, so the RSS should drop immediately."
- "glibc malloc is slow because it is old; newer allocators will always be faster."
- "Switching allocators is risky because it changes program behavior."
- "Valgrind says no leaks, so the growing RSS is not caused by allocations."
- "The allocator only matters for multithreaded programs."
Mini Application
Lab A: Single-threaded microbenchmark
Write a program that:
- Allocates and frees 10 million objects with sizes drawn from a distribution: small sizes chosen uniformly from values such as 32, 64, and 128 bytes, plus a long tail of large sizes.
- Measures total wall-clock time and RSS at the end.
Run under:
- default glibc
- jemalloc via LD_PRELOAD=$(jemalloc-config --libdir)/libjemalloc.so.$(jemalloc-config --revision)
- tcmalloc via LD_PRELOAD=/usr/lib/libtcmalloc.so (path may vary)
Record: throughput (allocs/sec), peak RSS, RSS at exit.
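One way to collect these numbers consistently is a small harness that launches the same benchmark under each allocator. The sketch below assumes Linux (where ru_maxrss is reported in KiB), Python 3.9 or newer, and a hypothetical benchmark binary; build_env and run_once are illustrative names. RSS at exit is not visible from outside the process, so the benchmark itself would need to print it.

```python
import os
import subprocess
import sys
import time

def build_env(preload_path=None, extra=None):
    """Copy the current environment, clearing or setting LD_PRELOAD."""
    env = dict(os.environ)
    env.pop("LD_PRELOAD", None)          # no preload means the default glibc malloc
    if preload_path:
        env["LD_PRELOAD"] = preload_path
    env.update(extra or {})
    return env

def run_once(cmd, preload_path=None, extra_env=None):
    """Run cmd to completion; report wall time, peak RSS (KiB), and exit code."""
    start = time.perf_counter()
    proc = subprocess.Popen(cmd, env=build_env(preload_path, extra_env))
    _, status, usage = os.wait4(proc.pid, 0)   # rusage of this one child only
    wall = time.perf_counter() - start
    proc.returncode = os.waitstatus_to_exitcode(status)  # child is already reaped
    return {"wall_s": wall, "peak_rss_kib": usage.ru_maxrss,
            "exit_code": proc.returncode}

if __name__ == "__main__":
    # Stand-in command; replace with your own benchmark binary and library paths.
    print(run_once([sys.executable, "-c", "pass"]))
```

Throughput is then the number of allocations the benchmark performed divided by wall_s. Treat very small peak RSS readings with some suspicion and cross-check them against VmHWM from inside the benchmark.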
Lab B: Multithreaded contention
Extend the program above to N threads, each running the allocation/free loop independently. Measure throughput, and use scheduler counters as a proxy for per-thread stalls (perf stat -e task-clock,context-switches,cpu-migrations).
Under glibc, try setting MALLOC_ARENA_MAX=1 and MALLOC_ARENA_MAX=N to see how arenas affect contention.
Compare glibc vs jemalloc vs tcmalloc on 1, 4, 16, 64 threads. Plot throughput vs threads.
Lab C: RSS shape under a long-running workload
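Simulate a service: a main loop that allocates 100,000 16-KiB objects (about 1.5 GiB per round), frees 90% of them, and repeats for 1,000 rounds. Decide up front how long the surviving 10% live; if they are never freed, the live set grows by roughly 156 MiB per round. Record RSS after every 10 rounds.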
Repeat under glibc and jemalloc (and optionally tcmalloc).
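You will often see jemalloc return freed pages to the kernel more aggressively, so its RSS falls after each free phase, while glibc may hold RSS near a much higher watermark. The exact shape depends on allocator version and tuning.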
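The loop can be prototyped through ctypes before committing to a compiled version. The sketch below makes several assumptions that are not in the lab text: survivors are chosen at random, each round's survivors are freed once the next batch exists (so the live set stays bounded), and every object is touched so it counts toward RSS. ctypes.CDLL(None) resolves malloc and free in the process's global scope, which is where an LD_PRELOADed allocator appears. Names such as run_rounds are illustrative.

```python
import ctypes
import random

libc = ctypes.CDLL(None)                      # global scope: sees a preloaded malloc
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]

def rss_kib():
    """Linux-only: current resident set size of this process in KiB."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found")

def run_rounds(rounds, objects, size, keep_fraction=0.1, sample_every=10,
               read_rss=rss_kib, seed=0):
    """Allocate, free most objects, repeat; return [(round, rss_kib), ...]."""
    rng = random.Random(seed)
    survivors, samples = [], []
    for rnd in range(1, rounds + 1):
        batch = []
        for _ in range(objects):
            ptr = libc.malloc(size)
            if not ptr:
                raise MemoryError("malloc returned NULL")
            ctypes.memset(ptr, 1, size)       # touch pages so they become resident
            batch.append(ptr)
        for ptr in survivors:                 # assumption: old survivors die now
            libc.free(ptr)
        rng.shuffle(batch)                    # assumption: survivors picked at random
        keep = int(objects * keep_fraction)
        survivors, doomed = batch[:keep], batch[keep:]
        for ptr in doomed:
            libc.free(ptr)
        if rnd % sample_every == 0:
            samples.append((rnd, read_rss()))
    for ptr in survivors:
        libc.free(ptr)
    return samples
```

A reduced run such as run_rounds(rounds=50, objects=10_000, size=16 * 1024) keeps the runtime manageable. The interpreter's own allocations go through the same malloc and add noise, and ctypes call overhead dominates timing, so this is a way to look at RSS shape, not throughput.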
Lab D: Fragmentation stress
Design an allocation pattern that maximizes external fragmentation for glibc (e.g., allocate many 4000-byte objects, free every other one, then try to allocate 8000-byte objects). Measure RSS under glibc vs. jemalloc vs. tcmalloc.
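Before measuring real allocators, the example pattern can be checked against a toy model. The sketch below is a first-fit free list with coalescing, written only to show why the pattern hurts; it is not how glibc, jemalloc, or tcmalloc manage memory, and all names are illustrative.

```python
def coalesce(heap):
    """Merge adjacent free blocks; heap is a list of [size, in_use] in address order."""
    merged = []
    for size, used in heap:
        if merged and not used and not merged[-1][1]:
            merged[-1][0] += size
        else:
            merged.append([size, used])
    return merged

def first_fit(heap, request):
    """Use the first free block that fits, splitting it; otherwise grow the heap."""
    for i, (size, used) in enumerate(heap):
        if not used and size >= request:
            heap[i] = [request, True]
            if size > request:
                heap.insert(i + 1, [size - request, False])
            return "reused"
    heap.append([request, True])
    return "grew"

def stress(n_small=1000, small=4000, big=8000, every_other=True):
    """Free half of the small blocks, then request big blocks of equal total size."""
    heap = [[small, True] for _ in range(n_small)]
    victims = heap[::2] if every_other else heap[: n_small // 2]
    for block in victims:
        block[1] = False
    heap = coalesce(heap)
    outcomes = [first_fit(heap, big) for _ in range(n_small // 4)]
    total = sum(size for size, _ in heap)
    live = sum(size for size, used in heap if used)
    return {"heap_bytes": total, "live_bytes": live,
            "free_bytes": total - live, "grew": outcomes.count("grew")}

if __name__ == "__main__":
    print(stress(every_other=True))    # holes are isolated, none fits 8000 bytes
    print(stress(every_other=False))   # one contiguous hole, every request fits
```

In both runs the freed bytes equal the bytes requested afterwards. Only the layout of the holes differs, which is the effect the lab is trying to provoke in a real allocator.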
Interpretation
After each lab, write a one-paragraph conclusion naming:
- which allocator won on throughput
- which allocator won on RSS
- any surprising result and its likely cause (arena contention, size-class rounding, release policy)
Evidence Check
This clinic is complete only if you can:
- produce a single table comparing all three allocators on a single workload for: peak RSS, RSS at exit, throughput, and 99th-percentile allocation latency
- explain, in writing, at least one case where switching from glibc to jemalloc significantly changed RSS on your machine and why
- defend a choice of allocator for a hypothetical long-running memory-sensitive service
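For the comparison table, the 99th-percentile figure needs an explicit definition. The sketch below uses the nearest-rank method, which is one of several common percentile definitions, and assumes the benchmark has recorded one latency sample per allocation in nanoseconds. The function names and row fields are illustrative.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

def table_row(name, peak_rss_kib, exit_rss_kib, allocs, wall_s, latencies_ns):
    """One row of the comparison table, with RSS converted to MiB."""
    return {
        "allocator": name,
        "peak_rss_mib": peak_rss_kib / 1024,
        "exit_rss_mib": exit_rss_kib / 1024,
        "allocs_per_s": allocs / wall_s,
        "p99_ns": percentile_nearest_rank(latencies_ns, 99),
    }
```

Timing every allocation adds overhead of its own, so it is reasonable to collect latency samples in a separate run from the one used for throughput.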