Userland Allocators: dlmalloc, ptmalloc, jemalloc, tcmalloc
What This Concept Is
A userland allocator sits between the program and the kernel. It asks the kernel for memory in large chunks (via sbrk, mmap, or brk) and divides those chunks into application-sized allocations (malloc, free, calloc, realloc).
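The "ask for a large chunk, then divide it" idea can be sketched in a few lines. This is a minimal bump allocator, not how any of the allocators below actually work: it never frees, and the names `region_t`, `region_init`, and `region_alloc` are hypothetical, chosen only for illustration.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical bookkeeping for one chunk obtained from the kernel. */
typedef struct {
    uint8_t *base;  /* start of the mmap'd region */
    size_t   size;  /* total bytes obtained from the kernel */
    size_t   used;  /* bytes already handed out */
} region_t;

/* Ask the kernel for one large anonymous mapping. Returns 0 on success. */
int region_init(region_t *r, size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    r->base = p;
    r->size = size;
    r->used = 0;
    return 0;
}

/* Carve n bytes out of the region, rounded up to a 16-byte multiple.
   Returns NULL when the region is exhausted. */
void *region_alloc(region_t *r, size_t n) {
    size_t rounded = (n + 15) & ~(size_t)15;
    if (rounded < n || rounded > r->size - r->used) return NULL;
    void *p = r->base + r->used;
    r->used += rounded;
    return p;
}
```

One kernel call pays for many allocations; everything a real allocator adds (free lists, size classes, per-thread pools) is about reusing and returning that memory well.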
Four widely used allocators:
- dlmalloc (Doug Lea, early 1990s). The ancestor. Single heap, boundary tags on every chunk, segregated free lists by size class. Simple, sequential, single-threaded origin.
- ptmalloc (Wolfram Gloger). glibc's default. Extends dlmalloc with multiple per-thread arenas to reduce lock contention, dynamic arena creation, and some mmap thresholds.
- jemalloc (Jason Evans, FreeBSD and Facebook). Multiple arenas assigned to threads by hashing, strict size classes, small/large/huge pool separation, strong statistics, low fragmentation, good for long-running services.
- tcmalloc (Google). Per-thread caches at the front, central free lists per size class behind, page heap at the back. Very fast small-allocation path, excellent multithreaded scaling.
All four group requests by size class. The three thread-aware designs (ptmalloc, jemalloc, tcmalloc) add pools per thread or per arena and fall back to a central or global layer when a local pool runs dry.
Why It Matters Here
Allocator choice is a measurable performance decision:
- Multithreaded servers doing lots of small allocations routinely see 20-50% throughput improvement from switching glibc -> jemalloc or tcmalloc just by linking differently.
- Services with very spiky memory growth and shrink (short-lived workers inside a long-lived process) often show much lower RSS on jemalloc than on ptmalloc, because jemalloc returns pages to the kernel more aggressively.
- A latency-sensitive service may be ruined by ptmalloc arena contention under load; moving to jemalloc or tcmalloc removes the contention without any code change.
Understanding allocator internals also explains confusing symptoms: "memory leak" that turns out to be fragmentation in an untrimmed heap; CPU time in malloc under lock contention; long pauses from malloc_trim; MALLOC_ARENA_MAX knobs that change memory use dramatically.
Concrete Example
Size classes in jemalloc. Requests are rounded up to a fixed set of sizes: 8, 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256, ... (small), then larger multiples up to "huge" (page-aligned via mmap). Each size class gets its own runs and free lists; within a run, allocation and free are O(1). Internal fragmentation is bounded by the gap between your request and the next size class.
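The rounding rule and the fragmentation bound can be checked with a small lookup. This sketch uses only the small classes listed above; the real jemalloc table continues past 256 and depends on version and build configuration, and the function names here are hypothetical.

```c
#include <stddef.h>

/* Small size classes exactly as listed in the text; not the full table. */
static const size_t SMALL_CLASSES[] = {
    8, 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256
};
#define NUM_SMALL_CLASSES (sizeof SMALL_CLASSES / sizeof SMALL_CLASSES[0])

/* Round a request up to the smallest listed class that fits it.
   Returns 0 for a zero-byte request or one larger than the table covers. */
size_t size_class_for(size_t request) {
    if (request == 0) return 0;
    for (size_t i = 0; i < NUM_SMALL_CLASSES; i++) {
        if (request <= SMALL_CLASSES[i]) return SMALL_CLASSES[i];
    }
    return 0;
}

/* Internal fragmentation: bytes handed out beyond what was asked for. */
size_t wasted_bytes(size_t request) {
    size_t cls = size_class_for(request);
    return cls ? cls - request : 0;
}
```

A 129-byte request lands in the 160-byte class and wastes 31 bytes, which is the worst case for that class: the waste can never exceed the gap to the previous class minus one.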
tcmalloc thread cache. Each thread keeps free lists for common size classes, holding typically up to 64 KiB of free objects. Allocations and frees hit this cache first; no lock. When the cache is too full or too empty, the thread refills from or returns to the central heap. Small allocation path is a handful of assembly instructions.
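The shape of that fast path and slow path can be sketched as follows. This is an illustration of the technique, not tcmalloc's code: it handles a single 64-byte size class, caps the cache by object count rather than bytes, and every name and constant in it (`cache_alloc`, `cache_free`, `BATCH`, `CACHE_MAX`) is a hypothetical choice.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define OBJ_SIZE   64   /* one size class only, for brevity */
#define BATCH      32   /* objects moved per trip to the central list */
#define CACHE_MAX  128  /* per-thread cap before a batch is returned */

typedef struct node { struct node *next; } node_t;

/* Central free list: shared by all threads, protected by a lock. */
static node_t *central_head = NULL;
static pthread_mutex_t central_lock = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread cache: touched only by its owner, so it needs no lock. */
static _Thread_local node_t *cache_head = NULL;
static _Thread_local size_t  cache_count = 0;

static void cache_push(node_t *n) {
    n->next = cache_head;
    cache_head = n;
    cache_count++;
}

/* Slow path: pull up to BATCH objects from the central list, then top
   up from the system allocator if the central list was short. */
static void refill_cache(void) {
    pthread_mutex_lock(&central_lock);
    while (cache_count < BATCH && central_head != NULL) {
        node_t *n = central_head;
        central_head = n->next;
        cache_push(n);
    }
    pthread_mutex_unlock(&central_lock);
    while (cache_count < BATCH) {
        node_t *n = malloc(OBJ_SIZE);
        if (n == NULL) break;
        cache_push(n);
    }
}

/* Slow path: hand up to BATCH objects back to the central list. */
static void drain_cache(void) {
    pthread_mutex_lock(&central_lock);
    for (int i = 0; i < BATCH && cache_head != NULL; i++) {
        node_t *n = cache_head;
        cache_head = n->next;
        cache_count--;
        n->next = central_head;
        central_head = n;
    }
    pthread_mutex_unlock(&central_lock);
}

/* Fast path: pop from the thread-local list without taking a lock. */
void *cache_alloc(void) {
    if (cache_head == NULL) refill_cache();
    if (cache_head == NULL) return NULL;
    node_t *n = cache_head;
    cache_head = n->next;
    cache_count--;
    return n;
}

/* Fast path: push onto the thread-local list without taking a lock. */
void cache_free(void *p) {
    cache_push(p);
    if (cache_count > CACHE_MAX) drain_cache();
}
```

The lock is taken once per batch instead of once per object, which is where the scaling comes from. Note that `cache_free` pushes onto the freeing thread's cache, so an object allocated on one thread and freed on another migrates between caches through the central list.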
ptmalloc arena model. On first thread creation, a new arena may be created. Each arena has its own lock. A thread hashes to an arena; if locked, it may try another. On modern 64-bit glibc, the arena cap defaults to 8 * num_cpus (2 * num_cpus on 32-bit) unless MALLOC_ARENA_MAX overrides it. That is a lot of arenas, and it contributes to RSS bloat in long-lived multithreaded services.
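The arena cap can also be set from inside the process instead of through the environment. The sketch below assumes glibc: `mallopt` and `M_ARENA_MAX` come from glibc's `<malloc.h>` and are not portable. The wrapper names are hypothetical.

```c
#include <malloc.h>   /* glibc-specific: mallopt, M_ARENA_MAX */
#include <unistd.h>   /* sysconf */

/* The 8 * cores default arena cap mentioned above (64-bit glibc).
   Returns -1 if the core count cannot be determined. */
long default_arena_cap_64bit(void) {
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    return cores > 0 ? 8 * cores : -1;
}

/* Cap the number of arenas, the in-process counterpart of setting
   MALLOC_ARENA_MAX. Call it early, before worker threads start
   allocating. mallopt returns 1 on success and 0 on error. */
int cap_arenas(int max_arenas) {
    return mallopt(M_ARENA_MAX, max_arenas);
}
```

On a 16-core machine the default cap works out to 128 arenas. A lower cap trades some lock contention for fewer partially used heaps, so it is worth measuring both throughput and RSS before and after.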
Symptom example. A 16-core Python server under uwsgi sees RSS 24 GiB, working set ~4 GiB, and mysterious 40% time in libc under perf. LD_PRELOAD=/usr/lib/libjemalloc.so.2 ./server drops RSS to ~6 GiB and removes most of the libc time.
Common Confusion / Misconception
"An allocator is just a data-structure library." It is a systems component with its own memory-management policy, threading model, kernel interaction, and observability. Treating it as a black box is how services end up with mysterious memory bloat.
"free returns memory to the OS." Usually not. It returns memory to the allocator's free lists. Whether the allocator returns those pages to the kernel is a separate decision governed by trim/retain policies. jemalloc and tcmalloc are more aggressive about this than ptmalloc.
"All modern allocators are roughly the same." They are not. They differ significantly in fragmentation, scaling, memory-release behavior, and sensitivity to allocation pattern. Changing allocator on a production service often moves RSS by 30%+ without any code change.
How To Use It
When profiling a memory-heavy service:
- Measure malloc/free time with perf or the allocator's own tooling (jemalloc has malloc_stats_print, tcmalloc has MallocExtension::GetStats).
- Check per-thread contention. Multiple threads in __lll_lock_wait inside libc is an arena-contention signature.
- Check RSS vs. working set. If RSS is far above working set and grows monotonically without leaks, suspect fragmentation and allocator policy.
- Consider switching. LD_PRELOAD is a cheap way to test jemalloc or tcmalloc without recompiling.
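For the RSS check, a process can read its own resident size and watch what free and an explicit trim do to it. This is a rough sketch that assumes Linux (`/proc/self/statm`) and glibc (`malloc_trim`); the helper names are hypothetical, and the numbers it prints vary by glibc version and tunables.

```c
#include <malloc.h>   /* glibc-specific: malloc_trim */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Resident set size in KiB, from the second field of /proc/self/statm.
   Returns -1 if the file is unavailable (non-Linux systems). */
long rss_kib(void) {
    long total_pages, resident_pages;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL) return -1;
    int got = fscanf(f, "%ld %ld", &total_pages, &resident_pages);
    fclose(f);
    if (got != 2) return -1;
    return resident_pages * (sysconf(_SC_PAGESIZE) / 1024);
}

/* Allocate `count` blocks of `size` bytes, touch them so they become
   resident, then free all but the last. The survivor sits near the top
   of the heap, so the freed space below it cannot be released just by
   shrinking the end of the heap. Returns the surviving block. */
void *churn(size_t count, size_t size) {
    if (count == 0) return NULL;
    void **blocks = malloc(count * sizeof *blocks);
    if (blocks == NULL) return NULL;
    for (size_t i = 0; i < count; i++) {
        blocks[i] = malloc(size);
        if (blocks[i] != NULL) memset(blocks[i], 1, size);
    }
    for (size_t i = 0; i + 1 < count; i++) free(blocks[i]);
    void *last = blocks[count - 1];
    free(blocks);
    return last;
}

/* Print RSS at three points: baseline, after the frees, after a trim. */
void report_retention(void) {
    long baseline = rss_kib();
    void *pin = churn(100000, 1024);   /* roughly 100 MiB of small blocks */
    long after_free = rss_kib();
    malloc_trim(0);                    /* ask glibc to release free pages */
    long after_trim = rss_kib();
    printf("baseline=%ld KiB after_free=%ld KiB after_trim=%ld KiB\n",
           baseline, after_free, after_trim);
    free(pin);
}
```

If `after_free` stays well above `baseline` and `after_trim` falls back toward it, the gap is memory the allocator was holding on its free lists rather than a leak. Running the same program under a different allocator via LD_PRELOAD gives a quick comparison of release policy, though `malloc_trim` itself is a glibc call and may do nothing useful there.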
Check Yourself
- Why is a single global lock on malloc a scalability killer on modern multi-core hardware?
- What is a "size class," and why does it bound internal fragmentation?
- What does MALLOC_ARENA_MAX control in glibc, and why might setting it lower reduce RSS for some services?
- Why does a per-thread cache in tcmalloc make the fast path nearly lock-free?
- Under what workload would ptmalloc and jemalloc perform roughly the same?
Mini Drill or Application
- Write a microbenchmark: N threads, each doing malloc(64); free(...) in a loop. Measure throughput under glibc, jemalloc, and tcmalloc via LD_PRELOAD. (You will do this fully in Practice 3.)
- A service's RSS grows 50% over a week while working set stays flat. Is this a leak, fragmentation, or arena accumulation? How would you tell?
- What does MALLOC_CONF="stats_print:true" do for a jemalloc-linked process?
- Explain why a pure segregated free-list allocator (without size classes) can suffer from fragmentation that a size-class allocator does not.
- Describe an allocation pattern that defeats tcmalloc's thread-cache optimization (hint: cross-thread free).
Read This Only If Stuck
- OSTEP: 14.1 Types of Memory
- OSTEP: 14.2 The malloc Call
- OSTEP: 14.4 Common Errors
- OSTEP: 14.7 Summary
- OSTEP: 17.1 Assumptions (Free-Space Management)
- OSTEP: 17.2 Low-level Mechanisms
- OSTEP: 17.2 Low-level Mechanisms (Part 2)
- OSTEP: G.3 Memory Allocation Library
- Operating System Concepts: D.6.2 User-Level Memory Managers