The Hidden Cost: TLB, Cache, and Branch Predictor
What This Concept Is
A context switch's direct cost (register save/restore, running schedule()) is small: a few hundred nanoseconds to a few microseconds, depending on hardware and enabled mitigations. Its indirect cost can be tens or hundreds of microseconds, driven by:
- TLB (Translation Lookaside Buffer) invalidation. The TLB caches virtual->physical translations. Switching address spaces traditionally flushes it. Modern CPUs tag entries with PCIDs / ASIDs so a full flush can often be skipped, but the incoming process still finds few of its own translations there and takes a TLB miss on the first touch of each page until its working set is mapped again.
- L1/L2/L3 cache cold-start. The new process's working set is not in cache. It accrues cache misses until the hot set is reloaded -- and meanwhile the old process's cache lines are evicted.
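- Branch predictor / BTB pollution. Predictor state belongs to the core, not to the process, so the incoming process inherits history trained on the previous one's branches and mispredicts until it retrains.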
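- Speculation-mitigation cost: since Spectre/Meltdown, switching between processes may trigger an IBPB (and, where configured, an L1d flush), adding several thousand cycles. Kernel page-table isolation also makes every kernel entry more expensive.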
The direct cost is what benchmarks call "context switch time." The indirect cost is why your application runs slower for milliseconds after a switch.
Why It Matters Here
Every earlier answer to "just pick a smaller quantum for better response" implicitly assumed switches are free. They are not. An accurate model is:
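effective_CPU_time = run_time - n_switches * (switch_cost_direct + warmup_cost_per_switch)
where switch_cost_direct is the register save/restore and scheduler work, and warmup_cost_per_switch is the indirect tail -- often 10-100x the direct cost on memory-intensive workloads. Both terms are paid once per switch, so both scale with n_switches.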
This is also the reason CPU affinity matters: keeping a thread on the same core preserves its TLB entries and cache lines.
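The cost model above can be sketched in a few lines of Python. This is only an illustration: it reads the warm-up term as being paid once per switch, and the per-switch costs (2 µs direct, 20 µs warm-up) are made-up values, not measurements.

```python
def effective_cpu_time(run_time_s, n_switches, direct_cost_s, warmup_cost_s):
    """CPU time left for useful work once both per-switch costs are paid."""
    return run_time_s - n_switches * (direct_cost_s + warmup_cost_s)


if __name__ == "__main__":
    # Hypothetical per-switch costs, chosen only for illustration.
    direct = 2e-6    # 2 us of register save/restore and scheduler work
    warmup = 20e-6   # 20 us of TLB and cache refill, 10x the direct cost
    for n in (10, 1_000, 10_000):
        useful = effective_cpu_time(1.0, n, direct, warmup)
        print(f"{n:>6} switches/s -> {useful:.2%} useful CPU")
```

Plugging in costs measured on your own workload, rather than these placeholders, is what makes the model useful for choosing a quantum.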
Concrete Example
Consider a memory-bound microbenchmark: two processes on one core, round-robin with various quanta. The numbers are hypothetical but track real shapes; they assume roughly 4 µs of direct cost and 20 µs of indirect cost per switch, with one switch per quantum expiry:
| Quantum | Switches/sec | Direct cost/sec | Indirect cost/sec | Useful CPU |
|---|---|---|---|---|
| 100 ms | 10 | 40 µs | 200 µs | 99.98% |
| 10 ms | 100 | 400 µs | 2 ms | 99.76% |
| 1 ms | 1000 | 4 ms | 20 ms | 97.6% |
| 100 µs | 10k | 40 ms | 200 ms | 76% |
At quantum = 100 µs, nearly a quarter of the CPU goes to switching, and most of that (200 ms of every 240 ms lost) is "warming caches between switches" rather than the switch itself. That is the concrete face of the hidden cost.
Common Confusion / Misconception
"lat_ctx tells me context switches cost 5 µs, period." lat_ctx measures the round-trip through two tiny processes, so the hidden cost is near zero (working set fits in cache). On a real workload with a 10 MB hot set, the same switch pays 100x more indirectly. Always measure on a workload shaped like yours.
"Thread switches are free because there is no CR3 change." Thread switches within one process skip the address-space change and its TLB flush, but the threads still evict each other from L1. With a typical 32 KB L1 data cache, a thread that touches 64 KB has pushed every one of the other thread's lines out.
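The eviction argument can be checked with a toy model. The sketch below treats L1 as a fully associative LRU cache with 64-byte lines; real L1 caches are set-associative with hardware-specific replacement policies, so this is a simplification, and every name in it is hypothetical.

```python
from collections import OrderedDict

LINE_BYTES = 64  # common cache-line size; an assumption of this model


class LruCache:
    """Toy fully associative LRU cache that tracks which lines are resident."""

    def __init__(self, capacity_bytes):
        self.max_lines = capacity_bytes // LINE_BYTES
        self.lines = OrderedDict()  # key: (owner, line index), oldest first

    def touch(self, owner, n_bytes):
        """Access n_bytes of owner's working set, one cache line at a time."""
        for i in range(n_bytes // LINE_BYTES):
            key = (owner, i)
            if key in self.lines:
                self.lines.move_to_end(key)         # hit: mark most recent
            else:
                if len(self.lines) >= self.max_lines:
                    self.lines.popitem(last=False)  # evict least recent
                self.lines[key] = True

    def resident_bytes(self, owner):
        """Bytes belonging to owner that are still in the cache."""
        return LINE_BYTES * sum(1 for o, _ in self.lines if o == owner)


def surviving_after_switch(cache_bytes, a_bytes, b_bytes):
    """Bytes of thread A still cached after thread B has run once."""
    cache = LruCache(cache_bytes)
    cache.touch("A", a_bytes)
    cache.touch("B", b_bytes)
    return cache.resident_bytes("A")


if __name__ == "__main__":
    kib = 1024
    for b_kib in (8, 24, 64):
        left = surviving_after_switch(32 * kib, 16 * kib, b_kib * kib)
        print(f"B touches {b_kib:>2} KiB -> {left // kib} KiB of A survives")
```

In this model, thread A's 16 KiB survives intact only while A and B together fit in the 32 KiB cache; once B touches 64 KiB, nothing of A is left, CR3 change or not.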
How To Use It
When scheduling overhead shows up in a flame graph or perf output:
- Count switches per second (perf stat -e context-switches). High numbers flag an over-sharing or too-small-quantum problem.
- Look at TLB miss rate before and after (perf stat -e dTLB-load-misses,iTLB-load-misses).
- Consider pinning with taskset or sched_setaffinity for cache-sensitive workers.
- Raise the quantum (sysctl kernel.sched_min_granularity_ns on kernels that expose it; newer kernels moved the scheduler tunables to debugfs) only as a last resort; it hurts interactivity.
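When the process under study is itself a Python program, two of the steps above can also be done from inside it. The sketch below is one possible approach, not a replacement for perf: it reads the per-process switch counters from getrusage and pins the process with os.sched_setaffinity (Linux only). The helper names are hypothetical, and it cannot report TLB misses.

```python
import os
import resource


def switch_counts():
    """Return (voluntary, involuntary) context switches for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def pin_to_first_allowed_cpu():
    """Pin this process to one CPU it may already run on (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform without the affinity API
    cpu = min(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {cpu})
    return cpu


def measure(workload):
    """Run workload() and return the switches it incurred as (vol, invol)."""
    vol0, invol0 = switch_counts()
    workload()
    vol1, invol1 = switch_counts()
    return vol1 - vol0, invol1 - invol0


if __name__ == "__main__":
    print("pinned to CPU", pin_to_first_allowed_cpu())
    vol, invol = measure(lambda: sum(i * i for i in range(5_000_000)))
    print(f"voluntary={vol} involuntary={invol}")
```

A high involuntary count means the scheduler is preempting the process; a high voluntary count usually points at blocking I/O or lock contention instead.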
Check Yourself
- Why does a thread-to-thread switch in the same process avoid a TLB flush but still pay cache cost?
- How do tagged TLB entries (PCID on Intel x86, ASIDs on other architectures) avoid a full TLB flush on process switches?
- Why is a switch on a memory-bound workload more expensive than on a CPU-bound but small-working-set one?
Mini Drill or Application
On a Linux machine:
- Run two CPU-bound loops in separate processes pinned to the same core.
- Measure wall time vs CPU time using /usr/bin/time -v.
- Repeat on two separate cores.
- Repeat with both pinned to the same core but using threads within a single process.
Write one paragraph explaining the observed ranking of wall times.
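One way to script the first three steps is sketched below, assuming Linux and a Python interpreter. The helper name run_pair and the iteration count are hypothetical, and interpreter start-up time is included in the measurements. The threads variant is left out on purpose: on standard CPython builds the GIL would serialize the threads and distort the comparison, so a C or Rust loop is a better fit for that step.

```python
import os
import subprocess
import sys
import time

# Child program: pin itself to one CPU, then spin on integer arithmetic.
CHILD = """
import os, sys
cpu, iterations = int(sys.argv[1]), int(sys.argv[2])
os.sched_setaffinity(0, {cpu})
total = 0
for i in range(iterations):
    total += i * i
"""


def run_pair(cpu_a, cpu_b, iterations=5_000_000):
    """Run two CPU-bound children pinned to cpu_a and cpu_b.

    Returns (wall_seconds, child_cpu_seconds) for the pair.
    """
    before = os.times()
    start = time.perf_counter()
    children = [
        subprocess.Popen([sys.executable, "-c", CHILD, str(cpu), str(iterations)])
        for cpu in (cpu_a, cpu_b)
    ]
    for child in children:
        child.wait()
    wall = time.perf_counter() - start
    after = os.times()
    # Children's CPU time is credited to the parent once they are waited for.
    cpu_time = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu_time


if __name__ == "__main__":
    cpus = sorted(os.sched_getaffinity(0))
    same = run_pair(cpus[0], cpus[0])
    print(f"same core:      wall={same[0]:.2f}s cpu={same[1]:.2f}s")
    if len(cpus) > 1:
        split = run_pair(cpus[0], cpus[1])
        print(f"separate cores: wall={split[0]:.2f}s cpu={split[1]:.2f}s")
```

Comparing the wall-to-CPU ratio between the two runs gives the raw material for the paragraph the drill asks for.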