Multi-Core Scheduling and Load Balancing
What This Concept Is
A single scheduler queue across many cores is simple but does not scale: every core contends on the same lock on every scheduling event. Modern kernels use per-CPU run queues plus a load balancer that migrates tasks across queues.
Key ideas:
- Per-CPU run queues: each core has its own runqueue and picks locally, avoiding global lock contention.
- Load balancing: when one queue is overloaded and another idle, the kernel moves tasks. Linux CFS runs its balancer periodically (via
scheduler_tickand on idle-entry). - CPU affinity: a process's allowed set of CPUs, via
sched_setaffinity(2). Hard affinity means "never run elsewhere"; soft affinity (default) is a preference respected by the balancer. - Scheduling domains: Linux groups CPUs hierarchically -- SMT siblings, cores of a socket, NUMA node, whole system -- and balances preferentially at the cheapest level first.
- Work stealing: idle CPUs pull from busy ones rather than busy ones pushing; keeps hot cores productive.
- Cache hotness and NUMA affinity: moving a task costs you its L1/L2 and its memory locality on NUMA systems.
Why It Matters Here
Almost every machine you care about is multi-core. Uniprocessor scheduling concepts are necessary but not sufficient. Multi-core adds three new decisions the scheduler must make:
- Which CPU to run a newly woken task on?
- When to migrate tasks between CPUs?
- How to respect affinity and NUMA locality while keeping load balanced?
Getting these wrong shows up as: cores idle while others are overloaded, tasks bouncing between CPUs, NUMA memory fetches dominating runtime.
Concrete Example
Four cores, four tasks, pinned to core 0 by mistake:
Core 0: T1 T2 T3 T4 (all queued here)
Core 1-3: idle
Each task gets 25% of one core and all four fight for the same L1/L2. Total useful throughput ≈ 1 core. Unpinning:
Core 0: T1 (100%)
Core 1: T2 (100%)
Core 2: T3 (100%)
Core 3: T4 (100%)
Throughput ≈ 4×.
Conversely, on a NUMA system with memory allocated on node 0:
Core 0 (node 0): T1 runs fast (local memory)
Core 8 (node 1): same T1 migrated runs slow (remote memory, 2-3x latency)
This is why Linux has numactl, numa_balancing, and why databases and scientific codes tune affinity explicitly.
Work stealing example on an idle-wakeup path:
Core 0 runqueue: T1, T2, T3, T4 (load = 4)
Core 1 runqueue: empty (idle)
Scheduler tick on core 1 -> runs balance_fair -> steals T3, T4 from core 0.
After:
Core 0: T1, T2 (load = 2)
Core 1: T3, T4 (load = 2)
Common Confusion / Misconception
"More cores always means faster." Only if the work is parallelizable and the scheduler can keep cores useful. Lock contention, cache line bouncing ("false sharing"), NUMA penalties, and synchronization costs can erase gains. Amdahl's law still applies; the scheduler cannot escape it.
"Affinity is always a win." Hard affinity can hurt when a burst of wakeups targets one core while others sit idle. Let the scheduler balance unless you have measured a specific problem.
How To Use It
When investigating a performance problem on a multi-core box:
mpstat -P ALL 1-- is the load balanced? If not, why?perf stat -a -e migrations-- are tasks bouncing unnecessarily?numactl --hardware-- what NUMA topology am I on?taskset -pc <pid>-- what affinity does the process have?/proc/<pid>/sched-- per-task scheduler counters.
Change affinity only after these tell you something.
Check Yourself
- Why is a single global run queue a problem at 64+ cores?
- What is the difference between push-based and pull-based (work-stealing) balancing?
- Why might migrating a task cost more on a NUMA machine than on a uniform-memory one?
Mini Drill or Application
On a multi-core Linux host:
- Run
stress-ng --cpu 4and watchmpstat -P ALL 1. Note how cores fill. - Now run
taskset -c 0 stress-ng --cpu 4. Watch what happens. - Run
perf stat -e migrations,context-switches -- sleep 5while four CPU-bound tasks are competing for three cores. Read the migration count. - In one paragraph, describe what the numbers tell you about the balancer's behavior.