Skip to main content

Linux CFS Internals: Red-Black Tree and Virtual Runtime

What This Concept Is

Linux CFS (Completely Fair Scheduler) implements fairness as: "run the task that has received the least CPU so far, weighted by nice." The concrete mechanism:

  • Each task has a virtual runtime (vruntime), a counter of how much CPU it has been given.
  • CPU time is added to vruntime scaled by a weight derived from the nice value. Nice 0 uses weight 1024; nice -5 uses weight ~3121; nice +5 uses weight ~335. Lower nice = larger weight = vruntime grows slower per real second = more actual CPU.
  • Runnable tasks are kept in a red-black tree keyed on vruntime. Per-CPU, one tree.
  • The "next task to run" is the leftmost node (smallest vruntime). Balanced-tree insert/remove is O(log n); leftmost lookup is O(1) with a cached pointer.
  • A sched latency target (≈ 6 ms at default tuning) and min granularity (≈ 0.75 ms) bound how often switching happens.
  • Newly awakened tasks are inserted with a vruntime close to the minimum to avoid starving others by "unfair catch-up."

There is no explicit time quantum; the scheduler decides "run this task until someone else's vruntime catches up, then switch."

Why It Matters Here

CFS is the production answer to "how do you make MLFQ rigorous?" It is also the default scheduler on essentially every Linux server you will ever run code on, so understanding its primitives tells you what nice, sched_setaffinity, cpu.shares cgroups, and ionice are actually doing.

Concrete Example

Three tasks, all runnable, all nice 0, weights 1024 each:

  red-black tree (keyed on vruntime):
[T2: 3.2]
/ \
[T1: 1.1] [T3: 4.7]
leftmost = T1 -> dispatched

After T1 runs for 2 ms of real time, its vruntime increases by 2 ms * (1024/1024) = 2 ms:

            [T2: 3.2]
/ \
[T1: 3.1] [T3: 4.7]
leftmost = T1 still (by a hair)

Another tiny slice, then T1's vruntime exceeds T2's, tree rebalances, T2 becomes leftmost.

Now a nice--5 task T4 joins with weight 3121:

  • T4 runs for 2 ms -> vruntime increase is 2 * (1024 / 3121) ≈ 0.66 ms.
  • T4 stays leftmost (or near) longer per unit of wall time.
  • Result: T4 gets roughly the CPU of nice-0 peers.

Leftmost-is-next + weighted vruntime is the whole scheduler, modulo load balancing and sleeper handling.

Common Confusion / Misconception

"CFS gives every task equal CPU." Only when weights are equal. The rule is proportional sharing. Two tasks with weight ratio 2:1 get CPU in ratio 2:1 over any long enough window.

"Why a red-black tree instead of a heap?" Both support O(log n) insert/delete. A red-black tree also provides O(log n) ordered iteration, which CFS uses for maintaining min_vruntime and for the balancer. A heap would require more work for the same ordered operations.

How To Use It

Knobs you can actually turn:

  • nice(+/-N) or renice -- changes a task's weight.
  • cpu.shares cgroup -- applies the same weight mechanism to a whole group of tasks (Concept 15).
  • sched_setscheduler(SCHED_BATCH / SCHED_IDLE) -- tweak how the task is inserted into the tree.
  • /proc/sys/kernel/sched_latency_ns, sched_min_granularity_ns, sched_wakeup_granularity_ns -- tune latency target vs overhead.

Changing the last three is usually wrong outside specialized workloads; they exist because defaults are compromises.

Check Yourself

  1. Why does scaling by weight make vruntime a correct measure of "how much of its share has this task consumed"?
  2. Why is "pick the task with smallest vruntime" equivalent to "pick the task most behind its fair share"?
  3. What would break if the tree key were actual runtime instead of virtual runtime?

Mini Drill or Application

  1. chrt -p <pid> and nice -n 10 yes > /dev/null & -- observe scheduling class and how top reports CPU shares between a nice-0 and nice-10 worker.
  2. Run two CPU-bound workers with nice 0 and nice -5 on the same core; compute the observed CPU ratio and compare against the theoretical weight ratio (3121/1024 ≈ 3.05).

Read This Only If Stuck