How a Context Switch Actually Works

What This Concept Is

A context switch is the operation the kernel performs to stop one thread and start another on the same CPU. At the instruction level it is "save the old context into its PCB, load the new context from the new PCB, return to user mode at the new PC."

The trigger is always one of:

  • timer interrupt fired by the hardware at the end of a quantum
  • voluntary yield via a blocking syscall (read, wait, futex_wait, etc.)
  • higher-priority task becomes runnable (e.g., wake-up from I/O)
  • explicit reschedule request (sched_yield from user space, or a kernel preemption point such as cond_resched)

The save/restore mechanism is the same in all cases; only the reason for entering the kernel differs.
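
The difference between voluntary and involuntary switches can be observed from user space through the counters that getrusage(2) reports. The following is a minimal sketch, assuming Linux and Python 3; the helper names count_switches, blocking_work, and busy_work are made up for illustration, and the counts will vary with machine and load.

```python
import resource
import time


def count_switches(work):
    """Run work() and return (voluntary, involuntary) context-switch deltas."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    work()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_nvcsw - before.ru_nvcsw,
            after.ru_nivcsw - before.ru_nivcsw)


def blocking_work():
    # Each sleep blocks in the kernel, so the task gives up the CPU voluntarily.
    for _ in range(20):
        time.sleep(0.001)


def busy_work():
    # Pure computation does not block; switches here should mostly be preemptions.
    total = 0
    for i in range(2_000_000):
        total += i
    return total


if __name__ == "__main__":
    print("blocking (voluntary, involuntary):", count_switches(blocking_work))
    print("busy     (voluntary, involuntary):", count_switches(busy_work))
```

On a typical Linux system the blocking run should show roughly one voluntary switch per sleep, while the busy loop shows few or none unless other work competes for the CPU.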

Why It Matters Here

"Context switch" is the glue behind every multi-tasking behavior. Without seeing the instruction-level dance, you cannot reason about its cost (Concept 11), about what threads share (Concept 12), or about why TLB flushes matter (Concept 11 again).

Concrete Example

Simplified x86-64 switch from process A to B, triggered by a timer interrupt while A is in user mode:

  1. Hardware trap entry. CPU receives interrupt. It switches to kernel mode, pushes A's user SS, RSP, RFLAGS, CS, RIP onto A's kernel stack. The CPU is now executing the timer interrupt handler.

  2. Save A's GPRs. The handler's entry code pushes the rest of A's general-purpose registers (rax, rbx, ..., r15) onto A's kernel stack, next to the frame the hardware pushed. Floating-point and vector state is not saved here; the kernel saves it with XSAVE (FXSAVE on older CPUs) into A's task_struct later, when it actually switches away from A.

  3. Scheduler decides. schedule() looks at the run queue, picks B.

  4. Switch stacks. The kernel swaps CR3 (page-table base) only if the address space changes -- on Linux this is true when switching between processes, not between threads of the same process. It then saves A's kernel stack pointer in A's task_struct and loads B's. This is the narrow moment that is the context switch.

  5. Restore B's GPRs. Pop B's saved registers from B's kernel stack.

  6. iret (iretq on x86-64). Returns to user mode, popping the frame saved when B last entered the kernel (RIP, CS, RFLAGS, RSP, SS). B resumes at the exact instruction where it was last preempted.

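On Linux, switch_to() in arch/x86/include/asm/switch_to.h is the macro around the stack switch in step 4; the CR3 change happens just before it, in context_switch() in kernel/sched/core.c.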
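
The six steps can be condensed into a toy model that makes the save, the conditional CR3 reload, and the restore explicit. This is a sketch for reasoning only, not kernel code: the Task and CPU classes, their fields, and context_switch() are all hypothetical, and the saved field stands in for wherever the kernel keeps a task's saved registers.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    mm: int                                     # stand-in for the page-table root
    saved: dict = field(default_factory=dict)   # registers saved at last switch-out


@dataclass
class CPU:
    regs: dict          # live registers, including "rip"
    cr3: int            # page-table root currently loaded
    cr3_loads: int = 0  # number of times CR3 was rewritten


def context_switch(cpu, prev, nxt):
    """Switch the CPU from prev to nxt (toy version of steps 1-6)."""
    # Steps 1-2: save the outgoing task's register state.
    prev.saved = dict(cpu.regs)
    # Step 4: reload CR3 only when the address space actually changes.
    if nxt.mm != cpu.cr3:
        cpu.cr3 = nxt.mm
        cpu.cr3_loads += 1
    # Steps 5-6: restore the incoming task's registers; it resumes at its saved rip.
    cpu.regs = dict(nxt.saved)


if __name__ == "__main__":
    a = Task("A", mm=0x1000)
    b = Task("B", mm=0x2000, saved={"rip": 0x400800, "rax": 7})
    b_thread = Task("B-thread", mm=0x2000, saved={"rip": 0x400900, "rax": 0})
    cpu = CPU(regs={"rip": 0x400123, "rax": 1}, cr3=a.mm)

    context_switch(cpu, a, b)          # process -> process: CR3 changes
    context_switch(cpu, b, b_thread)   # thread -> thread: CR3 untouched
    print("CR3 loads:", cpu.cr3_loads, "resumed rip:", hex(cpu.regs["rip"]))
```

The second switch shares an mm value with the first target, which is the model's version of two threads in one process.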

Common Confusion / Misconception

"A context switch is just saving and restoring registers." The register dance is fast -- dozens of nanoseconds. The expensive parts are:

  • crossing the user/kernel boundary (TLB misses and barriers)
  • the indirect costs: cold TLB, cold caches, mispredicted branches (Concept 11)
  • running the scheduler itself (schedule() is non-trivial)

Measure a real context switch with lmbench lat_ctx; expect 1-10 µs on a modern server. Pure register save/restore would be <100 ns.
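
If lmbench is not at hand, a pipe ping-pong between two processes gives a rough feel for the same number. The sketch below assumes Linux and Python 3; pingpong_round_trip_us is a hypothetical helper, and interpreter and syscall overhead are included in the result, so treat it as an upper bound rather than a lat_ctx replacement. Each round trip needs at least two switches when both processes share one CPU.

```python
import os
import time


def pingpong_round_trip_us(iterations=10_000):
    """Average microseconds for one parent -> child -> parent round trip."""
    # Pin to one CPU (Linux-only call) so the two processes must take turns.
    original = None
    if hasattr(os, "sched_setaffinity"):
        try:
            original = os.sched_getaffinity(0)
            os.sched_setaffinity(0, {min(original)})
        except OSError:
            original = None

    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: echo one byte back per byte received, until EOF.
        os.close(to_child_w)
        os.close(to_parent_r)
        while os.read(to_child_r, 1):
            os.write(to_parent_w, b"x")
        os._exit(0)

    os.close(to_child_r)
    os.close(to_parent_w)
    start = time.perf_counter()
    for _ in range(iterations):
        os.write(to_child_w, b"x")   # wakes the child
        os.read(to_parent_r, 1)      # blocks until the child replies
    elapsed = time.perf_counter() - start

    os.close(to_child_w)             # EOF ends the child's loop
    os.close(to_parent_r)
    os.waitpid(pid, 0)
    if original is not None:
        os.sched_setaffinity(0, original)
    return elapsed / iterations * 1e6


if __name__ == "__main__":
    print(f"{pingpong_round_trip_us():.2f} us per round trip")
```

Halving the result gives a crude per-switch estimate that still includes the read and write syscalls.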

How To Use It

Walk the switch step by step when debugging "why is this latency so high":

  1. What triggered the switch (timer, wake-up, syscall)? -> visible in perf sched output.
  2. Does it cross address spaces (process->process) or stay within one (thread->thread)? -> TLB flush or not.
  3. What ran in between the two observations of the target task? -> perf sched timehist.

Check Yourself

  1. Name the four common triggers for a context switch.
  2. Why does switching between two threads of the same process skip the CR3 reload?
  3. Why is iret (or its equivalent) the last instruction in the path?

Mini Drill or Application

On your Linux machine:

  1. perf bench sched pipe -- record the reported time per operation (usecs/op). That approximates the round-trip switch cost: two switches plus the pipe read and write syscalls.
  2. perf sched record sleep 5 and then perf sched latency -- read off the max scheduling latency for perf itself.
  3. Write one paragraph explaining what distinguishes "wall time to finish" from "time spent actually running" for the traced program.
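
For the third item, the same distinction can be reproduced inside a single process. This is a small sketch using only the Python standard library; wall_and_cpu_seconds is a hypothetical helper name. time.perf_counter() measures elapsed wall time, while time.process_time() counts only CPU time charged to the process, so time spent blocked or switched out appears in the first but not the second.

```python
import time


def wall_and_cpu_seconds(work):
    """Return (wall_seconds, cpu_seconds) spent while running work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    return (time.perf_counter() - wall_start,
            time.process_time() - cpu_start)


if __name__ == "__main__":
    # Sleeping keeps the process off the CPU, so wall time far exceeds CPU time.
    wall, cpu = wall_and_cpu_seconds(lambda: time.sleep(0.2))
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s off-CPU={wall - cpu:.3f}s")
```

The gap between the two numbers is time the task spent not running, which is the quantity perf sched latency breaks down per task.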

Read This Only If Stuck