How a Context Switch Actually Works
What This Concept Is
A context switch is the operation the kernel performs to stop one thread and start another on the same CPU. At the instruction level it is "save the old context into its PCB, load the new context from the new PCB, return to user mode at the new PC."
The trigger is always one of:
- timer interrupt fired by the hardware at the end of a quantum
- voluntary yield via a blocking syscall (read, wait, futex_wait, etc.)
- higher-priority task becomes runnable (e.g., wake-up from I/O)
- explicit preemption request (sched_yield, kernel preempt point)
The save/restore mechanism is the same in all cases; only the reason for entering the kernel differs.
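To make the "same mechanism, different reason" point concrete, here is a toy model in Python. It is a sketch only: the PCB and CPU classes, the Trigger enum, and the two-register "context" are hypothetical stand-ins, not kernel structures. The trigger is passed in but never consulted by the save/restore steps.

```python
from dataclasses import dataclass, field
from enum import Enum


class Trigger(Enum):
    # Hypothetical labels for the four triggers listed above.
    TIMER = "timer interrupt"
    BLOCKING_SYSCALL = "blocking syscall"
    WAKEUP = "higher-priority task runnable"
    PREEMPT = "explicit preemption request"


@dataclass
class PCB:
    """Toy process control block: a name and a saved register file."""
    name: str
    saved: dict = field(default_factory=dict)


class CPU:
    """Toy CPU: a register file plus a pointer to the running task."""

    def __init__(self, first: PCB):
        self.regs = dict(first.saved)
        self.current = first

    def context_switch(self, nxt: PCB, reason: Trigger) -> None:
        # 'reason' only records why the kernel was entered; the two
        # steps below do not depend on it.
        self.current.saved = dict(self.regs)  # save old context into its PCB
        self.regs = dict(nxt.saved)           # load new context from its PCB
        self.current = nxt


a = PCB("A", {"pc": 0x1000, "sp": 0x7000})
b = PCB("B", {"pc": 0x2000, "sp": 0x8000})
cpu = CPU(a)
cpu.regs["pc"] = 0x1004                          # A executes one instruction
cpu.context_switch(b, Trigger.TIMER)             # quantum expires
cpu.regs["pc"] = 0x2008                          # B runs for a while
cpu.context_switch(a, Trigger.BLOCKING_SYSCALL)  # B blocks
print(hex(cpu.regs["pc"]))                       # prints 0x1004: A resumes where it stopped
```

Swapping Trigger.TIMER for any other member leaves the result unchanged, which is the whole point of the model.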
Why It Matters Here
"Context switch" is the glue behind every multi-tasking behavior. Without seeing the instruction-level dance, you cannot reason about its cost (Concept 11), about what threads share (Concept 12), or about why TLB flushes matter (Concept 11 again).
Concrete Example
Simplified x86-64 switch from process A to B, triggered by a timer interrupt while A is in user mode:
1. Hardware trap entry. The CPU receives the interrupt, switches to kernel mode, and pushes A's user SS, RSP, RFLAGS, CS, RIP onto A's kernel stack. The CPU is now executing the timer interrupt handler.
2. Save A's GPRs. The handler's prologue pushes the rest of A's general-purpose registers (rax, rbx, ..., r15) onto A's kernel stack. Floating-point state is saved separately, via FXSAVE/XSAVE, into A's task_struct.
3. Scheduler decides. schedule() looks at the run queue and picks B.
4. Switch stacks. The kernel swaps CR3 (page-table base) only if the address space changes -- on Linux this is true when switching between processes, not between threads of the same process. It then loads B's kernel stack pointer. This is the narrow moment that is the context switch.
5. Restore B's GPRs. Pop B's saved registers from B's kernel stack and reload B's floating-point state from B's task_struct.
6. iret. Returns to user mode, popping B's saved SS, RSP, RFLAGS, CS, RIP. B resumes at the exact instruction where it was last preempted.
On Linux, switch_to() in arch/x86/include/asm/switch_to.h is the macro around step 4.
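The stack-switch step can be modelled in a few lines. This is a sketch under loud assumptions: Task, Core, and switch_stacks are hypothetical names, mm is just an integer standing in for an address space, and nothing here mirrors the real switch_to() code. It only shows the rule that CR3 is reloaded when the address space changes and left alone otherwise.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a kernel task; not the real task_struct."""
    name: str
    mm: int         # address-space id; threads of one process share it
    kernel_sp: int  # saved kernel stack pointer


class Core:
    def __init__(self, current: Task):
        self.current = current
        self.cr3 = current.mm
        self.sp = current.kernel_sp
        self.cr3_reloads = 0

    def switch_stacks(self, nxt: Task) -> None:
        self.current.kernel_sp = self.sp  # remember the old task's stack position
        if nxt.mm != self.cr3:            # does the address space change?
            self.cr3 = nxt.mm             # reload the page-table base
            self.cr3_reloads += 1
        self.sp = nxt.kernel_sp           # from here on we run on nxt's stack
        self.current = nxt


a1 = Task("A/thread1", mm=1, kernel_sp=0xA100)
a2 = Task("A/thread2", mm=1, kernel_sp=0xA200)
b = Task("B", mm=2, kernel_sp=0xB100)

core = Core(a1)
core.switch_stacks(a2)   # same process: no CR3 reload
core.switch_stacks(b)    # different process: reload
core.switch_stacks(a1)   # back to A: reload
print(core.cr3_reloads)  # prints 2
```

Three switches, two reloads: the thread-to-thread switch is the one that stays within a single address space.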
Common Confusion / Misconception
"A context switch is just saving and restoring registers." The register dance is fast -- dozens of nanoseconds. The expensive parts are:
- crossing the user/kernel boundary (TLB hits and barriers)
- the indirect costs: cold TLB, cold caches, mispredicted branches (Concept 11)
- running the scheduler itself (schedule() is non-trivial)
Measure a real context switch with lmbench lat_ctx; expect 1-10 µs on a modern server. Pure register save/restore would be <100 ns.
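If lmbench is not at hand, a rough pipe ping-pong gives a feel for the order of magnitude. The sketch below is not lat_ctx and should not be compared to it directly: pipe_round_trip_us is a hypothetical helper, Python's interpreter and syscall-wrapper overhead inflate the number, and on a multi-core machine the two processes may land on different CPUs, so part of what is timed is a cross-CPU wake-up rather than a switch on one core. It assumes a Unix-like system with os.fork.

```python
import os
import time


def pipe_round_trip_us(iterations: int = 2000) -> float:
    """Average round-trip time in microseconds for a 1-byte pipe ping-pong."""
    p2c_r, p2c_w = os.pipe()  # parent -> child
    c2p_r, c2p_w = os.pipe()  # child -> parent
    pid = os.fork()
    if pid == 0:
        # Child: echo each byte back, then exit without running parent cleanup.
        os.close(p2c_w)
        os.close(c2p_r)
        for _ in range(iterations):
            os.write(c2p_w, os.read(p2c_r, 1))
        os._exit(0)
    os.close(p2c_r)
    os.close(c2p_w)
    start = time.perf_counter()
    for _ in range(iterations):
        os.write(p2c_w, b"x")  # wake the child ...
        os.read(c2p_r, 1)      # ... and block until it answers
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    os.close(p2c_w)
    os.close(c2p_r)
    return elapsed / iterations * 1e6


if __name__ == "__main__":
    print(f"{pipe_round_trip_us():.1f} us per round trip")
```

Each round trip contains at least two blocking reads, so treat the figure as an upper bound on a pair of switches plus pipe and interpreter overhead.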
How To Use It
Walk the switch step by step when debugging "why is this latency so high":
- What triggered the switch (timer, wake-up, syscall)? -> shows on perf sched.
- Does it cross address spaces (process->process) or stay within one (thread->thread)? -> TLB flush or not.
- What ran in between the two observations of the target task? -> perf sched timehist.
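Before reaching for perf, a process can get a coarse answer to the first question from its own resource counters. The sketch below uses Python's resource.getrusage, whose ru_nvcsw and ru_nivcsw fields count voluntary and involuntary context switches. switch_counts is a hypothetical helper, and the exact numbers depend on the machine and its load (some platforms may report zeros), so no expected output is shown.

```python
import resource
import time


def switch_counts():
    """Return (voluntary, involuntary) context-switch counts for this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_nvcsw, ru.ru_nivcsw


v0, i0 = switch_counts()
for _ in range(20):
    time.sleep(0.001)       # blocking call: the process gives up the CPU
v1, i1 = switch_counts()

total = 0
for n in range(2_000_000):  # CPU-bound loop: only preemption can switch us out
    total += n
v2, i2 = switch_counts()

print("sleep phase: voluntary +%d, involuntary +%d" % (v1 - v0, i1 - i0))
print("busy phase:  voluntary +%d, involuntary +%d" % (v2 - v1, i2 - i1))
```

On a typical Linux system the sleep phase is dominated by voluntary switches and the busy phase by involuntary ones, if any occur at all.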
Check Yourself
- Name the four common triggers for a context switch.
- Why does switching between two threads of the same process skip the CR3 reload?
- Why is iret (or its equivalent) the last instruction in the path?
Mini Drill or Application
On your Linux machine:
- perf bench sched pipe -- record the reported time per iteration. That's the round-trip switch cost.
- perf sched record sleep 5 and then perf sched latency -- read off the max scheduling latency for perf itself.
- Write one paragraph explaining what distinguishes "wall time to finish" from "time spent actually running" for the traced program.
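The wall-time versus running-time distinction can also be seen from inside a program. In this sketch, measure, mostly_waiting, and mostly_running are hypothetical helpers. time.perf_counter reads a wall clock, while time.process_time counts only CPU time charged to the process, so time spent blocked or switched out shows up in the first and not the second.

```python
import time


def measure(work):
    """Return (wall_seconds, cpu_seconds) spent in work()."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    work()
    return time.perf_counter() - wall0, time.process_time() - cpu0


def mostly_waiting():
    time.sleep(0.2)                     # off-CPU: blocked in the kernel


def mostly_running():
    sum(i * i for i in range(300_000))  # on-CPU: pure computation


for fn in (mostly_waiting, mostly_running):
    wall, cpu = measure(fn)
    print(f"{fn.__name__}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

For mostly_waiting the wall time is about 0.2 s while the CPU time stays near zero. For mostly_running the two are close unless the machine is busy enough to preempt it.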