The Classic Five-Stage Pipeline

What This Concept Is

A pipelined CPU overlaps the work of many instructions so that each stage of the datapath is doing useful work every cycle. The canonical decomposition taught in the MIPS/RISC-V tradition has five stages:

IF -- instruction fetch (read the instruction at PC from memory)
ID -- instruction decode (read registers, produce control signals)
EX -- execute (ALU op, address computation)
MEM -- memory access (for loads and stores)
WB -- write-back (write the result into the register file)

At steady state, five instructions are in flight at once -- one per stage -- and the throughput is one instruction per cycle (CPI = 1), even though each individual instruction takes five cycles from fetch to retire.

Think of it as a factory assembly line. Each station does the same thing on a different car. The cars get built faster not because any single step got faster, but because they are built in parallel.

Why It Matters Here

Virtually every modern CPU is pipelined, usually with more than five stages (high-performance cores often have roughly 14-20). The five-stage diagram is still the model that textbooks and courses use to explain what the core is doing, and its vocabulary (stalls, bubbles, flushes) carries over to reading performance counters. If you cannot draw it, you cannot reason about:

why a load-use dependency costs one stall cycle even with forwarding
why a mispredicted branch throws away 10-20 cycles in a deeper pipeline
why the compiler cares about instruction ordering

Concrete Example

Three independent instructions through a 5-stage pipeline:

Cycle:     1     2     3     4     5     6     7
add  r1:  IF    ID    EX    MEM   WB
sub  r2:        IF    ID    EX    MEM   WB
or   r3:              IF    ID    EX    MEM   WB

Each instruction takes 5 cycles (latency) but the pipeline retires one per cycle (throughput). Over long runs, sustained CPI approaches 1.
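The latency/throughput split above can be checked with a little arithmetic. The following is a minimal sketch, assuming an ideal pipeline with no stalls; the function names are hypothetical and chosen only for illustration.

```python
def ideal_total_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Cycles needed to finish n instructions in an ideal, stall-free pipeline."""
    if n_instructions <= 0:
        return 0
    # The first instruction needs n_stages cycles to drain through the pipe;
    # every later instruction completes one cycle after its predecessor.
    return n_stages + (n_instructions - 1)


def average_cpi(n_instructions: int, n_stages: int = 5) -> float:
    """Average cycles per instruction over the whole run."""
    return ideal_total_cycles(n_instructions, n_stages) / n_instructions


if __name__ == "__main__":
    for n in (3, 100, 1_000_000):
        print(n, ideal_total_cycles(n), average_cpi(n))
```

For the three-instruction diagram this gives 7 cycles, and the average CPI shrinks toward 1 as the run gets longer because the fixed fill cost is spread over more instructions.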

Contrast with a load-use dependency:

Cycle:     1     2     3     4     5     6     7
lw  r1:   IF    ID    EX    MEM   WB
add r1:         IF    ID    ??    EX    MEM   WB

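add wants r1 at the start of EX (cycle 4), but lw does not have the loaded value until the end of MEM (also cycle 4). EX->EX forwarding covers back-to-back ALU-to-ALU dependencies with no stall, but it cannot help a load->ALU pair, because the data does not exist yet. The hardware inserts a one-cycle bubble, and MEM->EX forwarding then delivers r1 to add's EX in cycle 5. The compiler fills this slot with an independent instruction when it can.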
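The load-use rule can be written down as a small check. This is a rough sketch, not a full pipeline model: it assumes full forwarding, treats every source register as needed in EX (so a store reading a just-loaded register also counts as a stall), and ignores branches. The `Instr` type, the function names, and the operand registers in the example are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                  # mnemonic, e.g. "lw" or "add"
    dest: Optional[str]      # register written, or None (e.g. for sw)
    srcs: Tuple[str, ...]    # registers read


def load_use_stalls(stream: List[Instr]) -> List[int]:
    """For each instruction, the number of bubbles (0 or 1) inserted before its EX."""
    stalls = []
    for i, ins in enumerate(stream):
        prev = stream[i - 1] if i > 0 else None
        # Hazard: the immediately preceding instruction is a load whose
        # destination this instruction reads.
        hazard = prev is not None and prev.op == "lw" and prev.dest in ins.srcs
        stalls.append(1 if hazard else 0)
    return stalls


def cycles_with_stalls(stream: List[Instr], n_stages: int = 5) -> int:
    """Total cycles for the stream: ideal pipeline time plus load-use bubbles."""
    if not stream:
        return 0
    return n_stages + len(stream) - 1 + sum(load_use_stalls(stream))


if __name__ == "__main__":
    lw = Instr("lw", "r1", ("r2",))
    add = Instr("add", "r3", ("r1", "r4"))
    filler = Instr("or", "r5", ("r6", "r7"))   # independent of r1
    print(load_use_stalls([lw, add]), cycles_with_stalls([lw, add]))
    print(load_use_stalls([lw, filler, add]), cycles_with_stalls([lw, filler, add]))
```

Under these assumptions the lw/add pair takes 7 cycles, matching the diagram, and placing an independent instruction between them also takes 7 cycles: the bubble has been replaced by useful work.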

Common Confusion / Misconception

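"Pipelining halves the cycle time." No. The clock period is set by the slowest stage plus the pipeline-register delay between stages, so splitting the datapath into N stages shortens the cycle by at most a factor of N, and by less when the stages are unbalanced. That shorter cycle is what lets you raise the clock frequency. The real win is instruction-level parallelism: when there are no stalls, every stage has work to do each cycle.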
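A short calculation makes the point concrete. The stage delays and register overhead below are illustrative numbers, not measurements of any real core, and the function names are hypothetical.

```python
from typing import Sequence


def pipelined_period_ps(stage_delays_ps: Sequence[int], register_overhead_ps: int) -> int:
    """Clock period of the pipelined design: slowest stage plus register delay."""
    return max(stage_delays_ps) + register_overhead_ps


def speedup_over_single_cycle(stage_delays_ps: Sequence[int], register_overhead_ps: int) -> float:
    """Ideal throughput gain versus a single-cycle design doing all stages in one period."""
    single_cycle_period = sum(stage_delays_ps)
    return single_cycle_period / pipelined_period_ps(stage_delays_ps, register_overhead_ps)


if __name__ == "__main__":
    # Assumed delays for IF, ID, EX, MEM, WB in picoseconds (illustrative only).
    delays = [200, 100, 200, 200, 100]
    overhead = 20
    print(pipelined_period_ps(delays, overhead))
    print(speedup_over_single_cycle(delays, overhead))
```

With these assumed numbers the pipelined period is 220 ps against 800 ps for the single-cycle design, a speedup of about 3.6 rather than 5, because the stages are unbalanced and the registers add delay.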

Another mistake: counting an instruction's latency as if it were its cost to throughput. In a well-scheduled loop, an add costs one cycle of throughput even though it spends five cycles in the pipeline from fetch to write-back; they are different numbers and both matter.

How To Use It

When reading a short instruction stream, lay out the pipeline stages as above and check for conflicts: load-use, RAW dependencies, branch resolution.
Think of compiler scheduling as filling bubbles with useful work. If you see the compiler reorder apparently independent statements, this is why.
When tuning a loop, aim for no stalls on the critical path first; optimizing rarely-taken paths is a waste.
Remember: real cores have more stages and can issue multiple instructions per cycle (superscalar). The five-stage model is a teaching tool, not a literal description of 2025 hardware -- but the reasoning transfers.

Check Yourself

What are the five stages of the classic pipeline, and what happens in each?
What is the difference between latency and throughput for a single instruction?
Why does a load-use dependency cost a bubble even with forwarding?
How does pipelining relate to higher clock frequency?

Mini Drill or Application

Draw the pipeline diagram for the following stream, assuming ideal forwarding from EX->EX and MEM->EX, and note any stalls:

lw   t0, 0(a0)
add  t1, t0, t2
sub  t3, t1, t4
sw   t3, 8(a0)
lw   t5, 16(a0)
add  t6, t5, t3

For each stall, explain whether forwarding helps or the bubble is unavoidable.

Then rewrite the stream to eliminate as many stalls as possible by reordering independent instructions. This is exactly what an optimizing compiler does, and doing it by hand a few times teaches you why -O2 helps even on straight-line code.

Why Modern Pipelines Are Deeper

Going from five stages to fourteen or more serves three goals:

Higher clock frequency -- each stage does less work, so the cycle can be shorter.
Lower per-stage complexity -- a deep decode pipeline handles x86 variable-length instructions without dragging down the whole core.
More room for speculation -- more stages means more in-flight instructions, which hides more memory latency.

The cost is paid in misprediction penalties and in the design effort of forwarding, hazard logic, and speculation machinery. The five-stage picture is the simplest version of the same fundamental ideas.
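The misprediction cost can be estimated with a simple first-order model: effective CPI is the ideal CPI plus the average flush cycles per instruction. The sketch below is an approximation under stated assumptions; the branch fraction, misprediction rate, and the two penalty values are illustrative, and the function name is hypothetical.

```python
def effective_cpi(
    branch_fraction: float,
    mispredict_rate: float,
    penalty_cycles: int,
    base_cpi: float = 1.0,
) -> float:
    """First-order CPI estimate that only accounts for branch misprediction flushes."""
    # Each mispredicted branch throws away penalty_cycles of work.
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


if __name__ == "__main__":
    # Assumed workload: 20% branches, 5% of them mispredicted (illustrative).
    shallow = effective_cpi(0.20, 0.05, penalty_cycles=2)    # assumed short-pipeline penalty
    deep = effective_cpi(0.20, 0.05, penalty_cycles=15)      # assumed deep-pipeline penalty
    print(shallow, deep)
```

With the same assumed branch behavior, the deeper pipeline loses roughly 0.15 cycles per instruction to flushes against roughly 0.02 for the shallow one, which is why deep designs invest so heavily in branch prediction.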

Read This Only If Stuck​