Hazards: Data, Control, Structural -- and How to Resolve Them

What This Concept Is

A hazard is any condition that prevents the next instruction in the stream from executing in its scheduled pipeline cycle. There are three kinds:

Data hazard -- an instruction needs a value that another in-flight instruction has not yet produced. Subtypes: RAW (read-after-write, the common case), plus WAR (write-after-read) and WAW (write-after-write), which matter mainly on out-of-order machines.
Control hazard -- the pipeline does not yet know where the next instruction comes from because a branch or jump is still being resolved.
Structural hazard -- two in-flight instructions want the same hardware resource (e.g. a single memory port) in the same cycle.
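
The three data-hazard subtypes can be told apart mechanically from register names alone. The following is a minimal sketch, assuming a hypothetical `Instr` record with one destination register and a tuple of source registers (neither the record nor `classify` comes from any real toolchain); it ignores special cases such as a hard-wired zero register.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    """Hypothetical minimal instruction record: one destination, some sources."""
    dest: Optional[str]
    srcs: tuple

def classify(first: Instr, second: Instr) -> set:
    """Return the data-hazard types `second` has against the earlier `first`."""
    hazards = set()
    if first.dest is not None and first.dest in second.srcs:
        hazards.add("RAW")   # second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        hazards.add("WAR")   # second overwrites what first still reads
    if first.dest is not None and first.dest == second.dest:
        hazards.add("WAW")   # both write the same register
    return hazards

lw = Instr(dest="t0", srcs=("a0",))          # lw  t0, 0(a0)
add = Instr(dest="t1", srcs=("t0", "t2"))    # add t1, t0, t2
print(classify(lw, add))   # {'RAW'}
```

Whether a detected dependence actually costs cycles depends on the pipeline: an in-order core with forwarding pays only for some RAW cases, while WAR and WAW become relevant once instructions can complete out of order.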

Hardware resolves these with three main techniques:

Forwarding (bypassing) -- route the result from a later pipeline stage directly to an earlier stage so the dependent instruction does not wait for write-back. Eliminates most ALU-to-ALU stalls.
Stalling (bubble insertion) -- hold a dependent instruction in place for one or more cycles. The unavoidable cost when forwarding is not enough (classic case: load-use).
Branch prediction + speculation -- guess the branch direction and target; squash the speculated work if the guess was wrong.

The compiler's job complements this: schedule independent instructions into stall slots and structure code so predictors have an easy time.

Why It Matters Here

Hazards are the reason the ideal CPI of 1 turns into a real-world CPI above 1. Knowing the three categories gives you a checklist when tuning a hot loop:

Are there load-use stalls the compiler could hide with unrolling or prefetch?
Is there a branch the predictor cannot learn (data-dependent and non-correlated)?
Is there a port conflict (two stores per cycle when the core has one store port)?

Each question maps to a specific fix at the source or compiler-flag level.

Concrete Example

Load-use stall (data hazard, not fully fixable by forwarding):

    lw   t0, 0(a0)      # IF  ID  EX  MEM WB
    add  t1, t0, t2     #     IF  ID  **  EX  MEM WB      ← one-cycle bubble
    add  t4, t5, t6     #         IF  **  ID  EX  MEM WB  ← independent; could fill the slot after lw

Compiler tip: reorder so an independent instruction sits between lw and the dependent add, eliminating the bubble.
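
The effect of that reordering can be checked with a toy model. This is a sketch only, assuming a classic in-order 5-stage pipeline with full forwarding, where the sole stall is a load whose result is read by the very next instruction; the tuple layout `(opcode, dest, srcs)` and both function names are made up for illustration.

```python
def load_use_bubbles(program):
    """Count one-cycle load-use bubbles.

    Assumed model: in-order 5-stage pipeline with full forwarding, so the
    only stall is a load immediately followed by a consumer of its result.
    Each entry of `program` is (opcode, dest, srcs).
    """
    bubbles = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "lw" and prev_dest in cur[2]:
            bubbles += 1
    return bubbles

def total_cycles(program):
    """Cycles until the last instruction leaves WB: fill (4) + one per instruction + bubbles."""
    return len(program) + 4 + load_use_bubbles(program)

original = [
    ("lw",  "t0", ("a0",)),
    ("add", "t1", ("t0", "t2")),   # reads t0 right after the load
    ("add", "t4", ("t5", "t6")),   # independent of both instructions above
]
# Move the independent add between the load and its consumer.
scheduled = [original[0], original[2], original[1]]

print(load_use_bubbles(original), load_use_bubbles(scheduled))  # 1 0
print(total_cycles(original), total_cycles(scheduled))          # 8 7
```

The swap is legal only because the moved instruction shares no registers with the other two; a real scheduler has to verify that before reordering.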

Control hazard (conditional branch):

    beq  t0, t1, .Ltarget   # predict taken/not-taken?
    add  t2, t3, t4         # speculatively fetched
    ...

If the core predicts not taken (so it fetches the fall-through add above) and the branch is actually taken, every instruction fetched speculatively after the branch must be flushed. In a classic 5-stage pipeline that wastes roughly 1 to 3 cycles, depending on the stage in which the branch resolves; in a 20-stage pipeline it can approach 20. That is the branch misprediction penalty.
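
To see how the penalty feeds into overall throughput, a common back-of-the-envelope model adds the expected misprediction cost per instruction to the base CPI. The sketch below uses that model; the function name and all numbers (20% branches, 5% mispredicted, penalties of 3 and 20 cycles) are illustrative assumptions, not measurements of any real core.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Base CPI plus the average misprediction cost paid per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Assumed workload: 1 in 5 instructions is a branch, 5% of branches mispredict.
shallow = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=3)    # short pipeline (assumed penalty)
deep = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)      # deep pipeline (assumed penalty)

print(f"{shallow:.2f} {deep:.2f}")  # 1.03 1.20
```

With identical code and an identical predictor, the deeper pipeline loses far more to the same mispredictions, which is why deep designs invest heavily in prediction accuracy.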

Structural hazard:

In a single-memory-port pipeline, an lw in MEM at cycle 4 conflicts with an IF at cycle 4 (both want memory). Separate instruction and data caches (Harvard architecture at L1) eliminate this for normal code.
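
The cycle arithmetic behind that conflict can be sketched as follows, assuming one instruction enters the pipeline per cycle with no stalls, so instruction i (counting from 0) is in IF at cycle i+1 and in MEM at cycle i+4. The function name and the opcode list are hypothetical.

```python
def port_conflicts(opcodes, memory_ops=("lw", "sw")):
    """List (cycle, mem_index, fetch_index) clashes on a single shared memory port.

    Assumed timing for instruction i (0-based), with no stalls:
    IF = i+1, ID = i+2, EX = i+3, MEM = i+4.
    """
    conflicts = []
    for i, op in enumerate(opcodes):
        mem_cycle = i + 4
        fetch_index = mem_cycle - 1   # the instruction whose IF falls in mem_cycle
        if op in memory_ops and fetch_index < len(opcodes):
            conflicts.append((mem_cycle, i, fetch_index))
    return conflicts

# The lw at index 0 is in MEM at cycle 4, exactly when index 3 is being fetched.
print(port_conflicts(["lw", "add", "sub", "or"]))  # [(4, 0, 3)]
```

With split instruction and data caches the two accesses go to different ports, so the list of conflicts is empty by construction.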

Common Confusion / Misconception

"Forwarding eliminates all data hazards." No -- it eliminates most ALU-to-ALU RAW stalls by routing the result as soon as the producing stage completes. But a load-use dependency still needs a one-cycle bubble: the load data is available only at the end of MEM, one stage after the EX stage in which a back-to-back consumer needs it, and forwarding cannot send a value backward in time. Worse, on deeper pipelines or with L1 misses, the bubble can be many cycles.

Another trap: confusing "branch taken" with "branch correctly predicted." A branch that is taken 100% of the time can be predicted almost perfectly and costs next to nothing. A branch taken 50% of the time with no structure is the expensive one.
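
The difference between "taken" and "predictable" shows up clearly in a toy predictor. Below is a minimal sketch of a single 2-bit saturating counter, a textbook model chosen for illustration; real cores use far more elaborate history-based predictors. The function name, the starting state and the seeded random sequence are all assumptions of this example.

```python
import random

def mispredicts(outcomes, state=1):
    """Count misses of one 2-bit saturating counter.

    States 0-1 predict not taken, states 2-3 predict taken.
    `state=1` (weakly not taken) is an arbitrary starting point.
    """
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Move toward 3 on taken, toward 0 on not taken, saturating at the ends.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

always_taken = [True] * 1000
rng = random.Random(0)   # fixed seed so the run is repeatable
coin_flip = [rng.random() < 0.5 for _ in range(1000)]

print(mispredicts(always_taken))   # 1
print(mispredicts(coin_flip))      # roughly half of the 1000 branches
```

The always-taken branch costs a single miss while the counter warms up; the structureless branch is mispredicted about half the time no matter how long the predictor trains.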

How To Use It

When a profile shows a high cycles-per-instruction figure, run perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend,branch-misses,cache-misses to identify the hazard category.
Front-end stalls -> your code is I-cache- or branch-bound. Consider reducing code size or restructuring branches.
Back-end stalls -> classic memory or dependency bottleneck. Look for load-use chains, long-latency FP, or port conflicts.
High branch-misses -> try to make branches more predictable (sort inputs, branchless code, table lookup).
Use llvm-mca or uiCA to see the static schedule of a short snippet on your specific microarchitecture.

Check Yourself

Name the three hazard categories and give an example of each.
Why does forwarding solve ALU-to-ALU RAW hazards but not load-to-ALU ones?
What is the difference between a branch being taken and being correctly predicted?
How does a deep pipeline change the cost of a branch mispredict?

Mini Drill or Application

Classify each of the following as a data, control, or structural hazard, and suggest a fix:

lw t0, 0(a0); add t1, t0, t2 in back-to-back cycles
beq t0, t1, label where the branch outcome is nearly random
An instruction cache with a single 32-bit fetch port that cannot serve two instruction fetches per cycle on a 2-wide issue core
Two independent loops whose accesses map to the same cache set with different strides, one evicting the other's lines

For each, name the hardware or software technique that reduces its cost.

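Now instrument a small program: a loop that sums only the elements above a fixed threshold, run over a shuffled int array vs. the same loop over a sorted copy. (A plain linear search is a poor test, because its "found it?" branch is almost always not taken regardless of order.) The sorted version should show dramatically fewer branch-misses in perf stat because the outcomes form two long runs that the predictor learns almost immediately, while the shuffled outcomes are close to random. Check that the compiler has not replaced the branch with a conditional move or vector code, which hides the effect. This is one of the cleanest demonstrations of control hazards in practice.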
