Pipelining and Optimization Clinic

Retrieval Prompts

Name the five stages of the classic pipeline.
Define latency and throughput for a single instruction.
State the three categories of hazard.
Explain what forwarding (bypassing) is and which hazard type it resolves.
Define branch misprediction penalty.

Compare and Distinguish

Separate these pairs clearly:

latency vs throughput
stall (bubble) vs forwarding
branch taken vs branch correctly predicted
in-order vs out-of-order execution
scalar vs SIMD reduction

Common Mistake Check

For each statement, identify the error:

"Pipelining halves cycle time."
"A taken branch is a performance cost."
"Forwarding eliminates all data hazards."
"Out-of-order execution means programs are no longer deterministic."
"-O3 auto-vectorizes any loop."

Hazard Analysis Sheet

For each instruction stream (assume classic 5-stage pipeline with EX->EX and MEM->EX forwarding), identify stalls:

Stream A

lw   t0, 0(a0)
add  t1, t0, t2
sub  t3, t1, t4

Stream B

add  t1, t0, t2
mul  t3, t1, t4
sw   t3, 0(a0)
lw   t5, 4(a0)

Stream C

beq  t0, zero, label
add  t1, t2, t3
label:
sub  t4, t5, t6

For each stream: identify the hazard type, count bubbles, and propose a source-level or compiler reordering that would reduce them.

Microbenchmark: Accumulator Unrolling

Write a dot-product kernel and compare:

// Version 1: one accumulator (long dependency chain)
double dot1(const double *a, const double *b, int n) {
    double s = 0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

// Version 2: four accumulators
double dot4(const double *a, const double *b, int n) {
    double s0=0, s1=0, s2=0, s3=0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i+0] * b[i+0];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; ++i) s += a[i] * b[i];
    return s;
}

Benchmark both for n = 1,048,576 with arrays that fit in L1, L2, and DRAM respectively. Report GFLOPS for each.

Target outcomes:

for the L1 case, dot4 is 2x-4x faster because the FMA latency no longer bottlenecks the loop
for the DRAM case, both are bandwidth-bound and similar

Branch Predictability Drill

Write a loop if (a[i] > threshold) count++; and compare:

a is sorted ascending.
a is random.
a is random but count is summed branchlessly: count += (a[i] > threshold);

Measure time and perf stat -e branch-misses.

Target outcomes:

sorted version runs 3x-6x faster than random
branchless version matches or beats the sorted case on random data

Static Analysis

Pick one of your inner loops from the microbenchmarks and analyze it with llvm-mca or uiCA online. Identify the bottleneck port or dependency.

Evidence Check

This page is complete only if you produced measured numbers (not estimates) for at least two of the three microbenchmarks and can explain the dominant hazard or bound for each.

Retrieval Prompts​

Compare and Distinguish​

Common Mistake Check​

Hazard Analysis Sheet​

Microbenchmark: Accumulator Unrolling​

Branch Predictability Drill​

Static Analysis​

Evidence Check​