Pipelining and Optimization Clinic
Retrieval Prompts
- Name the five stages of the classic pipeline.
- Define latency and throughput for a single instruction.
- State the three categories of hazard.
- Explain what forwarding (bypassing) is and which hazard type it resolves.
- Define branch misprediction penalty.
Compare and Distinguish
Separate these pairs clearly:
- latency vs throughput
- stall (bubble) vs forwarding
- branch taken vs branch correctly predicted
- in-order vs out-of-order execution
- scalar vs SIMD reduction
Common Mistake Check
For each statement, identify the error:
- "Pipelining halves cycle time."
- "A taken branch is a performance cost."
- "Forwarding eliminates all data hazards."
- "Out-of-order execution means programs are no longer deterministic."
- "
-O3auto-vectorizes any loop."
Hazard Analysis Sheet
For each instruction stream (assume classic 5-stage pipeline with EX->EX and MEM->EX forwarding), identify stalls:
Stream A
lw t0, 0(a0)
add t1, t0, t2
sub t3, t1, t4
Stream B
add t1, t0, t2
mul t3, t1, t4
sw t3, 0(a0)
lw t5, 4(a0)
Stream C
beq t0, zero, label
add t1, t2, t3
label:
sub t4, t5, t6
For each stream: identify the hazard type, count bubbles, and propose a source-level or compiler reordering that would reduce them.
Microbenchmark: Accumulator Unrolling
Write a dot-product kernel and compare:
// Version 1: one accumulator (long dependency chain)
double dot1(const double *a, const double *b, int n) {
double s = 0;
for (int i = 0; i < n; ++i) s += a[i] * b[i];
return s;
}
// Version 2: four accumulators
double dot4(const double *a, const double *b, int n) {
double s0=0, s1=0, s2=0, s3=0;
int i;
for (i = 0; i + 3 < n; i += 4) {
s0 += a[i+0] * b[i+0];
s1 += a[i+1] * b[i+1];
s2 += a[i+2] * b[i+2];
s3 += a[i+3] * b[i+3];
}
double s = s0 + s1 + s2 + s3;
for (; i < n; ++i) s += a[i] * b[i];
return s;
}
Benchmark both for n = 1,048,576 with arrays that fit in L1, L2, and DRAM respectively. Report GFLOPS for each.
Target outcomes:
- for the L1 case,
dot4is 2x-4x faster because the FMA latency no longer bottlenecks the loop - for the DRAM case, both are bandwidth-bound and similar
Branch Predictability Drill
Write a loop if (a[i] > threshold) count++; and compare:
ais sorted ascending.ais random.ais random butcountis summed branchlessly:count += (a[i] > threshold);
Measure time and perf stat -e branch-misses.
Target outcomes:
- sorted version runs 3x-6x faster than random
- branchless version matches or beats the sorted case on random data
Static Analysis
Pick one of your inner loops from the microbenchmarks and analyze it with llvm-mca or uiCA online. Identify the bottleneck port or dependency.
Evidence Check
This page is complete only if you produced measured numbers (not estimates) for at least two of the three microbenchmarks and can explain the dominant hazard or bound for each.