Superscalar, Out-of-Order, SIMD, and Speculation
What This Concept Is
Modern cores go beyond "one instruction per cycle" in four intersecting ways:
- Superscalar issue -- fetch, decode, and issue multiple instructions per cycle through multiple execution ports. A 2-wide, 4-wide, or 8-wide core can retire that many instructions per cycle when dependencies allow.
- Out-of-order execution -- the core reorders independent instructions around stalled ones via a reservation station / reorder buffer. Instructions retire in program order but execute as soon as their operands are ready.
- SIMD (single instruction, multiple data) -- one instruction operates on a vector of lanes. On x86_64: SSE (128-bit), AVX/AVX2 (256-bit), AVX-512 (512-bit). On ARM: NEON and SVE. A 256-bit FMA can do 4 doubles per cycle, vs 1 for scalar.
- Speculation -- branch prediction, indirect-call prediction, and memory-disambiguation prediction all let the core keep going past unresolved control and data, rolling back if the guess was wrong.
Together these turn a "simple" 5-stage pipeline into a machine that can sustain 3-4 IPC on well-tuned code and near zero IPC on pointer chases.
Why It Matters Here
This is a supporting concept because on such a core the cost model shifts from counting instructions to counting resources:
- total issue bandwidth (e.g. 4 µops/cycle)
- per-port throughput (e.g. one FP multiply per cycle on a specific port)
- retirement bandwidth
- the reorder buffer size (how much the core can speculate past a miss)
When a tight loop stops getting faster despite removing instructions, you are likely port-bound or ROB-bound, not instruction-bound.
Speculation is also the foundation under Spectre-style side channels -- a useful reason to understand it precisely.
Concrete Example
Consider a scalar vs SIMD dot product:
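// Scalar
double s = 0.0;
for (int i = 0; i < N; ++i) s += a[i] * b[i];
// AVX2 + FMA (manually, with intrinsics; needs <immintrin.h> and -mavx2 -mfma)
__m256d acc = _mm256_setzero_pd();
int i = 0;
for (; i + 4 <= N; i += 4) {              // stop before reading past the end
    __m256d va = _mm256_loadu_pd(a + i);  // unaligned load of 4 doubles
    __m256d vb = _mm256_loadu_pd(b + i);
    acc = _mm256_fmadd_pd(va, vb, acc);   // acc += va * vb in all 4 lanes
}
// horizontal reduce acc, then finish the leftover elements in scalar
double lanes[4];
_mm256_storeu_pd(lanes, acc);
double sv = lanes[0] + lanes[1] + lanes[2] + lanes[3];
for (; i < N; ++i) sv += a[i] * b[i];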
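On a core with one 256-bit FMA port, 4-wide issue, and a 4-cycle FMA latency, the scalar loop runs at roughly one FMA per 4 cycles (a latency-bound reduction), or about 0.5 flops per cycle. The vector loop processes 4 lanes per FMA and, with enough independent accumulators to cover the latency, can approach one 256-bit FMA per cycle, or 8 flops per cycle. That is roughly a 16x difference on the same algorithm, and more on cores with two FMA ports.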
Out-of-order in action:
ld x1, [a] # 100-cycle DRAM miss
add x2, x3, x4 # independent -- dispatches immediately
mul x5, x6, x7 # independent -- dispatches
add x8, x1, x9 # blocked on x1 (sits in ROB)
... # more independent work, also dispatches
The core keeps executing non-dependent work until the reorder buffer fills up. The dependent add waits without blocking everything behind it.
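As a back-of-the-envelope sketch of "until the reorder buffer fills up": if the front end keeps allocating a fixed number of µops per cycle behind a stalled load and nothing can retire, the ROB fills after about rob_entries / issue_width cycles. The function names and the numbers in the usage note are hypothetical and describe no specific core.

```c
#include <assert.h>

/* Toy model (hypothetical, not a real core): cycles of younger work the core
   can keep allocating behind a stalled load before the ROB is full.
   Assumes `issue_width` uops are allocated every cycle and nothing retires
   while the oldest load is still outstanding. */
int cycles_until_rob_full(int rob_entries, int issue_width) {
    assert(rob_entries > 0 && issue_width > 0);
    return rob_entries / issue_width;
}

/* Cycles of the miss that are NOT covered by other work, i.e. cycles where
   the core has nothing left to allocate and simply waits. */
int exposed_stall_cycles(int miss_latency, int rob_entries, int issue_width) {
    int covered = cycles_until_rob_full(rob_entries, issue_width);
    return miss_latency > covered ? miss_latency - covered : 0;
}
```

With an assumed 200-entry ROB and 4-wide allocation, cycles_until_rob_full(200, 4) is 50, so a 100-cycle miss still leaves about 50 cycles exposed in this model. Real cores are limited by other structures too (scheduler entries, load/store buffers), so treat this as an upper bound on what the ROB alone can hide.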
Common Confusion / Misconception
"Out-of-order means programs are not really deterministic." No. The architectural state visible to software is updated in program order at retire. The core may execute things in a different order internally, but anything a correct program observes matches the sequential semantics the ISA defines.
Another trap: "SIMD = automatic speedup if I enable -O3." Compilers auto-vectorize simple loops with predictable trip counts and access patterns. Anything with data-dependent branches, possible pointer aliasing, or reductions the compiler is not allowed to reorder (floating-point sums without -ffast-math, for example) often falls back to scalar. Reading the -O3 assembly output, or a vectorization report such as GCC's -fopt-info-vec-missed, tells you what was and was not vectorized.
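A minimal sketch of the contrast, assuming GCC or Clang: the first loop is the kind compilers usually vectorize once aliasing is ruled out, and the second has a data-dependent early exit that many compiler versions leave scalar. The function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i]. `restrict` promises the compiler that x and y never
   overlap, which removes the aliasing obstacle to vectorization. */
void axpy(float *restrict y, const float *restrict x, float a, size_t n) {
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Data-dependent early exit: the trip count depends on the contents of x,
   which typically makes auto-vectorization harder. Returns n if no element
   is negative. */
size_t find_first_negative(const float *x, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (x[i] < 0.0f)
            return i;
    return n;
}
```

To see what the compiler actually did, one option is gcc -O3 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed -c on the source file; with Clang, -Rpass=loop-vectorize -Rpass-missed=loop-vectorize gives a similar report. The outcome depends on compiler version and target, so check rather than assume.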
How To Use It
- When a loop is arithmetic-heavy and memory-friendly, consider SIMD (compiler auto-vectorization or intrinsics).
- When a loop is dependency-chain-bound (a reduction that feeds itself), unroll with multiple accumulators to let the out-of-order engine do work in parallel.
- For indirect calls in hot paths, remember that the core may mispredict them; devirtualize or inline when possible.
- Use llvm-mca or uiCA to see the static per-port utilization of a short hot loop, or Intel VTune to measure it on a real run.
- Trust the ISA's memory model and consistency rules -- do not assume in-order at the machine level.
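The multiple-accumulator advice above can be sketched in portable C on the dot product from the Concrete Example. This is an illustration under stated assumptions, not a tuned kernel: the choice of four accumulators is arbitrary, and the right count depends on the FMA latency and port count of the target core.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every multiply-add waits for the previous one, so the
   loop is bound by the latency of the dependency chain through s. */
double dot_naive(const double *a, const double *b, size_t n) {
    assert(a != NULL && b != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* Four accumulators: four independent chains that the out-of-order engine
   can overlap. Same number of multiplies and adds as dot_naive. */
double dot_4acc(const double *a, const double *b, size_t n) {
    assert(a != NULL && b != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)          /* scalar tail for the leftover elements */
        s += a[i] * b[i];
    return s;
}
```

Because floating-point addition is not associative, dot_4acc can differ from dot_naive in the last bits. That is also why compilers generally will not make this transformation on their own without a flag such as -ffast-math.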
Check Yourself
- What is the difference between "issue width" and "retire width"?
- Why does out-of-order execution still preserve architectural correctness?
- When does SIMD help and when does it not?
- How does speculation interact with branch prediction to keep a deep pipeline busy?
Mini Drill or Application
- Take a naive sum loop and add multiple accumulators (s0, s1, s2, s3). Benchmark before and after. Explain why the unrolled version is faster even though it executes the same number of adds.
- Use llvm-mca (or uiCA online) on a ~10-instruction loop and identify the bottleneck port.
- Read about a specific microarchitecture in Agner Fog's microarchitecture manual and note the issue width, number of integer ALU ports, and FP pipe throughput.
- Force a mispredict-heavy path (a 50/50 data-dependent branch) and watch branch-misses balloon in perf stat. Explain why out-of-order execution does not rescue this case.
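One possible starting point for the mispredict drill, as a sketch: random bytes give a branch that is taken about half the time, and a branch-free variant computes the same count for comparison. All names here are hypothetical helpers, and the xorshift generator is only there to make the data reproducible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift PRNG (illustrative helper); the state must be nonzero. */
uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill buf with pseudo-random bytes taken from the top 8 bits of each draw. */
void fill_random_bytes(uint8_t *buf, size_t n, uint32_t seed) {
    assert(seed != 0);
    for (size_t i = 0; i < n; ++i)
        buf[i] = (uint8_t)(xorshift32(&seed) >> 24);
}

/* Branchy: on random bytes the comparison goes either way with no pattern
   for the predictor to learn. */
size_t count_ge_128_branchy(const uint8_t *buf, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (buf[i] >= 128)
            ++count;
    }
    return count;
}

/* Branch-free: the top bit of the byte is 1 exactly when the value >= 128. */
size_t count_ge_128_branchless(const uint8_t *buf, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (size_t)(buf[i] >> 7);
    return count;
}
```

Call the branchy version many times over a large buffer from a small driver program and run it under perf stat -e branches,branch-misses. An optimizing compiler may turn the branchy loop into branch-free or vectorized code by itself, so inspect the assembly or build at a low optimization level if the miss count stays flat.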
Speculation and Security
Speculation is not just a performance feature -- it is a security perimeter. Spectre- and Meltdown-class attacks exploit the fact that speculatively executed instructions can leave measurable microarchitectural side effects (cache lines, TLB entries) even after a misprediction rolls back the architectural state. The attacker then uses cache-timing probes to read those side effects. You do not need to defend against these at the C level yet -- but you should know they exist, because any deep understanding of CPUs past this point must include them.
Read This Only If Stuck
- Computer Organization and Design: 4.11 Real Stuff -- Opteron X4 Barcelona Pipeline
- Computer Organization and Design: 3.6 Parallelism and Computer Arithmetic -- Associativity
- Computer Organization and Design: 7.6 SISD/MIMD/SIMD/SPMD and Vector
- Computer Organization and Design: 7.6 SISD/MIMD/SIMD/SPMD and Vector (Part 2)
- External: Agner Fog, Software Optimization Resources