Superscalar, Out-of-Order, SIMD, and Speculation

What This Concept Is

Modern cores go beyond "one instruction per cycle" in four intersecting ways:

Superscalar issue -- fetch, decode, and issue multiple instructions per cycle through multiple execution ports. A 2-wide, 4-wide, or 8-wide core can issue up to that many instructions per cycle when dependencies and port availability allow; the retire width is a separate limit and need not match the issue width.
Out-of-order execution -- the core reorders independent instructions around stalled ones via a reservation station / reorder buffer. Instructions retire in program order but execute as soon as their operands are ready.
SIMD (single instruction, multiple data) -- one instruction operates on a vector of lanes. On x86_64: SSE (128-bit), AVX/AVX2 (256-bit), AVX-512 (512-bit). On ARM: NEON and SVE. A 256-bit FMA instruction processes 4 doubles at once, versus 1 for a scalar FMA.
Speculation -- branch prediction, indirect-call prediction, and memory-disambiguation prediction all let the core keep going past unresolved control and data, rolling back if the guess was wrong.

Together these turn a "simple" 5-stage pipeline into a machine that can sustain 3-4 IPC on well-tuned code and near zero IPC on pointer chases.

Why It Matters Here

This is a supporting concept because, on such a core, performance depends less on how many instructions you execute and more on which shared resources they compete for:

total issue bandwidth (e.g. 4 µops/cycle)
per-port throughput (e.g. one FP multiply per cycle on a specific port)
retirement bandwidth
the reorder buffer size (how much the core can speculate past a miss)

When a tight loop stops getting faster despite removing instructions, you are likely port-bound or ROB-bound, not instruction-bound.

Speculation is also the foundation under Spectre-style side channels -- a useful reason to understand it precisely.

Concrete Example

Consider a scalar vs SIMD dot product:

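// Scalar
double s = 0.0;
for (int i = 0; i < N; ++i) s += a[i] * b[i];

// AVX2 + FMA (manually, with intrinsics from <immintrin.h>)
__m256d acc = _mm256_setzero_pd();
int i = 0;
for (; i + 4 <= N; i += 4) {               // stop before reading past the end
    __m256d va = _mm256_loadu_pd(a + i);   // unaligned load of 4 doubles
    __m256d vb = _mm256_loadu_pd(b + i);
    acc = _mm256_fmadd_pd(va, vb, acc);    // acc += va * vb, per lane
}
// horizontal reduce acc, then add the remaining N % 4 elements in scalar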

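On a core with one 256-bit FMA port and 4-wide issue, the scalar loop runs at roughly one FMA per 4 cycles, because each iteration must wait for the previous FMA's result (a latency-bound reduction, assuming a 4-cycle FMA latency and no reassociation by the compiler). That is 2 flops every 4 cycles, or 0.5 flops per cycle. The vector loop processes 4 lanes per instruction and -- with multiple accumulators to hide the latency -- can sustain one 256-bit FMA per cycle, or 8 flops per cycle (more on cores with a second FMA port). That is roughly a 16x difference on the same algorithm.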

Out-of-order in action:

    ld     x1, [a]          # 100-cycle DRAM miss
    add    x2, x3, x4       # independent -- dispatches immediately
    mul    x5, x6, x7       # independent -- dispatches
    add    x8, x1, x9       # blocked on x1 (sits in ROB)
    ...                     # more independent work, also dispatches

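The core keeps executing independent work until the reorder buffer fills, which happens because nothing younger than the missing load can retire before the load does. The dependent add waits for x1 without blocking the independent instructions behind it.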
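The same effect shows up at the source level in how load addresses are produced. The following is a minimal sketch, not a benchmark; `struct node`, `sum_list`, and `sum_array` are hypothetical names introduced for illustration. It contrasts a pointer chase, where each load address depends on the previous load, with an indexed loop whose addresses are all known in advance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node, for illustration only. */
struct node {
    struct node *next;
    long value;
};

/* Dependent chain: the address of each node comes from the previous
   load, so a cache miss on one node delays every node after it. */
long sum_list(const struct node *head) {
    long total = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        total += p->value;
    return total;
}

/* Independent loads: every address is computable from i alone, so an
   out-of-order core can have several misses in flight at once. */
long sum_array(const long *values, size_t n) {
    assert(n == 0 || values != NULL);
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += values[i];
    return total;
}
```

Both functions do the same arithmetic. On data that misses in cache, the list version is expected to be limited by memory latency per node, while the array version can overlap misses; how large the gap is depends on the machine and the data layout.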

Common Confusion / Misconception

"Out-of-order means programs are not really deterministic." No. The architectural state visible to software is updated in program order at retire. The core may execute things in a different order internally, but anything a correct program observes matches the sequential semantics the ISA defines.

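Another trap: "SIMD = automatic speedup if I enable -O3." Compilers auto-vectorize simple loops with predictable trip counts and contiguous access. Anything with data-dependent branches, possible pointer aliasing, or reductions the compiler is not allowed to reorder (floating-point sums without flags such as -ffast-math, for example) often falls back to scalar. Checking the generated assembly, or the vectorizer reports (-fopt-info-vec-missed on GCC, -Rpass-missed=loop-vectorize on Clang), tells you which loops were not vectorized and why.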
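To make the aliasing and data-dependent-branch obstacles concrete, here is a small sketch with two hypothetical functions, `axpy` and `find_first_negative`. Whether a particular compiler vectorizes either one is an assumption to check against its assembly output or vectorizer report, not something this example guarantees.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i]. The `restrict` qualifiers promise that x and y do
   not overlap, which removes the aliasing obstacle; the loop has a
   known trip count and contiguous access. */
void axpy(size_t n, float a, const float *restrict x, float *restrict y) {
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Data-dependent early exit: the trip count is not known before the
   loop runs, which is often harder for a vectorizer to handle. */
size_t find_first_negative(const float *x, size_t n) {
    assert(n == 0 || x != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (x[i] < 0.0f)
            return i;
    }
    return n; /* no negative element found */
}
```

Dropping `restrict` from `axpy` keeps the function correct, but the compiler then has to assume x and y may overlap and usually either emits a runtime overlap check or stays scalar.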

How To Use It

When a loop is arithmetic-heavy and memory-friendly, consider SIMD (compiler auto-vectorization or intrinsics).
When a loop is dependency-chain-bound (a reduction that feeds itself), unroll with multiple accumulators to let the out-of-order engine do work in parallel.
For indirect calls in hot paths, remember that the core may mispredict them; devirtualize or inline when possible.
Use llvm-mca or uiCA to get a static per-port utilization estimate for a short hot loop, and a profiler such as Intel VTune to confirm the bottleneck with measurements from a real run.
Trust the ISA's memory model and consistency rules -- do not assume in-order at the machine level.
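The multiple-accumulator advice above can be sketched as follows. This is a minimal illustration under stated assumptions: `sum_chain` and `sum_unrolled4` are hypothetical names, and the choice of four accumulators is arbitrary rather than tuned to any particular core.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every add must wait for the previous add's result,
   so the loop runs at the latency of one floating-point add. */
double sum_chain(const double *x, size_t n) {
    assert(n == 0 || x != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Four accumulators: four independent dependency chains that the
   out-of-order engine can advance in parallel. */
double sum_unrolled4(const double *x, size_t n) {
    assert(n == 0 || x != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) /* scalar tail for the n % 4 leftover elements */
        s += x[i];
    return s;
}
```

Both versions execute the same number of adds. Because floating-point addition is not associative, the two results can differ in the last bits on general data, which is also why a compiler will not make this transformation for you without a flag that permits reassociation.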

Check Yourself

What is the difference between "issue width" and "retire width"?
Why does out-of-order execution still preserve architectural correctness?
When does SIMD help and when does it not?
How does speculation interact with branch prediction to keep a deep pipeline busy?

Mini Drill or Application

Take a naive sum loop and add multiple accumulators (s0, s1, s2, s3). Benchmark before and after. Explain why the unrolled version is faster even though it executes the same number of adds.
Use llvm-mca (or uiCA online) on a ~10-instruction loop and identify the bottleneck port.
Read about a specific microarchitecture in Agner Fog's microarchitecture manual and note the issue width, number of integer ALU ports, and FP pipe throughput.
Force a mispredict-heavy path (a 50/50 data-dependent branch) and watch branch-misses balloon in perf stat. Explain why out-of-order execution does not rescue this case.
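For the mispredict drill in the last item, one possible starting point is the pair of functions below. The names `count_ge_branchy` and `count_ge_branchless` are hypothetical, and the sketch assumes you fill the input with random bytes and use a threshold near the middle of the range so the condition is true about half the time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchy form: on random data the `if` is taken about half the time,
   which gives the branch predictor nothing to learn. */
size_t count_ge_branchy(const uint8_t *data, size_t n, uint8_t threshold) {
    assert(n == 0 || data != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] >= threshold)
            ++count;
    }
    return count;
}

/* Branchless form: the comparison result (0 or 1) is added directly,
   so the source has no data-dependent control flow in the loop body. */
size_t count_ge_branchless(const uint8_t *data, size_t n, uint8_t threshold) {
    assert(n == 0 || data != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (size_t)(data[i] >= threshold);
    return count;
}
```

An optimizing compiler may turn the branchy form into branchless or vectorized code on its own, so check the assembly before trusting a measurement. Running the branchy form on sorted versus shuffled input is another way to see the effect in perf stat.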

Speculation and Security

Speculation is not just a performance feature -- it is a security perimeter. Spectre- and Meltdown-class attacks exploit the fact that speculatively executed instructions can leave measurable microarchitectural side effects (cache lines, TLB entries) even after the speculative work is squashed and the architectural state is rolled back. The attacker then uses cache-timing probes to read those side effects. You do not need to defend against these at the C level yet -- but you should know they exist, because any deep understanding of CPUs past this point must include them.
