How the Hardware-Software Contract Shapes Performance

What This Concept Is

The hardware-software contract is the set of promises the ISA and its surrounding microarchitecture make to software. It is what lets compilers and programs stay stable across hardware generations while real performance changes silently underneath.

The contract has three tiers, each with a different performance character:

  • Architectural (ISA-level): the instructions, register file, memory-ordering rules, and exception model. These are functionally binding: a correct program keeps working.
  • Microarchitectural: pipeline depth, issue width, cache sizes, branch predictors, TLB entries, memory controllers. These are not binding -- but they decide throughput.
  • Platform: OS scheduling, NUMA topology, thermal limits, frequency scaling. These can shift measured performance substantially, sometimes by 2x or more, with no code changes.

Three classic reasoning tools connect software decisions to wall-clock outcomes:

  • Amdahl's law -- if fraction p of a workload is parallel and 1-p is serial, speedup from N parallel workers is 1 / ((1-p) + p/N). The serial fraction caps total speedup.
  • CPU performance equation -- time = instructions × CPI × cycle_time. Optimize any one factor, accepting that the others may move.
  • Roofline model -- plot attainable performance vs arithmetic intensity (flops per byte loaded). Below the ridge, memory bandwidth bounds you; above, peak compute does. Which bound you are against determines which optimization to pursue.
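The CPU performance equation can be sketched in a few lines. The example below compares a scalar loop with a vectorized one; every number in it is a hypothetical chosen for illustration, not a measurement. It shows the "others may move" caveat: the vectorized version retires 4x fewer instructions, but its CPI rises, so the time saved is well short of 4x.

```python
def cpu_time(instructions, cpi, clock_hz):
    """time = instructions x CPI x cycle_time, with cycle_time = 1 / clock_hz."""
    return instructions * cpi * (1.0 / clock_hz)


# Hypothetical scalar loop: many cheap instructions.
scalar_s = cpu_time(instructions=4.0e9, cpi=0.5, clock_hz=3.0e9)

# Hypothetical vectorized loop: 4x fewer instructions, but each one costs more cycles.
simd_s = cpu_time(instructions=1.0e9, cpi=1.2, clock_hz=3.0e9)

print(f"scalar: {scalar_s:.3f} s")
print(f"simd:   {simd_s:.3f} s")
print(f"speedup: {scalar_s / simd_s:.2f}x from a 4x instruction reduction")
```

The same function works in reverse as a sanity check: given a measured time, instruction count, and clock rate, solve for CPI and ask whether it is plausible for the code in question.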

Why It Matters Here

This concept is the closing frame of the module. Every individual topic -- ISA, cache, pipeline, TLB -- is only useful when you can point at a measurement and say which layer is the current bottleneck and why. Without this frame, optimization devolves into cargo cult: prefetches that do nothing, SIMD on memory-bound loops, parallelism on serial critical paths.

Concrete Example

Dense matrix multiply on a modern desktop:

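  • Peak double-precision flops/cycle per core = microarchitecture-dependent. One FMA counts as 2 flops, so two scalar FMA units give 4; with AVX2 (4 doubles per vector) the typical figure is ~8-16, and AVX-512 cores with two 512-bit FMA units can reach 32.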
  • DRAM bandwidth = ~25 GB/s
  • Arithmetic intensity for unblocked matmul = O(1) flops/byte (with little cache reuse, each input element is reloaded from memory for nearly every output element that needs it)

Plug into the roofline: attainable GFLOPS ≈ min(peak_compute, intensity × bandwidth). At intensity = 0.5, you are pinned to 0.5 × 25 = 12.5 GFLOPS -- about 5-10% of peak if peak is in the 125-250 GFLOPS range. Blocking raises intensity to ~8-16 flops/byte (reusing each block from L1). The roofline itself does not move; the workload moves along it, to or past the ridge point at peak_compute / bandwidth, onto the compute-bound side. That is why blocked matmul can approach peak.
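The roofline arithmetic above can be checked with a short sketch. The 25 GB/s bandwidth comes from the example; the 200 GFLOPS peak is an assumption picked from the range the "5-10% of peak" figure implies, and the function names are illustrative.

```python
PEAK_GFLOPS = 200.0    # assumed peak compute, for illustration only
BANDWIDTH_GBS = 25.0   # DRAM bandwidth from the example


def attainable_gflops(intensity, peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBS):
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak, intensity * bandwidth)


def ridge_point(peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBS):
    """Arithmetic intensity (flops/byte) at which the two bounds meet."""
    return peak / bandwidth


def bound(intensity, peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBS):
    """Which roof limits a kernel running at this intensity."""
    return "memory" if intensity < ridge_point(peak, bandwidth) else "compute"


for intensity in (0.5, 2.0, 8.0, 16.0):
    gflops = attainable_gflops(intensity)
    share = 100.0 * gflops / PEAK_GFLOPS
    print(f"{intensity:5.1f} flops/byte -> {gflops:6.1f} GFLOPS "
          f"({share:5.1f}% of peak, {bound(intensity)}-bound)")
```

With these assumed numbers the ridge sits at 8 flops/byte, so the unblocked kernel at 0.5 is deep in the memory-bound region while a blocked kernel at 8-16 is at or past the ridge.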

Another: Amdahl's cruel reality. A workload that is 95% parallel looks great until you try to scale to 32 cores: 1 / (0.05 + 0.95/32) ≈ 12.5x. The other 19.5x of the ideal 32x was lost to the 5% serial fraction, and no core count can push past 1 / 0.05 = 20x. The lesson is not "parallelism is bad"; it is that you must attack the serial fraction first.
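A small sketch makes the scaling curve visible. It evaluates the Amdahl formula from earlier in this section for the 95%-parallel workload at several core counts, then repeats the 32-core case with a hypothetical 99%-parallel version to show what shrinking the serial fraction buys.

```python
def amdahl_speedup(p, n):
    """Speedup from n workers when fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p):
    """Upper bound on speedup as the worker count grows without limit."""
    return 1.0 / (1.0 - p)


for cores in (1, 2, 4, 8, 16, 32, 64, 1024):
    s = amdahl_speedup(0.95, cores)
    print(f"{cores:5d} cores: {s:6.2f}x speedup, {100.0 * s / cores:5.1f}% efficiency")

print(f"limit at p=0.95: {amdahl_limit(0.95):.1f}x")

# Hypothetical follow-up: cut the serial fraction from 5% to 1% and rerun at 32 cores.
print(f"32 cores at p=0.99: {amdahl_speedup(0.99, 32):.2f}x")
```

Going from 32 to 1024 cores adds less speedup than cutting the serial fraction from 5% to 1% at a fixed 32 cores, which is the point of "attack the serial fraction first".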

Common Confusion / Misconception

"Performance is one number." It is not. A workload has:

  • a latency (time to finish one unit of work, single-threaded)
  • a throughput (units per second at steady state)
  • a tail latency (p99, p999 -- what the worst few percent look like)
  • an energy cost per unit of work

Optimizing one of these often hurts the others. A cache-miss-heavy workload can still have decent throughput on a multi-core or multithreaded machine, because while one thread is stalled, others run; its single-thread latency, however, is bad. Understand which number the user cares about before you start tuning.
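To see how these numbers can disagree, the sketch below summarizes one set of request latencies three ways. The sample data is made up (98 fast requests and 2 slow ones), the percentile uses the simple nearest-rank definition, and throughput assumes a single worker serving requests back to back.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Mean, median, tail latency, and single-worker throughput for one run."""
    total_ms = sum(latencies_ms)
    return {
        "mean_ms": total_ms / len(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "throughput_per_s": len(latencies_ms) / (total_ms / 1000.0),
    }


# Hypothetical run: 98 requests at 1 ms, 2 requests at 100 ms.
samples_ms = [1.0] * 98 + [100.0] * 2
for name, value in summarize(samples_ms).items():
    print(f"{name:>18}: {value:.2f}")
```

The median says 1 ms, the mean says about 3 ms, and p99 says 100 ms. All three describe the same run, so which one counts as "the performance" depends on who is asking.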

How To Use It

  1. Before tuning, answer: what number am I optimizing, and what is the theoretical ceiling? Use roofline and the CPU equation.
  2. Measure before changing. perf stat with an explicit -e event list gives you cycles, instructions, branch-misses, cache-misses, and dTLB-load-misses in one command. perf record followed by perf report tells you where the time goes.
  3. Attack the dominant term first. A program bottlenecked on cache misses does not benefit from cleverer branches; one bottlenecked on mispredicts does not benefit from SIMD.
  4. Keep an architecture-aware mental model: if you double a working set, you may cross a cache boundary and the cost curve bends. Expect it; measure for it.
  5. Beware portable-looking benchmarks. A number from a laptop with boost clocks, different cache sizes, and power-saving governors is not the number you get on production hardware. Re-benchmark on target.

Check Yourself

  1. State Amdahl's law and describe a program where the serial fraction would dominate at 16 cores.
  2. Write the CPU performance equation and describe a change that improves one factor at the cost of another.
  3. What is the roofline model, and how does arithmetic intensity decide which bound applies?
  4. Why is "performance is one number" a misleading frame? Give two axes that can disagree.

Mini Drill or Application

Pick a real program you have written (or a microbenchmark from earlier in the module). Run:

perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses,dTLB-load-misses,dTLB-loads ./program

From the counters, compute:

  • IPC (instructions / cycles)
  • branch-miss rate (branch-misses / branches)
  • cache-miss rate (cache-misses / cache-references)
  • TLB-miss rate (dTLB-load-misses / dTLB-loads)
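The four ratios can be computed with a small helper once the counts are copied out of the perf stat output. This is a minimal sketch: the dictionary keys mirror the event names in the command above, and the counter values are hypothetical, not taken from any real run.

```python
def derive_metrics(counters):
    """Turn raw perf stat counts into the four ratios used in this drill."""
    return {
        "ipc": counters["instructions"] / counters["cycles"],
        "branch_miss_rate": counters["branch-misses"] / counters["branches"],
        "cache_miss_rate": counters["cache-misses"] / counters["cache-references"],
        "dtlb_miss_rate": counters["dTLB-load-misses"] / counters["dTLB-loads"],
    }


# Hypothetical counts, typed in by hand for illustration.
example_counters = {
    "cycles": 2_000_000_000,
    "instructions": 3_000_000_000,
    "branches": 500_000_000,
    "branch-misses": 10_000_000,
    "cache-references": 40_000_000,
    "cache-misses": 10_000_000,
    "dTLB-loads": 800_000_000,
    "dTLB-load-misses": 400_000,
}

for name, value in derive_metrics(example_counters).items():
    print(f"{name:>17}: {value:.4f}")
```

Keeping the ratios in one place makes the before/after comparison in the next step a matter of diffing two small dictionaries.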

Write a one-paragraph verdict: is this program front-end stall-bound, back-end memory-bound, compute-bound, or branch-bound? What would you try next?

Then, using the same program, change one thing -- rebuild with -O3, add -march=native, shrink the working set, pin to one core with taskset -c 0, or disable Turbo Boost -- and rerun. Explain which counter moved and why. The goal is to build the habit of predicting the shift before you see it.

How This Ties the Module Together

  • Cluster 1 taught you what the machine promises: instructions, registers, control flow.
  • Clusters 2-3 taught you the cost model: cycles per op, memory as a hierarchy.
  • Cluster 4 taught you parallel execution within one thread: pipelines and SIMD.
  • Cluster 5 taught you the wider system: I/O, virtual memory, and now the contract that binds them.

Everything after this module -- operating systems, databases, distributed systems, ML systems -- is a variation on the same theme: choose the right level of the contract to optimize, and measure to know when you have crossed a ceiling.

Read This Only If Stuck