Floating-Point Operations and Hardware Support

What This Concept Is

Floating-point (FP) numbers are a fixed-width, finite approximation of the real numbers. The IEEE-754 standard defines the bit layout, rounding rules, and special values (±0, ±∞, NaN) that virtually all modern CPUs implement. The two common widths are:

  • float / binary32: 1 sign bit + 8 exponent bits + 23 fraction bits (6-9 decimal digits)
  • double / binary64: 1 sign bit + 11 exponent bits + 52 fraction bits (15-17 decimal digits)

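For normal numbers, the value is x = (-1)^s · 1.fraction · 2^(exponent - bias). The exponent is stored biased (127 for float, 1023 for double), so the bit patterns of non-negative, non-NaN values sort in the same order as the values themselves, and comparing them as unsigned integers works. Negative values are stored sign-magnitude, so their integer order is reversed.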
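To make the layout concrete, here is a minimal sketch that splits a double into its three fields. It assumes double is IEEE-754 binary64 on the target; the struct and function names (fp64_fields, fp64_decode) are made up for this illustration and do not come from any library.

```c
#include <stdint.h>
#include <string.h>

/* Fields of a binary64 value (illustrative names, not a library type). */
struct fp64_fields {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 11 bits, still biased by 1023 */
    uint64_t fraction;  /* 52 bits, without the implicit leading 1 */
};

/* Assumes double is IEEE-754 binary64. */
static struct fp64_fields fp64_decode(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* well-defined way to read the bits */

    struct fp64_fields f;
    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FF);
    f.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}
```

For example, fp64_decode(1.0) yields sign 0, biased exponent 1023 (so 2^0), and fraction 0.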

Hardware support lives in a dedicated FP unit with its own registers (historically an "x87" coprocessor, today the SSE/AVX vector registers on x86_64 and the f registers on RISC-V with the F/D extensions). On x86_64 the scalar double-precision instructions include addsd, mulsd, divsd, sqrtsd, and vfmadd231sd (RISC-V: fadd.d, fmul.d, fmadd.d), each with single-precision and vector cousins.

Why It Matters Here

FP is a supporting concept in this module because the cost model is different from integer arithmetic and the bugs are famously subtle:

  • FP multiply and add typically have 3-5 cycles of latency but are fully pipelined, so throughput is at least one per cycle (often two on recent cores).
  • Division and square root are slow (10-40 cycles) and not well pipelined.
  • Fused multiply-add (fma) does a*b + c in one rounding step and is the core of matrix kernels.
  • Denormal inputs can silently slow an FP path by 100x on older cores (gradual underflow handled in microcode).
  • Associativity fails: (a + b) + c may not equal a + (b + c). Compilers therefore reorder FP operations only when told they may, e.g. with -ffast-math (which includes -fassociative-math).
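The associativity point in the list above is easy to check directly. The following is a small sketch (the helper name grouping_matters is hypothetical) that evaluates both groupings and reports whether they differ. It assumes the code is compiled without -ffast-math, so the compiler keeps the parentheses as written.

```c
/* Returns 1 if (a + b) + c and a + (b + c) round to different doubles.
   The volatile stores force each result to be rounded to double. */
static int grouping_matters(double a, double b, double c) {
    volatile double left  = (a + b) + c;
    volatile double right = a + (b + c);
    return left != right;
}
```

With a = 1e20, b = -1e20, c = 1.0 the left grouping gives 1.0, while the right grouping gives 0.0 because 1.0 is absorbed when added to -1e20.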

Concrete Example

double dot(const double *a, const double *b, int n) {
    double s = 0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

With gcc -O2 -mfma for x86_64:

dot:
    vxorpd %xmm0, %xmm0, %xmm0          # s = 0.0
    test %edx, %edx                     # n is the third argument (%edx)
    jle .Ldone
    ...
.Lloop:
    vmovsd (%rdi), %xmm1                # load a[i]
    vfmadd231sd (%rsi), %xmm1, %xmm0    # s += a[i] * b[i] (one rounding)
    add $8, %rdi
    add $8, %rsi
    cmp %rdx, %rdi                      # %rdx holds the end pointer of a
    jne .Lloop
.Ldone:
    ret

The loop issues one vfmadd per iteration. The critical path is the 4-cycle FMA feeding back into itself through %xmm0, so the loop is limited to one iteration every 4 cycles unless it is unrolled with multiple accumulators. Because that changes the order of the additions, the compiler does it on its own only when reassociation is allowed (-ffast-math, typically together with -O3); otherwise you unroll or vectorize by hand.
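A hand-unrolled version with multiple accumulators might look like the sketch below. The name dot4, the choice of four accumulators, and the size_t length are assumptions made for illustration; the best accumulator count depends on the FMA latency and throughput of the target core.

```c
#include <stddef.h>

/* Dot product with four independent accumulators, so four FMA
   dependency chains can be in flight at once.
   Note: the additions happen in a different order than in dot(), so the
   result can differ from the single-accumulator version in the last bits. */
double dot4(const double *a, const double *b, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    double s = (s0 + s1) + (s2 + s3);

    /* Handle the 0-3 leftover elements. */
    for (; i < n; ++i) s += a[i] * b[i];
    return s;
}
```

Each accumulator depends only on its own previous value, so the loop-carried chain no longer serializes every iteration behind a single register.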

Common Confusion / Misconception

"float is always faster than double." Per-lane latency on scalar FMA is often identical. float wins when (a) memory bandwidth matters (half the bytes) or (b) the CPU can pack 2x as many lanes into a SIMD register. It loses when the lost precision forces extra work, for example when an algorithm that converges in one pass in double needs several passes in float.

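Also: 0.1 + 0.2 != 0.3 in IEEE-754 not because of a bug but because 0.1, 0.2, and 0.3 all have infinite binary expansions. Each is rounded to the nearest representable double, and the rounding errors do not cancel. This is not fixable in binary FP; it is inherent to finite-precision arithmetic.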
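The contrast with fractions that are exact in binary can be seen in a short sketch. The helper name adds_up is hypothetical, and the code assumes double is binary64.

```c
/* Returns 1 when a + b, rounded to double, compares equal to expected. */
static int adds_up(double a, double b, double expected) {
    volatile double r = a + b;  /* volatile forces rounding to double */
    return r == expected;
}
```

adds_up(0.5, 0.25, 0.75) returns 1 because all three values are sums of powers of two and are stored exactly. adds_up(0.1, 0.2, 0.3) returns 0 because none of the three decimal values is stored exactly.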

How To Use It

  • When a workload is FP-heavy, check whether your inner loop uses FMA. If not, consider -mfma or an intrinsic.
  • If a profiler shows FP latency dominating, look for reduction variables (sum, dot, norm) and break them into multiple accumulators so the critical path shortens.
  • Avoid mixing FP and int in the same dependency chain; conversions (cvtsi2sd, cvttsd2si) cost a few cycles each and can inhibit vectorization.
  • For correctness: do not use == on FP. Use fabs(a - b) < eps or a relative-error bound.
  • For denormals: set "flush-to-zero" (FTZ) and "denormals-are-zero" (DAZ) in performance-sensitive code where the last few ulp do not matter.
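For the comparison advice in the list above, one possible shape for a tolerance check is sketched here. The function name nearly_equal and its two-tolerance signature are assumptions for illustration; suitable tolerance values depend on the computation and are not universal constants.

```c
#include <math.h>

/* Approximate equality with an absolute and a relative tolerance.
   The absolute tolerance handles values near zero, where a relative
   bound alone is too strict. Returns 0 if either input is NaN. */
int nearly_equal(double a, double b, double abs_tol, double rel_tol) {
    if (a == b) return 1;               /* exact match, incl. equal infinities */

    double diff = fabs(a - b);
    if (!isfinite(diff)) return 0;      /* NaN involved, or infinitely far apart */
    if (diff <= abs_tol) return 1;

    double scale = fabs(a) > fabs(b) ? fabs(a) : fabs(b);
    return diff <= rel_tol * scale;
}
```

A call such as nearly_equal(0.1 + 0.2, 0.3, 1e-12, 1e-9) treats the two values as equal, while nearly_equal(1.0, 1.1, 1e-12, 1e-9) does not.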

Check Yourself

  1. Why is 0.1 + 0.2 == 0.3 false in IEEE-754 binary64?
  2. What does a fused multiply-add do that a separate multiply and add cannot?
  3. When does float outperform double on modern hardware?
  4. Why does a reduction sum += a[i] * b[i] have a long critical path, and how does unrolling with multiple accumulators fix it?

Mini Drill or Application

  1. Write a tiny program that prints (double)0.1 + (double)0.2 - (double)0.3 and explain the result in terms of bit layout.
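  2. In Compiler Explorer, compile a float dot-product and a double dot-product with -O3 -mavx2 -ffast-math (-ffast-math lets the compiler reassociate the sum so the reduction itself can be vectorized). Count the lanes processed per vector instruction and compare.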
  3. Measure a sqrt loop and a rsqrt-approximation loop; note the trade-off between throughput and accuracy.
  4. Use perf stat -e fp_arith_inst_retired.scalar_double,fp_arith_inst_retired.256b_packed_double (or the equivalent on your CPU) to distinguish scalar from SIMD FP ops in a compiled kernel.

Gotchas Worth Remembering

  • Signed zero: +0.0 and -0.0 compare equal but have distinct bit patterns. 1/+0.0 is +inf; 1/-0.0 is -inf.
  • NaN propagation: almost any arithmetic with NaN produces NaN, and NaN compares unequal to everything, including itself. This makes NaN a useful "poison" for catching uninitialized FP values.
  • -ffast-math caveats: it enables reassociation and drops strict NaN/inf handling. Great for physics kernels; disastrous in financial code.
  • Integer-to-FP conversions: cvtsi2sd has a false dependency on the destination register on some Intel microarchitectures. Zeroing the register first (xorpd) avoids it.
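The signed-zero and NaN behavior in the list above can be observed with two small helpers. This is a sketch with hypothetical names (same_value, same_bits), assuming IEEE-754 doubles and the default non-trapping FP environment.

```c
#include <string.h>

/* Numeric equality, as the == operator sees it. */
static int same_value(double a, double b) {
    return a == b;
}

/* Bitwise equality of the stored representations. */
static int same_bits(double a, double b) {
    return memcmp(&a, &b, sizeof a) == 0;
}
```

With pz = 0.0 and nz = -0.0, same_value(pz, nz) is 1 but same_bits(pz, nz) is 0, and 1.0 / pz and 1.0 / nz have opposite signs. For a NaN value x, same_value(x, x) is 0, which is the usual test for NaN when isnan is not available.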

Read This Only If Stuck