Floating-Point Operations and Hardware Support
What This Concept Is
Floating-point (FP) numbers are a fixed-width, finite approximation of the real numbers. The IEEE-754 standard defines the bit layout, rounding rules, and special values (±0, ±∞, NaN) that virtually all modern CPUs implement. The two common widths are:
- float / binary32: 1 sign bit + 8 exponent bits + 23 fraction bits (7 decimal digits)
- double / binary64: 1 sign bit + 11 exponent bits + 52 fraction bits (15-17 decimal digits)
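For normal numbers, a value is x = (-1)^s · 1.fraction · 2^(exponent - bias). The exponent is stored biased (127 for float, 1023 for double), so the bit patterns of non-negative, non-NaN values sort in the same order as the values themselves, and comparing such FP numbers as integers works. For two negative values the integer order is reversed, because the format is sign-magnitude.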
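The field layout above can be checked directly by pulling a double apart. The following is a minimal sketch; the names fp64_fields and fp64_decode are made up for this illustration and are not part of any standard library.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a binary64 value. */
struct fp64_fields {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 11 bits, still biased by 1023 */
    uint64_t fraction;   /* 52 bits, without the implicit leading 1 */
};

/* Split a double into sign, biased exponent, and fraction.
   memcpy is the portable way to reinterpret the bits. */
struct fp64_fields fp64_decode(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);

    struct fp64_fields f;
    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FF);
    f.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}
```

For 1.0 this gives sign 0, exponent 1023 (that is, 2^0 after removing the bias), and fraction 0. For 1.5 only the top fraction bit is set, since 1.5 = 1.1 in binary.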
Hardware support lives in a separate unit (historically an "x87" coprocessor, today the SSE/AVX vector registers on x86_64 and the f/d register file on RISC-V with the F/D extensions). Instructions include addsd, mulsd, divsd, sqrtsd, fmadd, and their single-precision and vector cousins.
Why It Matters Here
FP is a supporting concept in this module because the cost model is different from integer arithmetic and the bugs are famously subtle:
- FP multiply and add typically have 3-5 cycles of latency but are fully pipelined, so a core can start at least one per cycle.
- Division and square root are slow (10-40 cycles) and not well pipelined.
- Fused multiply-add (fma) does a*b + c in one rounding step and is the core of matrix kernels.
- Denormal inputs can silently slow an FP path by 100x on older cores (gradual underflow handled in microcode).
- Associativity fails: (a + b) + c may not equal a + (b + c). Compilers only reorder FP with -ffast-math.
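The associativity point is easy to reproduce. Below is a small sketch with two hypothetical helpers, sum_left and sum_right, that add the same three values in the two groupings. The volatile temporaries are an assumption of this sketch: they force each intermediate to be stored as a double so it is rounded exactly where the source says.

```c
/* (a + b) + c: the first sum is rounded to double before c is added. */
double sum_left(double a, double b, double c) {
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c): the second sum is rounded to double before a is added. */
double sum_right(double a, double b, double c) {
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e20, b = -1e20, c = 1.0, sum_left returns 1.0 because a + b cancels exactly, while sum_right returns 0.0 because 1.0 is far below the spacing of doubles near 1e20 and is absorbed when added to b.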
Concrete Example
double dot(const double *a, const double *b, int n) {
    double s = 0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}
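With gcc -O2 -mfma for x86_64, the output looks roughly like this (AT&T syntax):
dot:
    vxorpd %xmm0, %xmm0, %xmm0         # s = 0.0
    test %edx, %edx                    # n is the third argument (%edx)
    jle .Ldone
    ...
.Lloop:
    vmovsd (%rdi), %xmm1               # load a[i]
    vfmadd231sd (%rsi), %xmm1, %xmm0   # s += a[i] * b[i] (one rounding)
    add $8, %rdi
    add $8, %rsi
    cmp %rdx, %rdi                     # end of a? (%rdx assumed to hold the end pointer from the elided setup)
    jne .Lloop
.Ldone:
    ret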
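The loop issues one vfmadd per iteration. The critical path is the FMA feeding back into itself through %xmm0: with a typical 4-cycle FMA latency, the loop completes at most one iteration every 4 cycles, even though the FMA units could accept a new operation every cycle. Unrolling with multiple independent accumulators removes that bottleneck. Because it changes the order of the additions, the compiler will not do this at -O3 alone; it needs permission to reassociate (-ffast-math or -fassociative-math), or the accumulators must be written by hand.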
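A hand-written version of the multiple-accumulator idea might look like the sketch below. The name dot4, the choice of four accumulators, and the use of size_t for the length are assumptions of this example; the right accumulator count depends on the FMA latency and throughput of the target core.

```c
#include <stddef.h>

/* Dot product with four independent accumulators.
   Each of s0..s3 is its own dependency chain, so up to four
   multiply-adds can be in flight at once instead of one. */
double dot4(const double *a, const double *b, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    /* Remainder loop for n not divisible by 4. */
    for (; i < n; ++i)
        s0 += a[i] * b[i];

    /* Combining partial sums changes the rounding order relative to
       the simple loop, so the result may differ in the last bits. */
    return (s0 + s1) + (s2 + s3);
}
```

The result is not guaranteed to be bit-identical to the single-accumulator dot, which is exactly why the compiler does not make this transformation without being told that reassociation is acceptable.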
Common Confusion / Misconception
"float is always faster than double." Per-lane latency on scalar FMA is often identical. float wins when (a) memory bandwidth matters (half the bytes) or (b) the CPU can pack 2x as many lanes into a SIMD register. It loses when the lost precision forces extra work, for example when an iterative method needs more passes to reach the same tolerance.
Also: 0.1 + 0.2 != 0.3 in IEEE-754 not because of a bug but because 0.1, 0.2, and 0.3 all have infinite binary expansions. Each is rounded to the nearest representable value, and the rounded sum of the first two lands on a different double than the rounded 0.3. This is not fixable; it is the definition of finite-precision arithmetic.
How To Use It
- When a workload is FP-heavy, check whether your inner loop uses FMA. If not, consider -mfma or an intrinsic.
- If a profiler shows FP latency dominating, look for reduction variables (sum, dot, norm) and break them into multiple accumulators so the critical path shortens.
- Avoid mixing FP and int in the same ALU chain; conversions (cvtsi2sd, cvttsd2si) cost a few cycles and break vectorization.
- For correctness: do not use == on FP. Use fabs(a - b) < eps or a relative-error bound.
- For denormals: set "flush-to-zero" (FTZ) and "denormals-are-zero" (DAZ) in performance-sensitive code where the last few ulp do not matter.
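For the comparison advice, one possible shape of a tolerance check is sketched here. The function name nearly_equal and its two tolerance parameters are assumptions of this example, not a standard API; suitable tolerances depend on the computation that produced the values.

```c
#include <math.h>
#include <stdbool.h>

/* True if a and b are within abs_tol of each other, or within
   rel_tol of the larger magnitude. NaN inputs always give false,
   because every comparison involving NaN is false. */
bool nearly_equal(double a, double b, double rel_tol, double abs_tol) {
    if (a == b)
        return true;                 /* exact match, including equal infinities */
    if (isinf(a) || isinf(b))
        return false;                /* an infinity is never close to anything else */

    double diff  = fabs(a - b);
    double scale = fabs(a) > fabs(b) ? fabs(a) : fabs(b);
    return diff <= abs_tol || diff <= rel_tol * scale;
}
```

For instance, nearly_equal(0.1 + 0.2, 0.3, 1e-12, 0.0) is true even though the two values differ in the last bit. The absolute tolerance matters near zero, where a purely relative bound becomes arbitrarily strict.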
Check Yourself
- Why is 0.1 + 0.2 == 0.3 false in IEEE-754 binary64?
- What does a fused multiply-add do that a separate multiply and add cannot?
- When does float outperform double on modern hardware?
- Why does a reduction sum += a[i] * b[i] have a long critical path, and how does unrolling with multiple accumulators fix it?
Mini Drill or Application
- Write a tiny program that prints (double)0.1 + (double)0.2 - (double)0.3 and explain the result in terms of bit layout.
- In Compiler Explorer, compile a float dot-product and a double dot-product with -O3 -mavx2 -ffast-math (without -ffast-math the reduction stays scalar). Count the FP ops per vector register and compare.
- Measure a sqrt loop and a rsqrt-approximation loop; note the trade-off between throughput and accuracy.
- Use perf stat -e fp_arith_inst_retired.scalar_double,fp_arith_inst_retired.256b_packed_double (or the equivalent on your CPU) to distinguish scalar from SIMD FP ops in a compiled kernel.
Gotchas Worth Remembering
- Signed zero: +0.0 and -0.0 compare equal but have distinct bit patterns. 1/+0.0 is +inf; 1/-0.0 is -inf.
- NaN propagation: any arithmetic with NaN produces NaN. This makes NaN a useful "poison" for catching uninitialized FP values.
- -ffast-math caveats: it enables reassociation and drops strict NaN/inf handling. Great for physics kernels; disastrous in financial code.
- Integer-to-FP conversions: cvtsi2sd has a false dependency on the destination register on some Intel microarchitectures. Zeroing the register first (xorpd) avoids it.
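The signed-zero and NaN behaviour can be observed with a few lines of C. This is a sketch that assumes strict IEEE-754 semantics (that is, compiled without -ffast-math); the helper names are made up for the illustration.

```c
#include <math.h>
#include <stdbool.h>

/* +0.0 and -0.0 compare equal, but signbit() tells them apart. */
bool zeros_equal_but_distinct(void) {
    double pz = 0.0, nz = -0.0;
    return pz == nz && !signbit(pz) && signbit(nz);
}

/* Dividing by a signed zero yields an infinity of the matching sign. */
double reciprocal(double x) {
    return 1.0 / x;
}

/* NaN is the only value that compares unequal to itself. */
bool is_nan_by_compare(double x) {
    return x != x;
}
```

Here reciprocal(0.0) is +inf and reciprocal(-0.0) is -inf, and is_nan_by_compare(NAN + 1.0) is true because the addition propagates the NaN. Under -ffast-math the compiler is allowed to assume NaNs never occur, so a check like x != x may be optimized away.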