Floating-Point Operations and Hardware Support

What This Concept Is

Floating-point (FP) numbers are a fixed-width, finite approximation of the real numbers. The IEEE-754 standard defines the bit layout, rounding rules, and special values (±0, ±∞, NaN) that virtually all modern CPUs implement. The two common widths are:

float / binary32: 1 sign bit + 8 exponent bits + 23 fraction bits (about 7 decimal digits)
double / binary64: 1 sign bit + 11 exponent bits + 52 fraction bits (15-17 decimal digits)

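For normal numbers, the value is x = (-1)^s · 1.fraction · 2^(exponent - bias). The exponent is stored biased (127 for float, 1023 for double), so the bit patterns of non-negative, non-NaN values sort in the same order as the values themselves, and comparing them as unsigned integers works. For two negative values the integer order is reversed, so the trick needs extra care once signs are involved.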
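To make the layout concrete, here is a minimal C sketch that splits a binary64 value into its three fields. The names fp_fields, double_bits, and decode are hypothetical helpers invented for this illustration, and the code assumes double is the 64-bit IEEE-754 type, which the static_assert checks only by size.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "sketch assumes a 64-bit double");

/* Hypothetical helper type: the three binary64 fields. */
struct fp_fields {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 11 bits, still biased by 1023 */
    uint64_t fraction;  /* 52 bits; the leading 1 of a normal number is not stored */
};

/* Copy the bytes of a double into an integer (memcpy avoids aliasing problems). */
uint64_t double_bits(double x) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Split the bit pattern into sign, biased exponent, and fraction. */
struct fp_fields decode(double x) {
    uint64_t u = double_bits(x);
    struct fp_fields f;
    f.sign     = (unsigned)(u >> 63);
    f.exponent = (unsigned)((u >> 52) & 0x7FF);
    f.fraction = u & ((UINT64_C(1) << 52) - 1);
    return f;
}
```

For 1.0 this gives sign 0, exponent 1023, fraction 0; for -2.0 it gives sign 1, exponent 1024, fraction 0. Because 1.0 and 2.0 are both positive, double_bits(1.0) < double_bits(2.0) holds as an integer comparison.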

Hardware support lives in a dedicated unit with its own registers (historically an "x87" coprocessor, today the SSE/AVX vector registers on x86_64 and the f register file on RISC-V with the F/D extensions). Instructions include addsd, mulsd, divsd, and sqrtsd on x86_64, fused multiply-add forms (vfmadd231sd on x86_64, fmadd.d on RISC-V), and their single-precision and vector cousins.

Why It Matters Here

FP is a supporting concept in this module because the cost model is different from integer arithmetic and the bugs are famously subtle:

FP multiply and add typically cost 3-5 cycles of latency but are fully pipelined, so a core can start at least one per cycle (often two on recent x86 cores).
Division and square root are slow (10-40 cycles) and not well pipelined.
Fused multiply-add (fma) does a*b + c in one rounding step and is the core of matrix kernels.
Denormal inputs can silently slow an FP path by 100x on older cores (gradual underflow handled in microcode).
Associativity fails: (a + b) + c may not equal a + (b + c). Compilers therefore reassociate FP expressions only when told they may, for example with -ffast-math or the narrower -fassociative-math.
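The associativity point is easy to reproduce. The sketch below adds the same three numbers in two different orders; sum_forward and sum_backward are hypothetical names chosen for this illustration, and the code assumes it is compiled without -ffast-math.

```c
#include <assert.h>
#include <stddef.h>

/* Add the elements front to back: ((x[0] + x[1]) + x[2]) + ... */
double sum_forward(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

/* Add the same elements back to front. */
double sum_backward(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = n; i-- > 0; ) s += x[i];
    return s;
}
```

With x = {1e30, -1e30, 1.0}, sum_forward returns 1.0 because the two large terms cancel before the 1.0 is added. sum_backward returns 0.0 because the 1.0 is absorbed when it is added to -1e30 first.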

Concrete Example

double dot(const double *a, const double *b, int n) {
    double s = 0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

With gcc -O2 -mfma for x86_64, the loop comes out roughly as follows (AT&T syntax):

dot:
    vxorpd      %xmm0, %xmm0, %xmm0       # s = 0.0
    test        %edx, %edx                # n is the third argument (%edx)
    jle         .Ldone
    ...
.Lloop:
    vmovsd      (%rdi), %xmm1             # load a[i]
    vfmadd231sd (%rsi), %xmm1, %xmm0      # s += a[i] * b[i]  (one rounding)
    add         $8, %rdi
    add         $8, %rsi
    cmp         %rdx, %rdi
    jne         .Lloop
.Ldone:
    ret

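The loop issues one vfmadd per iteration. The critical path is the FMA feeding back into itself through %xmm0: with a 4-cycle FMA latency, the loop completes at most one iteration every 4 cycles, even though the FMA unit could start a new operation every cycle. Unrolling with multiple accumulators splits that single chain into several independent ones. Because this reassociates the sum, compilers generally will not do it at -O2 or -O3 alone; it takes -ffast-math (or a similar flag that permits reassociation) or accumulators written by hand.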
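A hand-written version of the multiple-accumulator idea might look like the sketch below. The name dot4, the choice of four accumulators, and the size_t length are assumptions made for illustration; the right accumulator count depends on the FMA latency and throughput of the target core.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with four independent partial sums.
   Each accumulator is its own dependency chain, so up to four
   multiply-adds can be in flight at once instead of one. */
double dot4(const double *a, const double *b, size_t n) {
    assert((a != NULL && b != NULL) || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) s0 += a[i] * b[i];  /* leftover 0-3 elements */
    return (s0 + s1) + (s2 + s3);          /* combine the partial sums */
}
```

Because the additions happen in a different order than in dot, the result can differ from the sequential version in the last few bits. That difference is exactly why the compiler does not make this transformation unless it is allowed to reassociate.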

Common Confusion / Misconception

"float is always faster than double." Per-lane latency on scalar FMA is often identical. float wins when (a) memory bandwidth matters (half the bytes) or (b) the CPU can pack 2x as many lanes into a SIMD register. It loses when precision loss turns one pass into three.

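Also: 0.1 + 0.2 != 0.3 in IEEE-754 not because of a bug but because 0.1, 0.2, and 0.3 all have infinite binary expansions. Each literal is rounded to the nearest representable double, and the rounded sum of the first two lands on a different double than the rounded 0.3. This is not fixable; it is the definition of finite-precision arithmetic.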

How To Use It

When a workload is FP-heavy, check whether your inner loop uses FMA. If not, consider -mfma or an intrinsic.
If a profiler shows FP latency dominating, look for reduction variables (sum, dot, norm) and break them into multiple accumulators so the critical path shortens.
Avoid mixing FP and int in the same dependency chain; conversions (cvtsi2sd, cvttsd2si) cost a few cycles each and can prevent vectorization.
For correctness: do not use == on FP. Use fabs(a - b) < eps or a relative-error bound.
For denormals: set "flush-to-zero" (FTZ) and "denormals-are-zero" (DAZ) in performance-sensitive code where the last few ulp do not matter.
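For the correctness advice above, a tolerance-based comparison can be sketched as follows. nearly_equal is a hypothetical helper, not a standard library function, and the tolerances are parameters the caller must choose for the problem at hand. The absolute tolerance is there because a purely relative bound never accepts a tiny value as close to zero.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: returns 1 when a and b differ by at most abs_tol,
   or by at most rel_tol times the larger magnitude; 0 otherwise. */
int nearly_equal(double a, double b, double rel_tol, double abs_tol) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    if (a == b) return 1;                          /* exact match, including equal infinities */
    if (!isfinite(a) || !isfinite(b)) return 0;    /* NaN or mismatched infinities */
    double diff   = fabs(a - b);
    double larger = fabs(a) > fabs(b) ? fabs(a) : fabs(b);
    return diff <= abs_tol || diff <= rel_tol * larger;
}
```

For example, nearly_equal(0.1 + 0.2, 0.3, 1e-12, 0.0) returns 1 even though 0.1 + 0.2 == 0.3 is false, while nearly_equal(1.0, 1.1, 1e-12, 0.0) returns 0.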

Check Yourself

Why is 0.1 + 0.2 == 0.3 false in IEEE-754 binary64?
What does a fused multiply-add do that a separate multiply and add cannot?
When does float outperform double on modern hardware?
Why does a reduction sum += a[i] * b[i] have a long critical path, and how does unrolling with multiple accumulators fix it?

Mini Drill or Application

Write a tiny program that prints (double)0.1 + (double)0.2 - (double)0.3 and explain the result in terms of bit layout.
In Compiler Explorer, compile a float dot-product and a double dot-product with -O3 -mavx2. Count the FP ops per vector register and compare.
Measure a sqrt loop and a rsqrt-approximation loop; note the trade-off between throughput and accuracy.
Use perf stat -e fp_arith_inst_retired.scalar_double,fp_arith_inst_retired.256b_packed_double (or the equivalent on your CPU) to distinguish scalar from SIMD FP ops in a compiled kernel.

Gotchas Worth Remembering

Signed zero: +0.0 and -0.0 compare equal but have distinct bit patterns. 1/+0.0 is +inf; 1/-0.0 is -inf.
NaN propagation: almost any arithmetic with NaN produces NaN, and every ordered comparison with NaN is false, so x != x is true only when x is NaN. This makes NaN a useful "poison" for catching uninitialized FP values.
-ffast-math caveats: it enables reassociation and drops strict NaN/inf handling. Great for physics kernels; disastrous in financial code.
Integer-to-FP conversions: cvtsi2sd has a false dependency on the destination register on some Intel microarchitectures. Zeroing the register first (xorpd) avoids it.
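The signed-zero and NaN rules above can be checked directly. The sketch below states each rule as an assert; check_special_values is a hypothetical name. It assumes IEEE-754 semantics with default (non-trapping) exception handling and no -ffast-math, and it uses volatile so the operations happen at run time rather than being folded by the compiler.

```c
#include <assert.h>
#include <math.h>

/* Each assert documents one rule; call this from main to verify the rules on your platform. */
void check_special_values(void) {
    volatile double pz = 0.0, nz = -0.0;
    assert(pz == nz);                      /* +0.0 and -0.0 compare equal */
    assert(!signbit(pz) && signbit(nz));   /* but their sign bits differ */
    assert(1.0 / pz == INFINITY);          /* 1/+0.0 is +inf */
    assert(1.0 / nz == -INFINITY);         /* 1/-0.0 is -inf */

    volatile double q = NAN;
    assert(q != q);                        /* NaN is unequal even to itself */
    assert(isnan(q + 1.0));                /* arithmetic on NaN stays NaN */
    assert(!(q < 1.0) && !(q > 1.0));      /* ordered comparisons are false */
}
```

If any of these fail on a given build, the usual suspects are -ffast-math or a non-default floating-point environment.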
