Floating-Point IEEE 754: Sign, Exponent, Mantissa, Special Values
What This Concept Is
IEEE 754 is the binary floating-point standard that virtually every mainstream CPU implements, and C's float and double normally map to its 32-bit and 64-bit formats. A number is split into three fields:
- sign (1 bit): 0 for non-negative, 1 for negative
- exponent (8 bits for float, 11 for double), stored with a bias (127 for float, 1023 for double)
- mantissa a.k.a. significand (23 bits for float, 52 for double), representing the fraction of 1.xxx...
The value of a normal number is:
(-1)^sign * 1.mantissa_bits * 2^(exponent - bias)
Two reserved values of the exponent field, all-zeros and all-ones, mark the cases that the formula above does not cover:
- all-zeros exponent: zero (mantissa zero) or subnormal (mantissa nonzero)
- all-ones exponent with zero mantissa: +Inf or -Inf
- all-ones exponent with nonzero mantissa: NaN (not a number)
Why It Matters Here
You need this whenever you touch numbers that are not integers:
- finance, simulation, machine learning, graphics, networking timers
- bit-level tricks like fast inverse square root
- wire formats and serialization (where endianness and padding matter)
- debugging tests that compare floats with == and fail mysteriously
If you treat a float as an exact decimal, you will ship bugs.
Concrete Example
Encoding 0.15625 as a 32-bit float:
0.15625 = 1/8 + 1/32 = 0.00101_2 = 1.01_2 * 2^-3.
- sign = 0
- exponent = -3 + 127 = 124 = 0b01111100
- mantissa (23 bits) = 01000000000000000000000
Concatenated: 0 01111100 01000000000000000000000 = 0x3E200000.
And 0.1 has no finite binary expansion: 0.1_10 = 0.00011001100110011..._2. Whatever the width, the stored value is rounded. The same is true of 0.2 and 0.3, so in double arithmetic the sum of the rounded 0.1 and the rounded 0.2 is not the double nearest to 0.3:
printf("%.20f\n", 0.1 + 0.2); /* 0.30000000000000004441 */
That is the canonical "floating point is not decimal" surprise.
Common Confusion / Misconception
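"Floats are just inexact doubles." They are a different precision with a different range: float has about 7 significant decimal digits, double about 15-17. Narrowing double to float rounds away precision and can overflow to infinity. Widening float to double is exact, but it cannot restore digits that were already rounded away.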
"I should test x == 0.0." Often wrong. Use a tolerance: fabs(x) < epsilon where epsilon is chosen from the problem. NaN breaks == entirely: NaN == NaN is false. Use isnan(x).
"Floating-point addition is associative." No. (a + b) + c != a + (b + c) can hold for finite non-special values, because every intermediate sum is rounded and the size of the rounding error depends on the magnitude of that sum.
How To Use It
Whenever floats show up:
- Know the format (float 32-bit, double 64-bit) and the rough precision.
- Compare with a tolerance, not with ==. Use isnan, isinf, isfinite.
- Avoid subtracting nearly equal numbers ("catastrophic cancellation") when you can.
- For wire formats, send the bit pattern (uint32_t/uint64_t) with an agreed endianness.
Check Yourself
- Why is 1.0 exact but 0.1 not?
- What does the bit pattern 0x7F800000 mean as a float?
- Why is NaN == NaN false?
Mini Drill or Application
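#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f = 0.15625f;
    uint32_t bits;

    /* Copy the float's bytes into an integer; memcpy avoids aliasing problems. */
    memcpy(&bits, &f, sizeof bits);
    /* PRIx32 is the portable conversion specifier for uint32_t. */
    printf("%.6f -> 0x%08" PRIx32 "\n", f, bits);

    /* 0.0f / 0.0f produces NaN on IEEE 754 implementations. */
    float nan = 0.0f / 0.0f;
    printf("nan == nan is %s\n", (nan == nan) ? "true" : "false");
    return 0;
}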
Compile: gcc -Wall -Wextra -o ieee ieee.c. Predict the hex pattern before running. Change the value to -0.15625f and to 0.1f; explain each hex result.