Floating-Point IEEE 754: Sign, Exponent, Mantissa, Special Values

What This Concept Is

IEEE 754 is the binary floating-point standard that virtually every mainstream CPU implements; on those platforms C's float and double are its 32-bit and 64-bit formats. Each value is split into three fields:

sign (1 bit): 0 for non-negative, 1 for negative
exponent (8 bits for float, 11 for double), stored with a bias (127 for float, 1023 for double)
mantissa a.k.a. significand (23 bits for float, 52 for double), holding the fraction bits xxx... that follow an implicit leading 1 in 1.xxx...

The value of a normal number is:

(-1)^sign * 1.mantissa_bits * 2^(exponent - bias)

Two reserved values of the exponent field, all-zeros and all-ones, do not follow the normal-number formula. Together with the mantissa they encode three special cases:

all-zeros exponent: zero (mantissa zero) or subnormal (mantissa nonzero)
all-ones exponent with zero mantissa: +Inf or -Inf
all-ones exponent with nonzero mantissa: NaN (not a number)

Why It Matters Here

You need this whenever you touch numbers that are not integers:

finance, simulation, machine learning, graphics, networking timers
bit-level tricks like fast inverse square root
wire formats and serialization (where endianness and padding matter)
debugging tests that compare floats with == and fail mysteriously

If you treat a float as an exact decimal, you will ship bugs.

Concrete Example

Encoding 0.15625 as a 32-bit float:

0.15625 = 1/8 + 1/32 = 0.00101_2 = 1.01_2 * 2^-3.

sign = 0
exponent = -3 + 127 = 124 = 0b01111100
mantissa (23 bits) = 01000000000000000000000

Concatenated: 0 01111100 01000000000000000000000 = 0x3E200000.

And 0.1 has no finite binary expansion: 0.1_10 = 0.00011001100110011..._2, so it is rounded when stored. The same is true of 0.2 and 0.3, and the rounding errors do not cancel. In double arithmetic, 0.1 + 0.2 does not exactly equal 0.3:

printf("%.20f\n", 0.1 + 0.2);  /* 0.30000000000000004441 */

That is the canonical "floating point is not decimal" surprise.

Common Confusion / Misconception

"Floats are just inexact doubles." They are a separate format with less precision and a narrower range: float holds about 7 significant decimal digits, double about 15-17. Widening float to double is exact, but narrowing double to float rounds, so values that bounce between the two types can silently lose precision.

"I should test x == 0.0." Often wrong. Use a tolerance: fabs(x) < epsilon where epsilon is chosen from the problem. NaN breaks == entirely: NaN == NaN is false. Use isnan(x).

"Floating-point addition is associative." No. (a + b) + c != a + (b + c) can hold for finite non-special values because rounding depends on magnitude.

How To Use It

Whenever floats show up:

Know the format (float 32-bit, double 64-bit) and the rough precision.
Compare with a tolerance, not with ==. Use isnan, isinf, isfinite.
Avoid subtracting nearly equal numbers ("catastrophic cancellation") when you can.
For wire formats, send the bit pattern (uint32_t / uint64_t) with an agreed endianness.

Check Yourself

Why is 1.0 exact but 0.1 not?
What does the bit pattern 0x7F800000 mean as a float?
Why is NaN == NaN false?

Mini Drill or Application

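#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f = 0.15625f;
    uint32_t bits;
    /* memcpy copies the raw bytes without violating strict aliasing. */
    memcpy(&bits, &f, sizeof bits);
    /* The cast makes the argument match %x, which expects unsigned int. */
    printf("%.6f -> 0x%08x\n", f, (unsigned)bits);

    /* 0/0 yields NaN under IEEE 754; NaN compares unequal even to itself. */
    float nan = 0.0f / 0.0f;
    printf("nan == nan is %s\n", (nan == nan) ? "true" : "false");
    return 0;
}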

Compile: gcc -Wall -Wextra -o ieee ieee.c. Predict the hex pattern before running. Change the value to -0.15625f and to 0.1f; explain each hex result.
