Floating-Point IEEE 754: Sign, Exponent, Mantissa, Special Values

What This Concept Is

IEEE 754 is the standard that defines the binary floating-point formats behind float and double on essentially every mainstream CPU. Each value is split into three fields:

  • sign (1 bit): 0 for non-negative, 1 for negative
  • exponent (8 bits for float, 11 for double), stored with a bias (127 for float, 1023 for double)
  • mantissa, a.k.a. significand (23 bits for float, 52 for double): the fraction bits that follow an implicit leading 1, as in 1.xxx...

The value of a normal number is:

(-1)^sign * 1.mantissa_bits * 2^(exponent - bias)

Two values of the exponent field, all zeros and all ones, are reserved for special cases:

  • all-zeros exponent: zero (mantissa zero) or subnormal (mantissa nonzero)
  • all-ones exponent with zero mantissa: +Inf or -Inf
  • all-ones exponent with nonzero mantissa: NaN (not a number)
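The field layout and the special exponent values can be checked directly in code. The following is a minimal sketch, assuming float is the 32-bit IEEE 754 format on your platform; f32_fields, f32_split, and f32_class are hypothetical helper names, not part of any library.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three raw fields of a 32-bit float. */
struct f32_fields {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, still biased by 127 */
    uint32_t mantissa;  /* 23 bits, without the implicit leading 1 */
};

/* Split a float into its raw fields (assumes IEEE 754 binary32). */
static struct f32_fields f32_split(float f) {
    uint32_t bits;
    struct f32_fields out;
    memcpy(&bits, &f, sizeof bits);        /* copy out the bit pattern */
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.mantissa = bits & 0x7FFFFFu;
    return out;
}

/* Classify a float using only its exponent and mantissa fields. */
static const char *f32_class(float f) {
    struct f32_fields p = f32_split(f);
    if (p.exponent == 0)
        return p.mantissa == 0 ? "zero" : "subnormal";
    if (p.exponent == 0xFF)
        return p.mantissa == 0 ? "infinity" : "nan";
    return "normal";
}
```

For example, f32_split(1.0f) should report a biased exponent of 127 and a zero mantissa, since 1.0 = 1.0 * 2^0. In real code the standard fpclassify macro from <math.h> does the classification; the sketch only spells out the rule.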

Why It Matters Here

You need this whenever you touch numbers that are not integers:

  • finance, simulation, machine learning, graphics, networking timers
  • bit-level tricks like fast inverse square root
  • wire formats and serialization (where endianness and padding matter)
  • debugging tests that compare floats with == and fail mysteriously

If you treat a float as an exact decimal, you will ship bugs.

Concrete Example

Encoding 0.15625 as a 32-bit float:

0.15625 = 1/8 + 1/32 = 0.00101_2 = 1.01_2 * 2^-3.

  • sign = 0
  • exponent = -3 + 127 = 124 = 0b01111100
  • mantissa (23 bits) = 01000000000000000000000

Concatenated: 0 01111100 01000000000000000000000 = 0x3E200000.
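The hand encoding above can be reproduced by packing the three fields into a 32-bit word. This is a sketch under the assumption that float is IEEE 754 binary32; f32_from_fields and f32_bits are hypothetical names chosen for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Build a float from a sign bit, a biased exponent, and 23 mantissa bits. */
static float f32_from_fields(uint32_t sign, uint32_t biased_exp, uint32_t mantissa) {
    uint32_t bits = ((sign & 1u) << 31)
                  | ((biased_exp & 0xFFu) << 23)
                  | (mantissa & 0x7FFFFFu);
    float f;
    memcpy(&f, &bits, sizeof f);   /* reinterpret the bits as a float */
    return f;
}

/* Return the raw bit pattern of a float. */
static uint32_t f32_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

With sign 0, biased exponent 124, and mantissa 0x200000 (the 23-bit string 0100...0), f32_from_fields should return 0.15625f, and f32_bits(0.15625f) should return 0x3E200000.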

And 0.1 has no finite binary expansion: 0.1_10 = 0.00011001100110011..._2. Stored as a double, it is rounded to the nearest representable value, and so are 0.2 and 0.3. The rounded 0.1 plus the rounded 0.2 does not equal the rounded 0.3:

printf("%.20f\n", 0.1 + 0.2);  /* 0.30000000000000004441 */

That is the canonical "floating point is not decimal" surprise.

Common Confusion / Misconception

"Floats are just inexact doubles." They are a different precision with different range: float has ~7 decimal digits of precision, double ~15-17. Silent float <-> double conversions lose precision at every boundary.

"I should test x == 0.0." Often wrong. Use a tolerance: fabs(x) < epsilon where epsilon is chosen from the problem. NaN breaks == entirely: NaN == NaN is false. Use isnan(x).

"Floating-point addition is associative." No. (a + b) + c != a + (b + c) can hold for finite non-special values because rounding depends on magnitude.

How To Use It

Whenever floats show up:

  1. Know the format (float 32-bit, double 64-bit) and the rough precision.
  2. Compare with a tolerance, not with ==. Use isnan, isinf, isfinite.
  3. Avoid subtracting nearly equal numbers ("catastrophic cancellation") when you can.
  4. For wire formats, send the bit pattern (uint32_t / uint64_t) with an agreed endianness.
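Step 4 can be sketched as follows, assuming float is IEEE 754 binary32 and that both sides have agreed on big-endian byte order. f32_to_wire and f32_from_wire are hypothetical helpers, not functions from any networking library.

```c
#include <stdint.h>
#include <string.h>

/* Write a float to 4 bytes, most significant byte first (big-endian). */
static void f32_to_wire(float f, unsigned char out[4]) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    out[0] = (unsigned char)(bits >> 24);
    out[1] = (unsigned char)(bits >> 16);
    out[2] = (unsigned char)(bits >> 8);
    out[3] = (unsigned char)bits;
}

/* Read a float back from 4 big-endian bytes. */
static float f32_from_wire(const unsigned char in[4]) {
    uint32_t bits = ((uint32_t)in[0] << 24)
                  | ((uint32_t)in[1] << 16)
                  | ((uint32_t)in[2] << 8)
                  |  (uint32_t)in[3];
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Because the bytes are produced with shifts rather than by copying the float's memory directly, the result is the same on little-endian and big-endian hosts. 0.15625f should serialize to the bytes 3E 20 00 00.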

Check Yourself

  1. Why is 1.0 exact but 0.1 not?
  2. What does the bit pattern 0x7F800000 mean as a float?
  3. Why is NaN == NaN false?

Mini Drill or Application

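#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = 0.15625f;
    uint32_t bits;

    /* memcpy is the portable way to read a float's bit pattern */
    memcpy(&bits, &f, sizeof bits);
    /* PRIx32 is the matching printf format for uint32_t */
    printf("%.6f -> 0x%08" PRIx32 "\n", f, bits);

    /* 0.0f / 0.0f yields NaN on IEEE 754 platforms */
    float nan_value = 0.0f / 0.0f;
    printf("nan == nan is %s\n", (nan_value == nan_value) ? "true" : "false");
    return 0;
}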

Compile: gcc -Wall -Wextra -o ieee ieee.c. Predict the hex pattern before running. Change the value to -0.15625f and to 0.1f; explain each hex result.

Read This Only If Stuck