
Floating-Point Representations

Floating-point representations are a fundamental part of neural network computation, as they define how real numbers are stored and manipulated in computer hardware. The most common formats used in deep learning are FP32 (single-precision), FP16 (half-precision), and BF16 (bfloat16). Each format encodes a real number using three main components: the mantissa (also called significand), the exponent, and the bias.

Mathematically, any normalized floating-point number can be represented as:

\text{value} = (-1)^{\text{sign}} \times 1.\text{mantissa} \times 2^{(\text{exponent} - \text{bias})}
  • The sign is a single bit that determines whether the number is positive or negative;
  • The mantissa holds the significant digits of the number;
  • The exponent determines the scale (how large or small the number can be);
  • The bias is a constant subtracted from the stored exponent when decoding (127 for FP32 and BF16, 15 for FP16), which allows both positive and negative exponents to be encoded in an unsigned field.
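
As a quick sanity check of this formula, the Python sketch below (standard library only) unpacks the raw bits of an FP32 value into its sign, exponent, and mantissa fields and rebuilds the number from them. The helper name decode_fp32 is purely illustrative, and only normalized values are handled.

```python
import struct

def decode_fp32(x: float):
    """Split x into IEEE 754 single-precision fields and rebuild its value
    from value = (-1)^sign * 1.mantissa * 2^(exponent - bias).
    Only normalized numbers are handled here."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw 32-bit pattern
    sign = bits >> 31                                      # 1 sign bit
    exponent = (bits >> 23) & 0xFF                         # 8 biased exponent bits
    mantissa = bits & 0x7FFFFF                             # 23 mantissa bits
    bias = 127
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - bias)
    return sign, exponent, mantissa, value

print(decode_fp32(-6.25))   # (1, 129, 4718592, -6.25): -1.5625 * 2^(129 - 127)
```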

In FP32, the IEEE 754 standard allocates 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. FP16 uses 1 sign bit, 5 exponent bits, and 10 mantissa bits. BF16, designed for machine learning, uses 1 sign bit, 8 exponent bits (like FP32), but only 7 mantissa bits. This structure directly impacts both the range of values (dynamic range) and the smallest difference between representable numbers (precision).
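
Because BF16 keeps FP32's sign and exponent layout and differs only in having 16 fewer mantissa bits, a rough way to see the effect of its 7-bit mantissa is to zero out the low 16 bits of an FP32 pattern. This is only a sketch: real conversions usually round to nearest rather than truncate, and bf16_truncate is a hypothetical helper name.

```python
import struct

def bf16_truncate(x: float) -> float:
    """Keep the sign bit, the 8 exponent bits, and the top 7 mantissa bits of
    the FP32 pattern; zero the remaining 16 bits (truncation, not rounding)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(bf16_truncate(3.14159265))   # 3.140625 -- only about 3 significant decimal digits survive
```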

The dynamic range of a floating-point format describes the range between the smallest and largest numbers it can represent. Precision refers to how finely numbers can be distinguished within that range. FP32, with its 23 mantissa bits and 8 exponent bits, can represent normalized numbers as small as approximately 1.18 Γ— 10^{-38} and as large as 3.4 Γ— 10^{38}. FP16, with fewer exponent and mantissa bits, covers a much smaller range: from about 6.1 Γ— 10^{-5} to 6.5 Γ— 10^{4}. BF16, thanks to its 8 exponent bits, matches FP32's range (up to approximately 3.4 Γ— 10^{38}) but offers lower precision because it has only 7 mantissa bits.
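
For the formats NumPy supports natively, these constants can be read directly from np.finfo; BF16 is not a built-in NumPy dtype, so it is only noted in a comment.

```python
import numpy as np

# Query the range and precision constants for the natively supported formats.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(dtype.__name__, "min normal:", info.tiny, "max:", info.max, "eps:", info.eps)
# float32 min normal: ~1.18e-38, max: ~3.40e+38, eps: ~1.19e-07
# float16 min normal: ~6.10e-05, max: 65504.0,   eps: ~9.77e-04
# BF16 requires a third-party dtype package (e.g. ml_dtypes) to query the same way.
```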

  • The smallest positive normalized value in FP32 is calculated as:

    min_{normal} = 2^{(1 - bias)} = 2^{(1 - 127)} = 2^{-126} β‰ˆ 1.18 Γ— 10^{-38};
  • The largest value in FP16 is:

    max = (2 - 2^{-10}) Γ— 2^{15} β‰ˆ 6.55 Γ— 10^{4};
  • Precision is determined by the mantissa. In FP32, the smallest difference between two numbers near 1.0 is approximately 2^{-23} β‰ˆ 1.19 Γ— 10^{-7};

  • In FP16, it is 2^{-10} β‰ˆ 9.77 Γ— 10^{-4};

  • BF16's smallest step is 2^{-7} β‰ˆ 7.81 Γ— 10^{-3}, much coarser than FP32 or FP16.

This means that while BF16 can represent very large or very small numbers, it cannot distinguish between values as finely as FP32 or even FP16.
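
The same numbers fall out of the bit widths alone. The short sketch below recomputes the values quoted above (minimum normalized value, maximum finite value, and step size near 1.0) from each format's exponent and mantissa bit counts:

```python
formats = {
    # name: (exponent bits, mantissa bits)
    "FP32": (8, 23),
    "FP16": (5, 10),
    "BF16": (8, 7),
}

for name, (e_bits, m_bits) in formats.items():
    bias = 2 ** (e_bits - 1) - 1                        # 127 for FP32/BF16, 15 for FP16
    min_normal = 2.0 ** (1 - bias)                      # smallest positive normalized value
    max_finite = (2 - 2.0 ** (-m_bits)) * 2.0 ** bias   # largest finite value
    step_near_1 = 2.0 ** (-m_bits)                      # spacing between numbers near 1.0
    print(f"{name}: min normal {min_normal:.3g}, max {max_finite:.3g}, "
          f"step near 1.0 {step_near_1:.3g}")
```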

Definition

In floating-point arithmetic, underflow occurs when a number is too small in magnitude to be represented as a normalized value, resulting in zero or a subnormal number. Overflow happens when a number is too large to be represented, resulting in infinity. Subnormal numbers (also called denormals) fill the gap between zero and the smallest normalized value, allowing for gradual loss of precision instead of an abrupt jump to zero.
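
A small NumPy FP16 example makes these three cases concrete; the specific values are chosen only for illustration.

```python
import numpy as np

big = np.float16(60000.0)
print(big * np.float16(2.0))   # inf: 120000 exceeds FP16's max of ~65504 (overflow;
                               # NumPy may also emit a RuntimeWarning)

small = np.float16(6.0e-5)     # just below the smallest normalized FP16 value (~6.1e-5)
print(small)                   # stored as a subnormal, with reduced precision

print(np.float16(1e-8))        # 0.0: smaller than the smallest subnormal (~6e-8), underflows
```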

Choosing a floating-point format for neural network computations involves a trade-off between stability and precision. More mantissa bits (as in FP32) allow for higher precision, reducing round-off errors and improving numerical stability during training, especially in deep or sensitive models. Fewer mantissa bits (as in FP16 or BF16) increase the risk of rounding errors, which can accumulate and destabilize learning, but they reduce memory and computational costs.
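
A minimal illustration of how FP16 round-off can accumulate, assuming NumPy is available (the step of 1e-4 and the count of 10,000 are arbitrary choices): the exact sum is 1.0, but the FP16 running total eventually becomes so large that a 1e-4 increment no longer registers at all.

```python
import numpy as np

total32 = np.float32(0.0)
total16 = np.float16(0.0)
for _ in range(10_000):
    total32 = total32 + np.float32(1e-4)   # each addition rounded to FP32
    total16 = total16 + np.float16(1e-4)   # each addition rounded to FP16

print(total32)   # β‰ˆ 1.0  -- only a small accumulated rounding error
print(total16)   # β‰ˆ 0.25 -- the sum stalls once 1e-4 is below half the local FP16 spacing
```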

The number of exponent bits determines the dynamic range. FP32 and BF16, with 8 exponent bits, can represent much larger and smaller numbers than FP16, making them better suited for models with large activations or gradients. However, BF16's limited mantissa precision means that, while it avoids the overflow and underflow FP16 would hit at the same magnitudes, it may still lose fine detail in calculations.
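
The following sketch contrasts the two 16-bit formats on both axes using PyTorch, which provides a bfloat16 dtype (the values 1e5 and 1e-3 are chosen only for illustration):

```python
import torch

# Range: 1e5 overflows FP16 (max ~65504) but fits easily within BF16's FP32-like range.
print(torch.tensor(1e5).to(torch.float16))    # inf
print(torch.tensor(1e5).to(torch.bfloat16))   # 99840. -- in range, but coarsely rounded

# Precision near 1.0: BF16's step (~7.8e-3) swallows a 1e-3 update; FP16's (~9.8e-4) keeps it.
print(torch.tensor(1.0, dtype=torch.bfloat16) + 1e-3)   # 1.0     -- the update is lost
print(torch.tensor(1.0, dtype=torch.float16) + 1e-3)    # ~1.001  -- the update survives
```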

Bit allocation in each format thus directly influences both the stability of training and the efficiency of inference. Understanding these trade-offs is essential for selecting the right format for a given neural network workload.

