Floating-Point Representations
Floating-point representations are a fundamental part of neural network computation, as they define how real numbers are stored and manipulated in computer hardware. The most common formats used in deep learning are FP32 (single-precision), FP16 (half-precision), and BF16 (bfloat16). Each format encodes a real number using three main components: the mantissa (also called significand), the exponent, and the bias.
Mathematically, any normalized floating-point number can be represented as:
$$\text{value} = (-1)^{\text{sign}} \times 1.\text{mantissa} \times 2^{\,\text{exponent} - \text{bias}}$$
- The sign is a single bit that determines whether the number is positive or negative;
- The mantissa holds the significant digits of the number;
- The exponent determines the scale (how large or small the number can be);
- The bias is a constant added to the exponent to allow both positive and negative exponents.
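To make the formula concrete, the sketch below unpacks the raw bits of an FP32 value and rebuilds it from sign, exponent, and mantissa. The helper name `decode_fp32` is purely illustrative (it is not a library function), and it only handles normalized values.

```python
import struct

def decode_fp32(x: float) -> float:
    """Rebuild a normalized FP32 value as (-1)^sign * 1.mantissa * 2^(exponent - bias)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = (bits >> 31) & 0x1        # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits (stored with the bias added)
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits
    bias = 127
    # Normalized numbers have an implicit leading 1 before the mantissa bits.
    significand = 1.0 + mantissa / 2**23
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)

print(decode_fp32(6.25))   # 6.25 — the reconstruction matches the input
print(decode_fp32(-0.1))   # ≈ -0.1 (the nearest representable FP32 value)
```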
In FP32, the IEEE 754 standard allocates 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. FP16 uses 1 sign bit, 5 exponent bits, and 10 mantissa bits. BF16, designed for machine learning, uses 1 sign bit, 8 exponent bits (like FP32), but only 7 mantissa bits. This structure directly impacts both the range of values (dynamic range) and the smallest difference between representable numbers (precision).
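As a quick check of these bit allocations, PyTorch's `torch.finfo` reports the key properties of each dtype. This is a small sketch; the printed formatting is ours, and exact values may vary slightly in presentation across versions.

```python
import torch

# Compare the three formats most common in deep learning.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(
        f"{str(dtype):15s} bits={info.bits:2d} "
        f"eps={info.eps:.3e} tiny={info.tiny:.3e} max={info.max:.3e}"
    )
# eps  — spacing between 1.0 and the next representable number (precision)
# tiny — smallest positive normalized value (lower end of the dynamic range)
# max  — largest finite value (upper end of the dynamic range)
```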
The dynamic range of a floating-point format describes the range between the smallest and largest numbers it can represent. Precision refers to how finely numbers can be distinguished within that range. FP32, with its 23 mantissa bits and 8 exponent bits, can represent numbers as small as approximately $1.18 \times 10^{-38}$ and as large as $3.4 \times 10^{38}$. FP16, with fewer exponent and mantissa bits, covers a smaller range: from about $6.1 \times 10^{-5}$ to $6.5 \times 10^{4}$. BF16, thanks to its 8 exponent bits, matches FP32's range (up to about $3.4 \times 10^{38}$) but with lower precision due to only 7 mantissa bits.
- The smallest positive normalized value in FP32 is calculated as: $\min_{\text{normal}} = 2^{(1 - \text{bias})} = 2^{(1 - 127)} \approx 1.18 \times 10^{-38}$;
- The largest value in FP16 is: $\max = (2 - 2^{-10}) \times 2^{15} \approx 6.55 \times 10^{4}$;
- Precision is determined by the mantissa. In FP32, the smallest difference between two numbers near 1.0 is approximately $2^{-23} \approx 1.19 \times 10^{-7}$;
- In FP16, it is $2^{-10} \approx 9.77 \times 10^{-4}$;
- BF16's smallest step is $2^{-7} \approx 7.81 \times 10^{-3}$, much coarser than FP32 or FP16.
This means that while BF16 can represent very large or very small numbers, it cannot distinguish between values as finely as FP32 or even FP16.
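The effect of the step size is easy to observe directly: an increment well below half a step near 1.0 is simply rounded away. A small sketch, using NumPy for FP16 and PyTorch for BF16 (NumPy has no native bfloat16):

```python
import numpy as np
import torch

# FP16: the step near 1.0 is 2**-10, so an increment well below half a step is lost.
print(np.float16(1.0) + np.float16(2**-12))   # 1.0     (increment rounded away)
print(np.float16(1.0) + np.float16(2**-10))   # ≈ 1.001 (next representable FP16 value)

# BF16: the step near 1.0 is 2**-7, so even 2**-9 disappears.
one = torch.tensor(1.0, dtype=torch.bfloat16)
print(one + 2**-9)   # tensor(1., dtype=torch.bfloat16)     — increment lost
print(one + 2**-7)   # tensor(1.0078, dtype=torch.bfloat16) — next representable value
```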
In floating-point arithmetic, underflow occurs when a number is too small in magnitude to be represented as a normalized value, resulting in zero or a subnormal number. Overflow happens when a number is too large to be represented, resulting in infinity. Subnormal numbers (also called denormals) fill the gap between zero and the smallest normalized value, allowing for gradual loss of precision instead of an abrupt jump to zero.
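FP16 makes all three effects easy to trigger, as in the sketch below (the specific constants are just convenient examples):

```python
import numpy as np

# Overflow: FP16's largest finite value is about 65504, so 70000 becomes inf.
print(np.float16(70000.0))   # inf

# Subnormal: 1e-5 is below the smallest normalized FP16 value (~6.1e-5),
# but it can still be stored, with reduced precision, as a subnormal.
print(np.float16(1e-5))      # ≈ 1.001e-05

# Underflow: 1e-8 is below the smallest FP16 subnormal (~6.0e-8), so it becomes 0.
print(np.float16(1e-8))      # 0.0
```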
Choosing a floating-point format for neural network computations involves a trade-off between stability and precision. More mantissa bits (as in FP32) allow for higher precision, reducing round-off errors and improving numerical stability during training, especially in deep or sensitive models. Fewer mantissa bits (as in FP16 or BF16) increase the risk of rounding errors, which can accumulate and destabilize learning, but they reduce memory and computational costs.
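One way these rounding errors accumulate is during long summations, such as when averaging losses or gradients. The sketch below is a naive illustration (the helper `naive_sum` is hypothetical, not how frameworks actually accumulate): it sums 10,000 small increments in FP32 and FP16.

```python
import numpy as np

def naive_sum(increment: float, n: int, dtype) -> float:
    """Accumulate n copies of increment in the given dtype, rounding at every step."""
    total = dtype(0.0)
    for _ in range(n):
        total = dtype(total + dtype(increment))
    return float(total)

# The exact answer is 10.0.
print(naive_sum(0.001, 10_000, np.float32))  # ≈ 10.0 (small rounding error)
print(naive_sum(0.001, 10_000, np.float16))  # stalls far below 10: once the running
# sum is large enough, adding 0.001 falls below half an FP16 step and is lost.
```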
The number of exponent bits determines the dynamic range. FP32 and BF16, with 8 exponent bits, can represent much larger and smaller numbers than FP16, making them better suited for models with large activations or gradients. However, BF16's low mantissa precision means that, while it can avoid overflow and underflow, it may still lose fine detail in calculations.
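The difference in exponent bits shows up as soon as activations or gradients get large. A minimal sketch in PyTorch (the value 1e5 is just an arbitrary large example):

```python
import torch

x = torch.tensor(1.0e5)      # a large activation/gradient value

print(x.to(torch.float16))   # inf     — exceeds FP16's max of ~65504 (overflow)
print(x.to(torch.bfloat16))  # ≈ 99840 — within range, but rounded to the nearest
                             #           BF16 value (only 7 mantissa bits)
```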
Bit allocation in each format thus directly influences both the stability of training and the efficiency of inference. Understanding these trade-offs is essential for selecting the right format for a given neural network workload.