
Fixed-Point and Integer Quantization

When you switch from floating-point to fixed-point or integer representations in neural network quantization, you fundamentally change how numbers are stored and computed. In fixed-point and integer formats such as INT8 and INT4, values are represented using a fixed number of bits. For INT8, each number is stored in 8 bits, while INT4 uses only 4 bits. The bit allocation directly determines the range of representable values:

  • Signed INT8 can encode values from -128 to 127;
  • Unsigned INT8 covers 0 to 255;
  • Signed INT4 gives a range of -8 to 7;
  • Unsigned INT4 covers 0 to 15.

This limited range means you can only represent a small set of possible numbers, so careful mapping from real values is crucial.
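The ranges above follow directly from the bit width. A minimal sketch of the rule (the helper name `int_range` is illustrative, not from any particular library):

```python
def int_range(bits: int, signed: bool) -> tuple[int, int]:
    """Return the (min, max) representable values for a given bit width.

    Signed formats use two's complement, so one bit goes to the sign:
    the range is [-2^(bits-1), 2^(bits-1) - 1]. Unsigned formats use
    all bits for magnitude: [0, 2^bits - 1].
    """
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

print(int_range(8, signed=True))    # (-128, 127)
print(int_range(8, signed=False))   # (0, 255)
print(int_range(4, signed=True))    # (-8, 7)
print(int_range(4, signed=False))   # (0, 15)
```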

To bridge the gap between the wide range of real numbers and the limited set of integers, quantization uses scaling factors and zero-points. The scaling factor, often denoted as s, stretches or shrinks the integer range to cover the range of real values needed. The zero-point, z, shifts the integer range so that real zero can be exactly represented, which is especially important for asymmetric quantization. The mapping from a real value r to a quantized integer q can be described by the equation:

q = round(r / s) + z

To recover the real value from a quantized integer, you use:

r ≈ s * (q - z)

This process lets you approximate real numbers using only integer arithmetic, which is much faster and more efficient on many hardware platforms.
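The two equations above translate directly into a quantize/dequantize pair. Here is a minimal sketch; the clamp to `[qmin, qmax]` is needed in practice because `round(r / s) + z` can fall outside the integer range, and the example values for `s` and `z` are hypothetical:

```python
def quantize(r: float, s: float, z: int, qmin: int, qmax: int) -> int:
    """Map a real value r to an integer: q = round(r / s) + z, clamped to range."""
    q = round(r / s) + z
    return max(qmin, min(qmax, q))

def dequantize(q: int, s: float, z: int) -> float:
    """Recover an approximation of the real value: r ≈ s * (q - z)."""
    return s * (q - z)

# Round-trip with hypothetical parameters s = 0.05, z = 10 on signed INT8.
s, z = 0.05, 10
r = 1.234
q = quantize(r, s, z, -128, 127)       # round(24.68) + 10 = 35
r_hat = dequantize(q, s, z)            # 0.05 * (35 - 10) = 1.25
print(q, r_hat, abs(r - r_hat))        # the error is at most s / 2
```

Note that the round trip does not return `r` exactly: the residual `|r - r_hat|` is the quantization error, bounded by half the scaling factor when no clamping occurs.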

Note

Symmetric quantization uses a zero-point of zero, so the integer range is centered around zero and the mapping simplifies to q = round(r / s). Asymmetric quantization allows the zero-point to be nonzero, so the integer range can be shifted to better align with the distribution of real values, especially when zero is not centered in the data. Symmetric quantization is mathematically simpler, but asymmetric quantization can reduce error if the real value range is not symmetric around zero.
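One common way to derive the parameters from observed value ranges is sketched below. This is a simplified illustration, not a specific library's API: symmetric quantization scales the largest absolute value to the top of the signed range, while asymmetric quantization stretches [r_min, r_max] over an unsigned range and solves for the zero-point so that real zero maps exactly to an integer:

```python
def symmetric_params(r_max_abs: float, bits: int = 8) -> tuple[float, int]:
    """Symmetric: zero-point is 0; scale covers [-|r|_max, |r|_max]."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for signed INT8
    return r_max_abs / qmax, 0

def asymmetric_params(r_min: float, r_max: float, bits: int = 8) -> tuple[float, int]:
    """Asymmetric: shift the zero-point so real zero lands on an integer."""
    qmin, qmax = 0, 2 ** bits - 1       # unsigned range assumed here
    s = (r_max - r_min) / (qmax - qmin)
    z = round(qmin - r_min / s)         # solves s * (z - qmin) = -r_min
    return s, z

# A skewed range like [-1, 3] wastes half the symmetric grid on values
# that never occur; the asymmetric zero-point shifts the grid to fit.
print(symmetric_params(3.0))            # scale ~0.0236, zero-point 0
print(asymmetric_params(-1.0, 3.0))     # scale ~0.0157, zero-point 64
```

With the asymmetric parameters, real zero quantizes to exactly q = 64, so it is represented without error, which matters for operations such as zero-padding.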

Choosing between INT8 and INT4 quantization brings important trade-offs. INT8 offers a wider range and finer precision, allowing you to represent values more accurately and with less quantization error. It is generally easier to implement and less likely to degrade model accuracy, but it uses more memory and computational resources than INT4. INT4, on the other hand, dramatically reduces storage and computation costs, but at the expense of a much narrower range and greater risk of quantization error. INT4 quantization is best suited for models or layers where the value distribution can be tightly bounded and small errors are acceptable. The decision depends on your accuracy requirements, hardware constraints, and tolerance for implementation complexity.
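The INT8-vs-INT4 trade-off can be made concrete by measuring round-trip error on the same data at both bit widths. A minimal sketch, assuming symmetric quantization and synthetic Gaussian-distributed values (the helper `quant_error` is illustrative):

```python
import random

def quant_error(values: list[float], bits: int) -> float:
    """Mean absolute round-trip error under symmetric quantization."""
    qmax = 2 ** (bits - 1) - 1
    s = max(abs(v) for v in values) / qmax   # scale covering the full range
    total = 0.0
    for r in values:
        q = max(-qmax - 1, min(qmax, round(r / s)))
        total += abs(r - s * q)
    return total / len(values)

random.seed(0)
vals = [random.gauss(0.0, 1.0) for _ in range(1000)]
print(quant_error(vals, 8))   # finer grid (255 levels) -> small error
print(quant_error(vals, 4))   # coarser grid (15 levels) -> larger error
```

Because INT4 has only 15 usable levels against INT8's 255, its scaling factor, and hence its per-value error, is roughly 17x larger for the same range, which is why INT4 is reserved for tightly bounded distributions.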

