
Fixed-Point and Integer Quantization

When you switch from floating-point to fixed-point or integer representations in neural network quantization, you fundamentally change how numbers are stored and computed. In fixed-point and integer formats such as INT8 and INT4, values are represented using a fixed number of bits. For INT8, each number is stored in 8 bits, while INT4 uses only 4 bits. The bit allocation directly determines the range of representable values:

  • Signed INT8 can encode values from -128 to 127;
  • Unsigned INT8 covers 0 to 255;
  • Signed INT4 gives a range of -8 to 7;
  • Unsigned INT4 covers 0 to 15.

This limited range means you can only represent a small set of possible numbers, so careful mapping from real values is crucial.
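The ranges above follow directly from the bit width. A minimal sketch of the rule (the helper name `int_range` is illustrative, not from any particular library):

```python
def int_range(bits: int, signed: bool) -> tuple[int, int]:
    """Return the (min, max) representable values for a given bit width.

    Signed formats use two's complement, so one bit goes to the sign:
    the range is [-2^(bits-1), 2^(bits-1) - 1]. Unsigned formats use
    all bits for magnitude: [0, 2^bits - 1].
    """
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

print(int_range(8, signed=True))    # (-128, 127)
print(int_range(8, signed=False))   # (0, 255)
print(int_range(4, signed=True))    # (-8, 7)
print(int_range(4, signed=False))   # (0, 15)
```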

To bridge the gap between the wide range of real numbers and the limited set of integers, quantization uses scaling factors and zero-points. The scaling factor, often denoted as s, stretches or shrinks the integer range to cover the range of real values needed. The zero-point, z, shifts the integer range so that real zero can be exactly represented, which is especially important for asymmetric quantization. The mapping from a real value r to a quantized integer q can be described by the equation:

q = round(r / s) + z

To recover the real value from a quantized integer, you use:

r ≈ s * (q - z)

This process lets you approximate real numbers using only integer arithmetic, which is much faster and more efficient on many hardware platforms.
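The two equations above translate directly into a quantize/dequantize pair. Here is a minimal sketch; the clamp to `[qmin, qmax]` is needed in practice because `round(r / s) + z` can fall outside the integer range, and the example values for `s` and `z` are hypothetical:

```python
def quantize(r: float, s: float, z: int, qmin: int, qmax: int) -> int:
    """Map a real value r to an integer: q = round(r / s) + z, clamped to range."""
    q = round(r / s) + z
    return max(qmin, min(qmax, q))

def dequantize(q: int, s: float, z: int) -> float:
    """Recover an approximation of the real value: r ≈ s * (q - z)."""
    return s * (q - z)

# Round-trip with hypothetical parameters s = 0.05, z = 10 on signed INT8.
s, z = 0.05, 10
r = 1.234
q = quantize(r, s, z, -128, 127)       # round(24.68) + 10 = 35
r_hat = dequantize(q, s, z)            # 0.05 * (35 - 10) = 1.25
print(q, r_hat, abs(r - r_hat))        # the error is at most s / 2
```

Note that the round trip does not return `r` exactly: the residual `|r - r_hat|` is the quantization error, bounded by half the scaling factor when no clamping occurs.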

Note

Symmetric quantization uses a zero-point of zero, so the integer range is centered around zero and the mapping simplifies to q = round(r / s). Asymmetric quantization allows the zero-point to be nonzero, so the integer range can be shifted to better align with the distribution of real values, especially when zero is not centered in the data. Symmetric quantization is mathematically simpler, but asymmetric quantization can reduce error if the real value range is not symmetric around zero.
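One common way to derive the parameters from observed value ranges is sketched below. This is a simplified illustration, not a specific library's API: symmetric quantization scales the largest absolute value to the top of the signed range, while asymmetric quantization stretches [r_min, r_max] over an unsigned range and solves for the zero-point so that real zero maps exactly to an integer:

```python
def symmetric_params(r_max_abs: float, bits: int = 8) -> tuple[float, int]:
    """Symmetric: zero-point is 0; scale covers [-|r|_max, |r|_max]."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for signed INT8
    return r_max_abs / qmax, 0

def asymmetric_params(r_min: float, r_max: float, bits: int = 8) -> tuple[float, int]:
    """Asymmetric: shift the zero-point so real zero lands on an integer."""
    qmin, qmax = 0, 2 ** bits - 1       # unsigned range assumed here
    s = (r_max - r_min) / (qmax - qmin)
    z = round(qmin - r_min / s)         # solves s * (z - qmin) = -r_min
    return s, z

# A skewed range like [-1, 3] wastes half the symmetric grid on values
# that never occur; the asymmetric zero-point shifts the grid to fit.
print(symmetric_params(3.0))            # scale ~0.0236, zero-point 0
print(asymmetric_params(-1.0, 3.0))     # scale ~0.0157, zero-point 64
```

With the asymmetric parameters, real zero quantizes to exactly q = 64, so it is represented without error, which matters for operations such as zero-padding.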

Choosing between INT8 and INT4 quantization brings important trade-offs. INT8 offers a wider range and finer precision, allowing you to represent values more accurately and with less quantization error. It is generally easier to implement and less likely to degrade model accuracy, but it uses more memory and computational resources than INT4. INT4, on the other hand, dramatically reduces storage and computation costs, but at the expense of a much narrower range and greater risk of quantization error. INT4 quantization is best suited for models or layers where the value distribution can be tightly bounded and small errors are acceptable. The decision depends on your accuracy requirements, hardware constraints, and tolerance for implementation complexity.
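The INT8-vs-INT4 trade-off can be made concrete by measuring round-trip error on the same data at both bit widths. A minimal sketch, assuming symmetric quantization and synthetic Gaussian-distributed values (the helper `quant_error` is illustrative):

```python
import random

def quant_error(values: list[float], bits: int) -> float:
    """Mean absolute round-trip error under symmetric quantization."""
    qmax = 2 ** (bits - 1) - 1
    s = max(abs(v) for v in values) / qmax   # scale covering the full range
    total = 0.0
    for r in values:
        q = max(-qmax - 1, min(qmax, round(r / s)))
        total += abs(r - s * q)
    return total / len(values)

random.seed(0)
vals = [random.gauss(0.0, 1.0) for _ in range(1000)]
print(quant_error(vals, 8))   # finer grid (255 levels) -> small error
print(quant_error(vals, 4))   # coarser grid (15 levels) -> larger error
```

Because INT4 has only 15 usable levels against INT8's 255, its scaling factor, and hence its per-value error, is roughly 17x larger for the same range, which is why INT4 is reserved for tightly bounded distributions.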

