Scaling and Softmax: Making Attention Work

When you compute attention scores in self-attention, you take the dot product between a query vector and key vectors.

As the dimensionality of these vectors increases, the magnitude of the dot product can also grow. This happens because:

  • The dot product tends to increase in variance with higher dimensions;
  • If each component of the vectors is roughly independent and of similar scale, the sum of their products (the dot product) has a variance proportional to the dimension; the sketch after this list illustrates this growth.
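
As a quick illustration of the second point, the following sketch (illustrative dimensions and sample sizes, not part of the lesson) draws random vectors with independent, unit-variance components and measures how the spread of their dot products grows with the dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in [16, 64, 256, 1024]:
    # Independent components with zero mean and unit variance.
    queries = rng.standard_normal((10_000, d_k))
    keys = rng.standard_normal((10_000, d_k))

    # Dot product of each query with its matching key.
    scores = (queries * keys).sum(axis=1)

    # The standard deviation grows roughly like sqrt(d_k),
    # so the variance grows roughly linearly with d_k.
    print(f"d_k={d_k:5d}  std of scores={scores.std():7.1f}  sqrt(d_k)={np.sqrt(d_k):5.1f}")
```

The printed standard deviations track $\sqrt{d_k}$ closely, which is exactly the growth that the scaling step below compensates for.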

As a result, unscaled dot products can produce very large or very small values. When these are passed through the softmax function, you can get:

  • Extremely peaked distributions (where one value dominates);
  • Very flat distributions (where all values are similar); the sketch after this list shows the peaked case.
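
Here is a small sketch of that peaked case (hand-picked score values and SciPy's softmax, purely for illustration), comparing raw scores whose magnitudes are on the order of $\sqrt{d_k}$ with the same scores after scaling:

```python
import numpy as np
from scipy.special import softmax

d_k = 512
# Unscaled scores with magnitudes on the order of sqrt(d_k), i.e. about 22.6.
raw_scores = np.array([22.0, 19.0, 17.0, 14.0])
scaled_scores = raw_scores / np.sqrt(d_k)

print(softmax(raw_scores))     # ~ [0.946, 0.047, 0.006, 0.0003] -> one value dominates
print(softmax(scaled_scores))  # ~ [0.30, 0.26, 0.24, 0.21]      -> a much smoother distribution
```

With the raw scores, one weight absorbs almost all of the probability mass; after scaling, the distribution stays informative.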

To address this, scale the dot product by dividing by the square root of the key dimension:

  • Mathematically, divide by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors.

This scaling:

  • Keeps the values within a suitable range for the softmax function;
  • Prevents extreme attention weights;
  • Makes the learning process more stable (a short sketch of the scaled scores follows this list).
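
A minimal sketch of the scaling step itself might look like the following; the helper name and the shapes are illustrative, not code from the lesson:

```python
import numpy as np

def scaled_scores(query: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Dot products between one query and every key, divided by sqrt(d_k)."""
    d_k = keys.shape[-1]
    return keys @ query / np.sqrt(d_k)

rng = np.random.default_rng(0)
query = rng.standard_normal(64)        # a single query vector, d_k = 64
keys = rng.standard_normal((10, 64))   # ten key vectors

print(scaled_scores(query, keys))      # ten scores with a moderate spread
```

Under the unit-variance assumption from earlier, dividing by $\sqrt{d_k}$ keeps the standard deviation of each score close to 1 regardless of the key dimension.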

The softmax function is central to attention mechanisms because it converts raw attention scores into a probability distribution over the input positions. Mathematically, for a set of scores $z_1, z_2, \ldots, z_n$, the softmax function for the $i$-th score is defined as:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

This equation ensures that each output is positive and that all outputs sum to 1, making them interpretable as probabilities. In the context of attention, after you compute the (scaled) dot products between the query and each key, you apply softmax to these scores. The result is a set of attention weights that reflect how much focus should be given to each input token, with higher weights indicating greater relevance.
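
The formula translates almost directly into code. The sketch below is a plain NumPy version; subtracting the maximum score first is a common numerical-stability trick and does not change the result, because adding a constant to every score leaves the softmax unchanged:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """softmax(z_i) = exp(z_i) / sum_j exp(z_j) for a 1-D array of scores."""
    shifted = scores - scores.max()   # avoids overflow in exp(); the output is unchanged
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

weights = softmax(np.array([2.0, 1.0, 0.1]))
print(weights)        # ~ [0.659, 0.242, 0.099] -> all positive
print(weights.sum())  # 1.0 (up to floating-point rounding) -> a valid probability distribution
```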

Note

Scaling the dot product before softmax and using the softmax function itself both play crucial roles in stabilizing training. Without scaling, large dot products can cause the softmax to produce extremely small gradients, which can slow down or even stall learning. Proper scaling keeps gradients in a healthy range, while softmax ensures smooth, differentiable outputs that allow effective gradient-based optimization.
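
One way to see the vanishing-gradient effect described in the note is to look at the softmax Jacobian, whose entries are $p_i(\delta_{ij} - p_j)$. The sketch below (with illustrative score values for $d_k = 512$) shows that unscaled scores push every Jacobian entry toward zero, while scaled scores keep them in a usable range:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z: np.ndarray) -> np.ndarray:
    """Jacobian of softmax: J[i, j] = p_i * (delta_ij - p_j)."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

d_k = 512
raw = np.array([45.0, 10.0, -12.0, -30.0])   # spread typical of unscaled scores at d_k = 512
scaled = raw / np.sqrt(d_k)

print(np.abs(softmax_jacobian(raw)).max())     # ~ 6e-16 -> gradients effectively vanish
print(np.abs(softmax_jacobian(scaled)).max())  # ~ 0.19  -> gradients stay in a healthy range
```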

Interpreting attention weights as probability distributions is one of the most powerful aspects of the attention mechanism. Because the softmax function produces outputs that are non-negative and sum to one, the resulting attention weights can be directly interpreted as the probability of attending to each input token. This not only helps the model focus on the most relevant parts of the input, but also provides transparency into what the model is "looking at" during each step of processing, making attention-based models more interpretable than many other neural network architectures.
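
Putting the pieces together, a minimal sketch of the full computation of attention weights might look like this (the function and variable names are illustrative; real implementations in a framework such as PyTorch or JAX also handle batching and masking):

```python
import numpy as np

def attention_weights(queries: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d_k)), row by row."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)        # shape: (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))   # three query vectors
K = rng.standard_normal((5, 64))   # five key vectors

weights = attention_weights(Q, K)
print(weights.shape)               # (3, 5): one distribution over the keys for each query
print(weights.sum(axis=-1))        # [1. 1. 1.] -> each row is a probability distribution
```

Each row of the result is non-negative and sums to one, so it can be read directly as a probability distribution over the keys; the attention output would then be these weights multiplied by the value vectors.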
