Transformer Architecture

Mathematics of Scaled Dot-Product Attention


Queries, Keys, and Values

Scaled dot-product attention operates on three vectors derived from each input token: a query (Q), a key (K), and a value (V). Each is produced by multiplying the input by a learned weight matrix.

  • Q – represents what the current token is looking for;
  • K – represents what each token has to offer;
  • V – holds the actual information to be aggregated.

During attention, queries are compared against keys to compute relevance scores. Those scores then determine how much of each value to include in the output.
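As a minimal sketch of how these three vectors are derived, the code below projects a sequence of token embeddings through three weight matrices. The dimensions are illustrative, and the matrices are randomly initialized here purely for demonstration; in a real model they are learned during training:

```python
import torch

d_model = 8
X = torch.rand(4, d_model)  # embeddings for a sequence of 4 tokens

# Stand-ins for learned projection matrices (random here for illustration)
W_q = torch.rand(d_model, d_model)
W_k = torch.rand(d_model, d_model)
W_v = torch.rand(d_model, d_model)

Q = X @ W_q  # what each token is looking for
K = X @ W_k  # what each token has to offer
V = X @ W_v  # the information to aggregate

print(Q.shape, K.shape, V.shape)  # each is torch.Size([4, 8])
```

Each row of Q, K, and V corresponds to one input token, so all three share the input's sequence length.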

The Formula

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

Each step breaks down as follows:

  1. Dot product QKᵀ – computes a raw score for how well each query matches each key;
  2. Scale by √dₖ – prevents scores from growing large when the key dimension is high, which would push softmax into regions with very small gradients;
  3. Softmax – normalizes the scores into attention weights that sum to 1;
  4. Multiply by V – produces a weighted sum of value vectors, one output per query.
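The scaling step can be verified empirically. For independent standard-normal vectors, the dot product has variance dₖ, so its standard deviation grows as √dₖ; dividing by √dₖ brings it back to roughly 1. A small sketch (with an illustrative dₖ of 512):

```python
import torch

d_k = 512  # a large key dimension, chosen for illustration
q = torch.randn(1000, d_k)
k = torch.randn(1000, d_k)

raw = (q * k).sum(dim=-1)   # unscaled dot products
scaled = raw / d_k ** 0.5   # scaled dot products

# Unscaled scores spread out with std ≈ sqrt(512) ≈ 22.6;
# scaling restores a standard deviation of roughly 1
print(raw.std().item())
print(scaled.std().item())
```

Without scaling, scores of that magnitude would saturate the softmax, making the attention distribution nearly one-hot and the gradients vanishingly small.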

Implementation in PyTorch

```python
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # Computing raw scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Converting scores to attention weights
    weights = F.softmax(scores, dim=-1)
    # Aggregating value vectors
    output = weights @ V
    return output, weights

# Sequence of 4 tokens, each projected into dimension 8
Q = torch.rand(4, 8)
K = torch.rand(4, 8)
V = torch.rand(4, 8)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights:\n", weights)
print("Output shape:", output.shape)
```

Run this locally to observe how the attention weights distribute across tokens and how the output shape relates to the input.
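As a sanity check, the hand-written function can be compared against PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0 and later), which implements the same formula in a fused kernel but returns only the output, not the weights. A sketch of the comparison, using a batch dimension of 1:

```python
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

# Batch of 1 sequence, 4 tokens, dimension 8
Q, K, V = (torch.rand(1, 4, 8) for _ in range(3))
ours, _ = scaled_dot_product_attention(Q, K, V)

# Built-in fused implementation of the same computation
builtin = F.scaled_dot_product_attention(Q, K, V)
print(torch.allclose(ours, builtin, atol=1e-5))  # True
```

In practice the built-in version is preferred for speed and memory efficiency; the manual version remains useful when the attention weights themselves need to be inspected.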


Consider: why are the dot product scores divided by the square root of the key dimension (√dₖ) in scaled dot-product attention?



Section 1. Chapter 2
