
Mathematics of Scaled Dot-Product Attention


Queries, Keys, and Values

Scaled dot-product attention operates on three vectors derived from each input token: a query (Q), a key (K), and a value (V). Each is produced by multiplying the input by a learned weight matrix.

  • Q – represents what the current token is looking for;
  • K – represents what each token has to offer;
  • V – holds the actual information to be aggregated.

During attention, queries are compared against keys to compute relevance scores. Those scores then determine how much of each value to include in the output.
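The projection step above can be sketched in a few lines of PyTorch. The weight matrices here are random for illustration (in a real model they are learned parameters, typically `nn.Linear` layers); the names `W_q`, `W_k`, `W_v` and the dimensions are assumptions made for this example.

```python
import torch

torch.manual_seed(0)

d_model, d_k = 8, 8          # embedding size and projection size, chosen for illustration
X = torch.rand(4, d_model)   # 4 input tokens, one row per token

# Learned during training in a real model; random stand-ins here
W_q = torch.rand(d_model, d_k)
W_k = torch.rand(d_model, d_k)
W_v = torch.rand(d_model, d_k)

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # what each token has to offer
V = X @ W_v   # the information to be aggregated

print(Q.shape, K.shape, V.shape)  # each is (4, 8): one row per token
```

Note that each token gets its own query, key, and value row, so all three matrices share the sequence-length dimension.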

The Formula

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Each step breaks down as follows:

  1. Dot product QKᵀ – computes a raw score for how well each query matches each key;
  2. Scale by √dₖ – prevents scores from growing large when the key dimension is high, which would push softmax into regions with very small gradients;
  3. Softmax – normalizes the scores into attention weights that sum to 1;
  4. Multiply by V – produces a weighted sum of value vectors, one output per query.
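The effect of the scaling step can be seen directly: with a large key dimension, raw dot products have high variance, and softmax over them collapses toward a near one-hot distribution. A minimal sketch, with dₖ and the number of keys chosen arbitrarily for the demonstration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_k = 512
q = torch.randn(d_k)         # one query
keys = torch.randn(10, d_k)  # ten keys

raw = keys @ q               # dot products; variance grows with d_k
scaled = raw / d_k ** 0.5    # variance brought back to roughly 1

# Unscaled softmax is nearly one-hot; scaled softmax stays spread out
print("max weight, unscaled:", F.softmax(raw, dim=-1).max().item())
print("max weight, scaled:  ", F.softmax(scaled, dim=-1).max().item())
```

A near one-hot softmax has tiny gradients for every non-winning position, which is exactly what the √dₖ division avoids.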

Implementation in PyTorch

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # Computing raw scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Converting scores to attention weights
    weights = F.softmax(scores, dim=-1)
    # Aggregating value vectors
    output = weights @ V
    return output, weights

# Sequence of 4 tokens, each projected into dimension 8
Q = torch.rand(4, 8)
K = torch.rand(4, 8)
V = torch.rand(4, 8)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights:\n", weights)
print("Output shape:", output.shape)

Run this locally to observe how the attention weights distribute across tokens and how the output shape relates to the input.
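As a sanity check, the manual computation can be compared against `torch.nn.functional.scaled_dot_product_attention`, the fused built-in that ships with PyTorch 2.0 and later. The batch dimension of 1 here is an assumption added to match the built-in's expected `(batch, seq, dim)` layout:

```python
import torch
import torch.nn.functional as F
import math

torch.manual_seed(0)
Q = torch.rand(1, 4, 8)
K = torch.rand(1, 4, 8)
V = torch.rand(1, 4, 8)

# Manual computation, same steps as the function above
scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
manual = F.softmax(scores, dim=-1) @ V

# Fused built-in (PyTorch 2.0+)
builtin = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(manual, builtin, atol=1e-6))
```

If the two outputs agree, the hand-written version implements the formula correctly; in practice the built-in is preferred because it can dispatch to optimized kernels.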


Which of the following best explains why the dot product scores are divided by the square root of the key dimension (√dₖ) in scaled dot-product attention?



Section 1. Chapter 2

