Transformer Architecture

Mathematics of Scaled Dot-Product Attention


Queries, Keys, and Values

Scaled dot-product attention operates on three vectors derived from each input token: a query (Q), a key (K), and a value (V). Each is produced by multiplying the input by a learned weight matrix.

  • Q – represents what the current token is looking for;
  • K – represents what each token has to offer;
  • V – holds the actual information to be aggregated.

During attention, queries are compared against keys to compute relevance scores. Those scores then determine how much of each value to include in the output.
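As a minimal sketch of how these three vectors are derived, the code below projects a sequence of token embeddings through three weight matrices. The dimensions are illustrative, and the matrices are randomly initialized here purely for demonstration; in a real model they are learned during training:

```python
import torch

d_model = 8
X = torch.rand(4, d_model)  # embeddings for a sequence of 4 tokens

# Stand-ins for learned projection matrices (random here for illustration)
W_q = torch.rand(d_model, d_model)
W_k = torch.rand(d_model, d_model)
W_v = torch.rand(d_model, d_model)

Q = X @ W_q  # what each token is looking for
K = X @ W_k  # what each token has to offer
V = X @ W_v  # the information to aggregate

print(Q.shape, K.shape, V.shape)  # each is torch.Size([4, 8])
```

Each row of Q, K, and V corresponds to one input token, so all three share the input's sequence length.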

The Formula

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

Each step breaks down as follows:

  1. Dot product QKᵀ – computes a raw score for how well each query matches each key;
  2. Scale by √dₖ – prevents scores from growing large when the key dimension is high, which would push softmax into regions with very small gradients;
  3. Softmax – normalizes the scores into attention weights that sum to 1;
  4. Multiply by V – produces a weighted sum of value vectors, one output per query.
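The scaling step can be verified empirically. For independent standard-normal vectors, the dot product has variance dₖ, so its standard deviation grows as √dₖ; dividing by √dₖ brings it back to roughly 1. A small sketch (with an illustrative dₖ of 512):

```python
import torch

d_k = 512  # a large key dimension, chosen for illustration
q = torch.randn(1000, d_k)
k = torch.randn(1000, d_k)

raw = (q * k).sum(dim=-1)   # unscaled dot products
scaled = raw / d_k ** 0.5   # scaled dot products

# Unscaled scores spread out with std ≈ sqrt(512) ≈ 22.6;
# scaling restores a standard deviation of roughly 1
print(raw.std().item())
print(scaled.std().item())
```

Without scaling, scores of that magnitude would saturate the softmax, making the attention distribution nearly one-hot and the gradients vanishingly small.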

Implementation in PyTorch

```python
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # Computing raw scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Converting scores to attention weights
    weights = F.softmax(scores, dim=-1)
    # Aggregating value vectors
    output = weights @ V
    return output, weights

# Sequence of 4 tokens, each projected into dimension 8
Q = torch.rand(4, 8)
K = torch.rand(4, 8)
V = torch.rand(4, 8)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights:\n", weights)
print("Output shape:", output.shape)
```

Run this locally to observe how the attention weights distribute across tokens and how the output shape relates to the input.
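As a sanity check, the hand-written function can be compared against PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0 and later), which implements the same formula in a fused kernel but returns only the output, not the weights. A sketch of the comparison, using a batch dimension of 1:

```python
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

# Batch of 1 sequence, 4 tokens, dimension 8
Q, K, V = (torch.rand(1, 4, 8) for _ in range(3))
ours, _ = scaled_dot_product_attention(Q, K, V)

# Built-in fused implementation of the same computation
builtin = F.scaled_dot_product_attention(Q, K, V)
print(torch.allclose(ours, builtin, atol=1e-5))  # True
```

In practice the built-in version is preferred for speed and memory efficiency; the manual version remains useful when the attention weights themselves need to be inspected.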


Consider: why are the dot product scores divided by the square root of the key dimension (√dₖ) in scaled dot-product attention?



Section 1. Chapter 2
