Mathematics of Scaled Dot-Product Attention
Queries, Keys, and Values
Scaled dot-product attention operates on three vectors derived from each input token: a query (Q), a key (K), and a value (V). Each is produced by multiplying the input by a learned weight matrix.
- Q – represents what the current token is looking for;
- K – represents what each token has to offer;
- V – holds the actual information to be aggregated.
During attention, queries are compared against keys to compute relevance scores. Those scores then determine how much of each value to include in the output.
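The projections described above can be sketched as follows. This is a minimal illustration, assuming a model dimension of 8 and bias-free linear layers (both choices are arbitrary here, not prescribed by the text):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_k = 8, 8          # illustrative dimensions
x = torch.rand(4, d_model)   # a sequence of 4 input tokens

# One learned weight matrix per projection
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

# Each input token yields a query, a key, and a value
Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)  # each is (4, 8)
```

Each row of Q, K, and V corresponds to one input token, so the sequence length is preserved through the projections.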
The Formula
Attention(Q, K, V) = softmax(QK⊤ / √d_k) V

Each step breaks down as follows:
- Dot product QK⊤ – computes a raw score for how well each query matches each key;
- Scale by √d_k – divides the scores by √d_k to keep them from growing large when the key dimension is high, which would push softmax into regions with very small gradients;
- Softmax – normalizes the scores into attention weights that sum to 1;
- Multiply by V – produces a weighted sum of value vectors, one output per query.
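The effect of the scaling step can be demonstrated with a small experiment (the dimensions and random scores below are illustrative, not from the text). Without scaling, dot products over a high-dimensional key space have large variance, and softmax collapses toward a near one-hot distribution:

```python
import torch
import torch.nn.functional as F
import math

torch.manual_seed(0)
d_k = 512                     # a deliberately large key dimension
q = torch.randn(d_k)          # one query
k = torch.randn(5, d_k)       # five keys

raw = k @ q                   # unscaled scores: variance grows with d_k
scaled = raw / math.sqrt(d_k) # scaled scores: variance stays near 1

w_raw = F.softmax(raw, dim=-1)
w_scaled = F.softmax(scaled, dim=-1)

# The unscaled weights are far more peaked than the scaled ones
print("max weight without scaling:", w_raw.max().item())
print("max weight with scaling:   ", w_scaled.max().item())
```

A near one-hot softmax is exactly the regime where gradients vanish, which is why the division by √d_k is part of the formula.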
Implementation in PyTorch
```python
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # Computing raw scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Converting scores to attention weights
    weights = F.softmax(scores, dim=-1)
    # Aggregating value vectors
    output = weights @ V
    return output, weights

# Sequence of 4 tokens, each projected into dimension 8
Q = torch.rand(4, 8)
K = torch.rand(4, 8)
V = torch.rand(4, 8)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights:\n", weights)
print("Output shape:", output.shape)
```
Run this locally to observe how the attention weights distribute across tokens and how the output shape relates to the input.
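As a sanity check, the manual implementation can be compared against PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention` (available in PyTorch 2.0+, which returns only the output, not the weights). The batch dimension of 1 below is an assumption for the built-in's expected input shape:

```python
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

torch.manual_seed(0)
Q = torch.rand(1, 4, 8)  # batch of 1, 4 tokens, dimension 8
K = torch.rand(1, 4, 8)
V = torch.rand(1, 4, 8)

manual, _ = scaled_dot_product_attention(Q, K, V)
builtin = F.scaled_dot_product_attention(Q, K, V)

# Both paths should agree up to floating-point tolerance
print(torch.allclose(manual, builtin, atol=1e-5))
```

Agreement between the two confirms that the manual version implements the same formula the library uses by default.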
Section 1. Chapter 2