How Self-Attention Is Calculated
To understand how self-attention is calculated in transformers, you need to follow a precise sequence of mathematical operations. The self-attention mechanism allows each word in a sentence to focus on other words when producing an output representation. This is achieved by computing a weighted sum of all the input vectors, where the weights reflect the importance of each word to the current word being processed.
Suppose you have a short sentence represented as a matrix, where each row is a word embedding. You first project these embeddings into queries, keys, and values using learned weight matrices. The core of self-attention is then the computation of attention scores and the aggregation of value vectors based on these scores.
Queries, keys, and values are different representations of the input, each used for a specific role in the attention calculation.
Self-Attention Calculation
You can break down self-attention into a series of clear mathematical steps for each word in the sequence:
Project inputs to queries, keys, and values: project the input matrix using learned weight matrices to obtain queries, keys, and values.
```python
import numpy as np

# Example input: 3 words, embedding size 4
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # word 1
    [0.0, 2.0, 0.0, 2.0],  # word 2
    [1.0, 1.0, 1.0, 1.0]   # word 3
])

# Weight matrices for queries, keys, values (embedding size 4 -> 4)
W_q = np.array([
    [0.1, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.3, 0.0],
    [0.1, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.3]
])
W_k = np.array([
    [0.2, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.3],
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.2, 0.0, 0.1]
])
W_v = np.array([
    [0.0, 0.1, 0.0, 0.2],
    [0.2, 0.0, 0.2, 0.0],
    [0.0, 0.3, 0.1, 0.0],
    [0.1, 0.0, 0.0, 0.3]
])

# Compute queries, keys, values
Q = X @ W_q
K = X @ W_k
V = X @ W_v
```
Compute attention scores:
- Calculate the similarity between each query and all keys;
- These scores indicate how much focus each word should place on the others.
```python
scores = Q @ K.T
```
Scale the scores: divide the attention scores by the square root of the key dimension. This scaling helps stabilize the gradients during training.
```python
d_k = Q.shape[1]
scaled_scores = scores / np.sqrt(d_k)
```
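To see why the scaling factor is the square root of the key dimension specifically: for query and key vectors with unit-variance entries, each dot product is a sum of d_k terms, so its variance grows with d_k and its spread with the square root of d_k. Without scaling, large dimensions push the softmax into saturation, giving near-one-hot weights and tiny gradients. A small empirical sketch (the random-vector simulation here is illustrative, not part of the lesson):

```python
import numpy as np

# Dot products of random unit-variance vectors: the spread grows like sqrt(d)
rng = np.random.default_rng(0)
for d in (4, 64, 1024):
    q = rng.standard_normal((10_000, d))
    k = rng.standard_normal((10_000, d))
    dots = np.sum(q * k, axis=1)  # 10,000 sample dot products of dimension d
    print(f"d={d:4d}  std of q.k ~ {dots.std():6.1f}  sqrt(d) = {np.sqrt(d):5.1f}")
```

Dividing the scores by sqrt(d_k) brings them back to roughly unit variance regardless of dimension, which keeps the softmax in a well-behaved regime.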
Apply the softmax function: convert the scaled scores into attention weights using the softmax function. This ensures the weights are positive and sum to 1 for each word.
```python
# Softmax to get attention weights
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

attention_weights = softmax(scaled_scores)
```
Compute the weighted sum of values: use the attention weights to calculate a weighted sum of the value vectors. This produces the final self-attention output for each word.
```python
# Weighted sum of values
output = attention_weights @ V
print("Attention weights:\n", attention_weights)
print("Self-attention output:\n", output)
```
These steps capture the essential computations behind self-attention in transformer models.
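Putting the steps together, the whole computation fits in one short function. A minimal sketch (the function name `self_attention` is my own, not from the lesson):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a single head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # 1. project to queries, keys, values
    scores = Q @ K.T                         # 2. query-key similarities
    scaled = scores / np.sqrt(K.shape[1])    # 3. scale by sqrt(d_k)
    weights = softmax(scaled)                # 4. rows of weights sum to 1
    return weights @ V                       # 5. weighted sum of values
```

Applied to the example matrices X, W_q, W_k, and W_v defined earlier, it reproduces the same `output`.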
1. What is the correct order of operations in self-attention for one input word?
2. Why do we scale attention scores by the square root of the key dimension?
3. Which vectors are used to compute attention scores?