Transformers for Natural Language Processing

How Self-Attention Is Calculated


To understand how self-attention is calculated in transformers, you need to follow a precise sequence of mathematical operations. The self-attention mechanism allows each word in a sentence to focus on other words when producing an output representation. This is achieved by computing a weighted sum of all the input vectors, where the weights reflect the importance of each word to the current word being processed.
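To make the weighted-sum idea concrete before queries and keys enter the picture, here is a minimal sketch; the weights are hand-picked for illustration, whereas in real self-attention they are computed from the data:

```python
import numpy as np

# Three input vectors (one per word) and made-up attention weights
# for a single word being processed.
inputs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
weights = np.array([0.7, 0.2, 0.1])  # positive, sum to 1

# The output for this word is a weighted sum of all input vectors.
output = weights @ inputs
print(output)  # [0.8 0.3]
```

A word whose weight is large contributes more to the output; the rest of this chapter is about how those weights are computed.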

Suppose you have a short sentence represented as a matrix, where each row is a word embedding. You first project these embeddings into queries, keys, and values using learned weight matrices. The core of self-attention is then the computation of attention scores and the aggregation of value vectors based on these scores.

Note

Queries, keys, and values are three different learned projections of the same input: the query represents what a word is looking for, the key represents what a word offers for matching, and the value carries the information that is actually aggregated.

Self-Attention Calculation

You can break down self-attention into a series of clear mathematical steps for each word in the sequence:

Project inputs to queries, keys, and values: multiply the input matrix X by the learned weight matrices W_q, W_k, and W_v to obtain the query matrix Q, the key matrix K, and the value matrix V.

```python
import numpy as np

# Example input: 3 words, embedding size 4
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # word 1
    [0.0, 2.0, 0.0, 2.0],  # word 2
    [1.0, 1.0, 1.0, 1.0]   # word 3
])

# Weight matrices for queries, keys, values (embedding size 4 -> 4)
W_q = np.array([
    [0.1, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.3, 0.0],
    [0.1, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.3]
])
W_k = np.array([
    [0.2, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.3],
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.2, 0.0, 0.1]
])
W_v = np.array([
    [0.0, 0.1, 0.0, 0.2],
    [0.2, 0.0, 0.2, 0.0],
    [0.0, 0.3, 0.1, 0.0],
    [0.1, 0.0, 0.0, 0.3]
])

# Compute queries, keys, values
Q = X @ W_q
K = X @ W_k
V = X @ W_v
```

Compute attention scores:

  • Calculate the similarity between each query and all keys by taking their dot products;
  • These scores indicate how much focus each word should place on the others.
```python
scores = Q @ K.T
```

Scale the scores: divide the attention scores by the square root of the key dimension d_k. Without this scaling, dot products grow in magnitude as the key dimension increases, pushing the softmax into saturated regions where gradients are tiny; dividing by the square root of d_k keeps the scores in a stable range during training.

```python
d_k = Q.shape[1]
scaled_scores = scores / np.sqrt(d_k)
```
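The effect of the scaling can be checked empirically: for random queries and keys with unit-variance components, the query-key dot products have variance roughly d_k, so their spread grows with the key dimension. A small illustration (the dimensions and sample count are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in (4, 64, 1024):
    # 1000 random query/key pairs with unit-variance components
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    dots = (q * k).sum(axis=1)  # 1000 query-key dot products
    # Unscaled spread grows like sqrt(d_k); scaled spread stays near 1.
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
```

The unscaled standard deviation grows from about 2 to about 32 across these dimensions, while the scaled version stays close to 1 regardless of d_k.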

Apply the softmax function: convert the scaled scores into attention weights using the softmax function. This ensures the weights are positive and sum to 1 for each word.

```python
# Softmax to get attention weights
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

attention_weights = softmax(scaled_scores)
```

Compute the weighted sum of values: use the attention weights to calculate a weighted sum of the value vectors. This produces the final self-attention output for each word.

```python
# Weighted sum of values
output = attention_weights @ V
print("Attention weights:\n", attention_weights)
print("Self-attention output:\n", output)
```

These steps capture the essential computations behind self-attention in transformer models.
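The five steps can be collected into a single function; this is a plain restatement of the code above, not a library implementation, and the identity weight matrices in the usage example are chosen only to keep the result easy to verify:

```python
import numpy as np

def softmax(x):
    # Subtract the row max before exponentiating for numerical stability.
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q                              # step 1: queries
    K = X @ W_k                              #         keys
    V = X @ W_v                              #         values
    scores = Q @ K.T                         # step 2: query-key similarities
    scaled = scores / np.sqrt(K.shape[-1])   # step 3: scale by sqrt(d_k)
    weights = softmax(scaled)                # step 4: rows sum to 1
    return weights @ V                       # step 5: weighted sum of values

# Reusing the example input from the first step, with identity projections:
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])
W = np.eye(4)
out = self_attention(X, W, W, W)
print(out.shape)  # (3, 4): one output vector per input word
```

With identity projections, Q, K, and V all equal X, so the output of each word is simply a softmax-weighted mixture of the original embeddings.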

1. What is the correct order of operations in self-attention for one input word?

2. Why do we scale attention scores by the square root of the key dimension?

3. Which vectors are used to compute attention scores?


Section 1. Chapter 5
