Self-Attention Mechanism
The self-attention mechanism is a core component of transformer models, allowing each token in a sequence to dynamically focus on other tokens when building its contextual representation. At the heart of self-attention are three vectors associated with each token: the query, key, and value. Each token generates its own query, key, and value vectors through learned linear transformations. The interaction among these vectors determines how much attention each token pays to every other token in the sequence.
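To make these projections concrete, here is a minimal NumPy sketch of how query, key, and value vectors could be derived from token embeddings. The dimensions are arbitrary and the random matrices merely stand in for learned projection weights; none of these values come from a real model.

```python
import numpy as np

# A toy sequence of 4 tokens, each an 8-dimensional embedding.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8

X = rng.normal(size=(n_tokens, d_model))   # token embeddings
W_q = rng.normal(size=(d_model, d_k))      # stand-in for the learned query projection
W_k = rng.normal(size=(d_model, d_k))      # stand-in for the learned key projection
W_v = rng.normal(size=(d_model, d_k))      # stand-in for the learned value projection

Q = X @ W_q   # one query vector per token
K = X @ W_k   # one key vector per token
V = X @ W_v   # one value vector per token

print(Q.shape, K.shape, V.shape)   # (4, 8) (4, 8) (4, 8)
```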
Here's how the process unfolds: for every token, you compare its query vector to the key vectors of all tokens (including itself) using a similarity measure, typically a dot product. Mathematically, for a token at position $i$ and a token at position $j$:
$$\text{score}_{ij} = q_i \cdot k_j^T$$

This produces a set of scores that indicate the relevance of each token to the current token.
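As a small worked example, the snippet below computes these dot-product scores for one hypothetical query vector against the keys of a three-token sequence; the numbers are made up purely for illustration.

```python
import numpy as np

# Hypothetical query for token i and keys for a 3-token sequence.
q_i = np.array([1.0, 0.0, 2.0])
K = np.array([
    [0.5, 1.0, 0.0],   # key of token 1
    [1.0, 0.0, 1.0],   # key of token 2
    [0.0, 2.0, 0.5],   # key of token 3
])

# score_ij = q_i . k_j for every j: one dot product per key.
scores_i = K @ q_i
print(scores_i)   # [0.5 3.  1. ]
```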
These scores are then normalized, usually with a softmax function, to obtain attention weights, which are numbers between 0 and 1 that sum to 1:

$$\alpha_{ij} = \text{softmax}(\text{score}_{ij}) = \frac{\exp(\text{score}_{ij})}{\sum_{l=1}^{n} \exp(\text{score}_{il})}$$
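Continuing with the same illustrative scores, this sketch applies the softmax to turn them into attention weights that are all between 0 and 1 and sum to 1.

```python
import numpy as np

scores_i = np.array([0.5, 3.0, 1.0])   # illustrative scores from the previous sketch

# Numerically stable softmax: subtract the max before exponentiating.
exp_scores = np.exp(scores_i - scores_i.max())
alpha_i = exp_scores / exp_scores.sum()

print(alpha_i)         # roughly [0.07, 0.82, 0.11]
print(alpha_i.sum())   # 1.0
```

Note how the largest score receives most of the attention weight, while the other tokens are not zeroed out entirely.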
These weights dictate how much each value vector contributes to the final output representation for the given token. The output is a weighted sum of all value vectors, where the weights reflect the contextual importance as determined by the attention mechanism:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} v_j$$

Multi-head attention is a technique where the self-attention mechanism is executed in parallel multiple times, with each "head" using its own set of learned projection matrices. This allows the model to capture different types of relationships and dependencies in the data simultaneously, as each head can focus on different aspects or subspaces of the input.
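Putting the pieces together, here is a minimal end-to-end sketch: a single-head attention function that follows the formulas above, wrapped in a toy multi-head loop whose randomly initialized matrices stand in for learned projections. Real implementations usually also scale the scores by $1/\sqrt{d_k}$ and apply a final learned output projection after concatenating the heads; both are only noted in comments here.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention following the formulas above.

    Practical implementations typically also divide the scores by sqrt(d_k)
    before the softmax; that scaling is omitted here to mirror the equations.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T                                       # score_ij = q_i . k_j
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax -> alpha_ij
    return weights @ V                                     # z_i = sum_j alpha_ij * v_j

# Toy multi-head attention: run several heads in parallel, each with its own
# projections, and concatenate the per-head outputs.
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(n_tokens, d_model))
head_outputs = []
for _ in range(n_heads):
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))
    head_outputs.append(self_attention(X, W_q, W_k, W_v))

# Concatenate along the feature dimension; a learned output projection
# would normally follow this concatenation.
Z = np.concatenate(head_outputs, axis=-1)
print(Z.shape)   # (4, 8)
```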
Self-attention offers several advantages:

- Enables direct modeling of dependencies between any pair of tokens, regardless of their distance in the sequence;
- Allows for parallel computation across all tokens, increasing efficiency compared to sequential models;
- Adapts context dynamically, letting each token attend to the most relevant information for the task.
However, the mechanism also comes with limitations:

- Computational and memory requirements grow quadratically with sequence length, making it challenging for very long inputs;
- May struggle to encode absolute or relative positional information unless supplemented by additional mechanisms;
- Can sometimes overfit to local patterns if not properly regularized or diversified.