Self-Attention: Contextual Representations from Within
Understanding self-attention begins with the idea that each element in a sequence — such as each word in a sentence — can "look at" all the other elements, including itself, to gather information about its context. Instead of processing each element in isolation or only considering immediate neighbors, self-attention lets every element dynamically decide which other elements are most relevant for understanding its meaning within the sequence. This means that a word like "bank" in "She sat by the river bank" can attend to "river" and adjust its representation accordingly, making it context-aware. In self-attention, every position in the input sequence computes a new representation by considering a weighted combination of all positions in the sequence, allowing for flexible, context-sensitive understanding.
Mathematical View of Self-Attention
Mathematically, self-attention transforms a sequence of vectors (such as word embeddings) into a new sequence of context-aware vectors. For each position in the sequence, you compute a set of weights that determine how much attention should be paid to every other position. These weights come from similarity scores between the query of the current position and the keys of all positions.
The output at each position is a weighted sum of the value vectors of all positions. Formally:
- The attention score between position i and position j is computed as the similarity between their query and key vectors;
- The attention weights are obtained by applying a normalization function (typically softmax) to these scores across all j for each i;
- The new representation at position i is produced by summing all value vectors, each multiplied by its corresponding attention weight.
This mechanism allows every output vector to incorporate information from the entire sequence, weighted by relevance as determined by the learned similarity function.
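As an illustration of these three steps, here is a minimal NumPy sketch for a toy sequence of four positions. The embedding size, the random projection matrices, and the variable names are assumptions made purely for the example, and the scaling factor that appears in the equation below is omitted here for simplicity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy sequence: 4 positions, each with an 8-dimensional embedding (assumed sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))

# Learned projections would normally produce Q, K, V; random matrices stand in here.
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Step 1: attention scores = similarity between query i and key j, for all i, j.
scores = Q @ K.T                      # shape (4, 4)

# Step 2: attention weights = softmax over j for each i, so every row sums to 1.
weights = softmax(scores, axis=-1)    # shape (4, 4)

# Step 3: new representation at position i = weighted sum of all value vectors.
output = weights @ V                  # shape (4, 8)

print(weights.round(2))  # row i shows how much position i attends to each position
```

Printing the weight matrix makes the mechanism visible: each row is a distribution over the whole sequence, and the output row for that position is the corresponding mixture of value vectors.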
Formal Self-Attention Equation
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Here Q, K, and V are matrices whose rows are the query, key, and value vectors for each position, and dₖ is the dimensionality of the key vectors; dividing by √dₖ keeps the dot products in a range where the softmax remains well-behaved.

While traditional attention mechanisms allow a model to focus on relevant parts of a different input (such as aligning words in a translation), self-attention applies attention within a single sequence. Every element attends to all others in the same sequence, making it especially powerful for capturing relationships and dependencies in tasks like language modeling and sequence classification.
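To tie the equation directly to code, the following is a small sketch of the same computation in matrix form, this time including the 1/√dₖ scaling. The sequence length and dimensions are illustrative assumptions rather than values specified in the lesson.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices whose rows are per-position vectors.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scaled similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # one context-aware row per position

# Illustrative check: with seq_len = 3 and d_k = 4 (assumed), output keeps the input shape.
rng = np.random.default_rng(1)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)
```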