
Self-Attention: Contextual Representations from Within

Understanding self-attention begins with the idea that each element in a sequence — such as each word in a sentence — can "look at" all the other elements, including itself, to gather information about its context. Instead of processing each element in isolation or only considering immediate neighbors, self-attention lets every element dynamically decide which other elements are most relevant for understanding its meaning within the sequence. This means that a word like "bank" in "She sat by the river bank" can attend to "river" and adjust its representation accordingly, making it context-aware. In self-attention, every position in the input sequence computes a new representation as a weighted combination of all positions in the sequence, allowing for flexible, context-sensitive understanding.

Mathematical View of Self-Attention

Mathematically, self-attention transforms a sequence of vectors (such as word embeddings) into a new sequence of context-aware vectors. For each position in the sequence, you compute a set of weights that determine how much attention should be paid to every other position. These weights come from similarity scores between the query of the current position and the keys of all positions.

The output at each position is a weighted sum of the value vectors of all positions. Formally:

  • The attention score between position i and position j is computed as the similarity between their query and key vectors;
  • The attention weights are obtained by applying a normalization function (typically softmax) to these scores across all j for each i;
  • The new representation at position i is produced by summing all value vectors, each multiplied by its corresponding attention weight.

This mechanism allows every output vector to incorporate information from the entire sequence, weighted by relevance as determined by the learned similarity function.
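The three steps above can be sketched position by position. This is a toy NumPy example: the random Q, K, and V matrices stand in for the learned linear projections a real model would compute from its inputs, and the shapes are illustrative.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy sequence: 4 positions, vector dimension 3.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))  # queries, one per position
K = rng.normal(size=(4, 3))  # keys, one per position
V = rng.normal(size=(4, 3))  # values, one per position

outputs = np.zeros_like(V)
for i in range(len(Q)):
    # Step 1: similarity of query i with every key j
    # (the sqrt(d_k) scaling keeps scores in a stable range).
    scores = K @ Q[i] / np.sqrt(Q.shape[-1])
    # Step 2: normalize scores into attention weights.
    weights = softmax(scores)
    # Step 3: weighted sum of all value vectors.
    outputs[i] = weights @ V

print(outputs.shape)  # (4, 3): one context-aware vector per position
```

The loop makes the per-position logic explicit; in practice all positions are computed at once with a single matrix product, as the formal equation below shows.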

Formal Self-Attention Equation

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Note

While traditional attention mechanisms allow a model to focus on relevant parts of a different input (such as aligning words in a translation), self-attention applies attention within a single sequence. In self-attention, every element attends to all others in the same sequence, making it especially powerful for capturing relationships and dependencies in tasks like language modeling and sequence classification.
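The equation can be implemented directly in matrix form, computing every position's output in one pass. This is a minimal NumPy sketch; in a real model Q, K, and V would come from learned projection matrices, while here the same random input stands in for all three to keep the example self-contained.

```python
import numpy as np

def self_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len) score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                            # weighted sum of value vectors

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))  # 5 tokens, dimension 8
out = self_attention(X, X, X)
print(out.shape)  # (5, 8): same shape as the input, but context-aware
```

Row i of the internal `weights` matrix holds the attention weights of position i over all positions j, exactly as described in the steps above.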

