Attention Mechanisms Explained

Self-Attention vs Cross-Attention: Conceptual Differences

Understanding the distinction between self-attention and cross-attention is crucial for grasping how transformers process and relate information.

Self-Attention

  • Operates within a single sequence;
  • Allows each position to access and weigh information from all other positions in that sequence;
  • Enables the model to build rich, contextualized representations by dynamically focusing on relevant content, regardless of its position;
  • Example: In a sentence, self-attention helps the model understand that a pronoun like "it" might refer to a noun that appeared earlier.

Cross-Attention

  • Connects two different sequences;
  • Commonly used between the output of an encoder (such as a processed input sentence) and the decoder steps in tasks like machine translation;
  • At each decoder step, the decoder queries the encoded input, using cross-attention to selectively focus on the most relevant parts of the input sequence when generating the next output token.
Note

Self-attention is used when a model needs to relate different positions within the same sequence, such as understanding dependencies in a sentence. Cross-attention is employed when a model must relate information from two different sequences, such as mapping an input sentence to an output sentence in translation. Transformers use self-attention throughout both the encoder and decoder, but cross-attention appears only in the decoder, enabling it to access and integrate encoder outputs.
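
The contrast described in the note can be seen directly in code: with a single attention module, the only difference between the two is where the queries, keys, and values come from. The sketch below is a minimal illustration assuming PyTorch's nn.MultiheadAttention; the tensor sizes and names (encoder_output, decoder_state) are made up, and a real transformer would use separate, learned attention layers for each role rather than reusing one module.

import torch
import torch.nn as nn

embed_dim, num_heads = 64, 4
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Toy tensors: batch of 1, source length 6, target length 3 (sizes are illustrative)
encoder_output = torch.randn(1, 6, embed_dim)   # processed input sentence
decoder_state = torch.randn(1, 3, embed_dim)    # partially generated output

# Self-attention: queries, keys, and values all come from the same sequence
self_out, _ = attn(decoder_state, decoder_state, decoder_state)

# Cross-attention: queries come from the decoder, keys and values from the encoder
cross_out, _ = attn(decoder_state, encoder_output, encoder_output)

print(self_out.shape, cross_out.shape)  # both torch.Size([1, 3, 64])

Both calls return one output vector per query position; whether the attention is "self" or "cross" is decided entirely by the arguments passed in.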

To make these mechanisms more concrete, consider the flow of information in each. With self-attention, imagine processing the sentence "The cat sat on the mat." At each word position, the model computes attention scores with every other word in the sentence, allowing it to weigh and combine information from across the sequence. This enables understanding of context and relationships such as subject-verb agreement or resolving references.
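
The sketch below spells out that flow as plain scaled dot-product self-attention in NumPy. The random vectors stand in for learned token embeddings, and the learned query, key, and value projections are omitted to keep it short, so read it as an illustration of the score-weight-combine pattern rather than a full implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), d))    # toy embeddings, one row per token

# Self-attention: queries, keys, and values all come from the same sequence
Q, K, V = X, X, X                        # learned projections omitted for brevity
scores = Q @ K.T / np.sqrt(d)            # (6, 6): every token scored against every other token
weights = softmax(scores, axis=-1)       # each row sums to 1
context = weights @ V                    # (6, 8): each token becomes a weighted mix of all tokens

print(weights[1].round(2))               # how strongly 'cat' attends to each word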

For cross-attention, picture a translation task where the model has already encoded the English sentence "The cat sat on the mat." As the decoder generates a translation in another language, at each step it uses cross-attention to examine all encoder outputs. This allows the decoder to decide which parts of the input sentence are most relevant for producing the next word in the translation, ensuring that the output is both accurate and contextually appropriate.
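
A single cross-attention step of that kind can be sketched the same way: one decoder query is scored against every encoder output, and the resulting weights mix the encoder outputs into a context vector. The vectors below are random placeholders for real encoder states and the decoder state, so the particular weights are meaningless; only the shape of the computation matters.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
rng = np.random.default_rng(1)
encoder_outputs = rng.normal(size=(6, d))  # one vector per source word in "The cat sat on the mat"
decoder_query = rng.normal(size=(d,))      # decoder state while producing the next target word

# Cross-attention: the decoder query scores every encoder position...
scores = encoder_outputs @ decoder_query / np.sqrt(d)  # (6,)
weights = softmax(scores)                              # which source words matter right now
# ...and the context is a weighted mix of the encoder outputs
context = weights @ encoder_outputs                    # (8,), used to help predict the next token

print(weights.round(2))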


Which statement best describes the main difference between self-attention and cross-attention in transformers?

