Masking in Attention: Causal and Padding Masks
Masking in attention mechanisms controls which parts of a sequence the model can focus on. You use masking to ensure the model attends only to relevant input and cannot access information it should not see. When processing sequences of different lengths, you add padding tokens so all sequences in a batch have the same length. If the model is not told to ignore these padding tokens, it may treat them as meaningful content, which harms performance. In autoregressive tasks like language generation, causal masks prevent the model from looking ahead at future tokens, which during training would let it simply copy the targets instead of learning to predict them. Masking enforces these constraints, keeping attention on meaningful input and preventing information leakage.
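To make the padding step concrete, here is a minimal sketch, assuming PyTorch, that batches three variable-length sequences together. The token ids are invented for illustration, and `PAD_ID = 0` is an arbitrary assumed choice for the padding token id.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three sequences of different lengths (token ids invented for illustration).
seqs = [
    torch.tensor([5, 9, 2]),
    torch.tensor([7, 4]),
    torch.tensor([3, 8, 6, 1]),
]

PAD_ID = 0  # assumed id reserved for the padding token

# Pad every sequence to the length of the longest one so they stack into a batch.
batch = pad_sequence(seqs, batch_first=True, padding_value=PAD_ID)
print(batch)
# tensor([[5, 9, 2, 0],
#         [7, 4, 0, 0],
#         [3, 8, 6, 1]])
```

The trailing zeros are pure bookkeeping; without a padding mask, the model would attend to them as if they were real tokens.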
Causal masks and padding masks serve different purposes in attention. Causal masks are used in autoregressive models to block attention to future positions, ensuring that each position can only attend to itself and previous positions. Padding masks, on the other hand, are used to ignore padding tokens in variable-length sequences, making sure that attention scores for these positions do not influence the model's output.
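To make the distinction concrete, the sketch below (again assuming PyTorch) builds both kinds of mask for a small padded batch. The token ids and `PAD_ID` are illustrative assumptions, and the convention here is that `True` marks a position attention is allowed to reach.

```python
import torch

PAD_ID = 0
ids = torch.tensor([[5, 9, 2, 0],
                    [7, 4, 0, 0]])  # padded batch of token ids
seq_len = ids.size(1)

# Causal mask: query i may attend to key j only when j <= i (lower triangle).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Padding mask: True for real tokens, False for padding positions.
padding_mask = ids != PAD_ID  # shape (batch, seq_len)

# Combine them: broadcast the causal mask over the batch and the padding
# mask over query positions. Entry [b, i, j] says whether query i in
# sequence b may attend to key j.
combined = causal_mask.unsqueeze(0) & padding_mask.unsqueeze(1)
print(combined.shape)  # torch.Size([2, 4, 4])
```

In a decoder trained on padded batches, the two masks are typically combined this way, so a position is attendable only if it is both non-future and non-padding.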
Masks are applied to the raw attention scores before the softmax step. These scores measure how similar each query is to every key. The mask sets unwanted scores, such as those for future tokens (causal mask) or padding tokens (padding mask), to a very large negative number, typically negative infinity. When softmax is applied, these masked positions receive zero probability, so the model cannot attend to them. The mechanism is identical for both mask types; they differ only in which positions they block: causal masks block future tokens, while padding masks block positions filled with padding.
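Putting the pieces together, here is a minimal sketch of scaled dot-product attention with masking, assuming PyTorch. `masked_attention` is a hypothetical helper name, not a library function; it replaces disallowed scores with negative infinity before the softmax, exactly as described above.

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    d_k = q.size(-1)
    # Raw similarity scores between every query and every key.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Set disallowed positions to -inf so softmax assigns them zero probability.
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Toy data: batch of 1, 4 positions, model dimension 8.
q = k = v = torch.randn(1, 4, 8)
causal = torch.tril(torch.ones(4, 4, dtype=torch.bool)).unsqueeze(0)
out = masked_attention(q, k, v, causal)  # the combined mask above works too
```

One caveat worth knowing: if every position in a row is masked (for example, a query row that is itself padding), the softmax over all-negative-infinity scores produces NaNs, so fully padded query rows are usually zeroed out or ignored downstream.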