
Masking in Attention: Causal and Padding Masks

Masking in attention mechanisms controls which parts of a sequence the model can attend to. You use masking to ensure the model only attends to relevant input and never accesses information it should not see. When processing sequences of different lengths, you add padding tokens so that every sequence in a batch has the same length. If the model is not told to ignore these padding tokens, it may treat them as meaningful input, which harms performance. In autoregressive tasks such as language generation, causal masks prevent the model from looking ahead at future tokens. Masking enforces both rules, keeping the model focused on valid input and preventing information leakage.
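
To make the padding side concrete, here is a minimal sketch, assuming PyTorch and a hypothetical pad token id of 0, that derives a boolean padding mask from a batch of padded token IDs (True marks real tokens, False marks padding):

```python
import torch

PAD_ID = 0  # assumed pad token id for this illustration

token_ids = torch.tensor([
    [5, 7, 9, PAD_ID, PAD_ID],  # length-3 sequence padded to length 5
    [3, 2, 8, 6, 1],            # full-length sequence, no padding
])

# True where the token is real, False where it is padding.
padding_mask = token_ids != PAD_ID  # shape: (batch, seq_len)
print(padding_mask)
# tensor([[ True,  True,  True, False, False],
#         [ True,  True,  True,  True,  True]])
```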

Note

Causal masks and padding masks serve different purposes in attention. Causal masks are used in autoregressive models to block attention to future positions, ensuring that each position can only attend to itself and previous positions. Padding masks, on the other hand, are used to ignore padding tokens in variable-length sequences, making sure that attention scores for these positions do not influence the model's output.
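
The causal rule, that each position may attend only to itself and earlier positions, corresponds to a lower-triangular boolean matrix. A small sketch, again assuming PyTorch:

```python
import torch

seq_len = 5
# Row i is the mask for query position i: it may attend to keys 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False, False, False],
#         [ True,  True, False, False, False],
#         [ True,  True,  True, False, False],
#         [ True,  True,  True,  True, False],
#         [ True,  True,  True,  True,  True]])
```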

Masks are applied to the raw attention scores before the softmax step. These scores measure how similar each query is to every key. The mask sets unwanted scores, such as those for future tokens (causal mask) or padding tokens (padding mask), to a very large negative number like negative infinity. When softmax is applied, these masked scores become zero probabilities. This means the model cannot attend to these positions. Causal masks block attention to future tokens, while padding masks ignore positions filled with padding, ensuring the model focuses only on valid input.
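
Putting these pieces together, the sketch below applies a combined causal-and-padding mask to the raw scores before softmax, as described above. It assumes PyTorch; the masked_attention helper and the toy shapes are illustrative, not a library API:

```python
import math
import torch

def masked_attention(q, k, v, mask):
    # q, k, v: (batch, seq_len, d_k); mask: broadcastable to (batch, seq_len, seq_len),
    # where True means "may attend" and False means "blocked".
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # raw similarity scores
    scores = scores.masked_fill(~mask, float("-inf"))  # blocked positions get -inf
    weights = torch.softmax(scores, dim=-1)            # blocked scores become 0 probability
    return weights @ v

# Toy example combining a causal mask with a padding mask.
batch, seq_len, d_k = 1, 4, 8
q = k = v = torch.randn(batch, seq_len, d_k)

causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # (seq_len, seq_len)
padding = torch.tensor([[True, True, True, False]])                  # last token is padding
mask = causal.unsqueeze(0) & padding.unsqueeze(1)                    # (batch, seq_len, seq_len)

output = masked_attention(q, k, v, mask)
print(output.shape)  # torch.Size([1, 4, 8])
```

Because the blocked scores are set to negative infinity before softmax, their attention weights come out as exactly zero, so padded and future positions contribute nothing to the output.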


