
Masking in Attention: Causal and Padding Masks

Masking in attention mechanisms controls which parts of a sequence the model can attend to. You use masking to ensure the model only attends to relevant input and never accesses information it should not see. When processing sequences of different lengths, you add padding tokens so that every sequence in a batch has the same length. If the model is not told to ignore these padding tokens, it may treat them as meaningful input, which harms performance. In autoregressive tasks such as language generation, causal masks prevent the model from looking ahead at future tokens. Masking enforces both rules, keeping the model focused on valid input and preventing information leakage.
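
To make the padding side concrete, here is a minimal sketch, assuming PyTorch and a hypothetical pad token id of 0, that derives a boolean padding mask from a batch of padded token IDs (True marks real tokens, False marks padding):

```python
import torch

PAD_ID = 0  # assumed pad token id for this illustration

token_ids = torch.tensor([
    [5, 7, 9, PAD_ID, PAD_ID],  # length-3 sequence padded to length 5
    [3, 2, 8, 6, 1],            # full-length sequence, no padding
])

# True where the token is real, False where it is padding.
padding_mask = token_ids != PAD_ID  # shape: (batch, seq_len)
print(padding_mask)
# tensor([[ True,  True,  True, False, False],
#         [ True,  True,  True,  True,  True]])
```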

Note

Causal masks and padding masks serve different purposes in attention. Causal masks are used in autoregressive models to block attention to future positions, ensuring that each position can only attend to itself and previous positions. Padding masks, on the other hand, are used to ignore padding tokens in variable-length sequences, making sure that attention scores for these positions do not influence the model's output.
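
The causal rule, that each position may attend only to itself and earlier positions, corresponds to a lower-triangular boolean matrix. A small sketch, again assuming PyTorch:

```python
import torch

seq_len = 5
# Row i is the mask for query position i: it may attend to keys 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False, False, False],
#         [ True,  True, False, False, False],
#         [ True,  True,  True, False, False],
#         [ True,  True,  True,  True, False],
#         [ True,  True,  True,  True,  True]])
```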

Masks are applied to the raw attention scores before the softmax step. These scores measure how similar each query is to every key. The mask sets unwanted scores, such as those for future tokens (causal mask) or padding tokens (padding mask), to a very large negative number like negative infinity. When softmax is applied, these masked scores become zero probabilities. This means the model cannot attend to these positions. Causal masks block attention to future tokens, while padding masks ignore positions filled with padding, ensuring the model focuses only on valid input.
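
Putting these pieces together, the sketch below applies a combined causal-and-padding mask to the raw scores before softmax, as described above. It assumes PyTorch; the masked_attention helper and the toy shapes are illustrative, not a library API:

```python
import math
import torch

def masked_attention(q, k, v, mask):
    # q, k, v: (batch, seq_len, d_k); mask: broadcastable to (batch, seq_len, seq_len),
    # where True means "may attend" and False means "blocked".
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # raw similarity scores
    scores = scores.masked_fill(~mask, float("-inf"))  # blocked positions get -inf
    weights = torch.softmax(scores, dim=-1)            # blocked scores become 0 probability
    return weights @ v

# Toy example combining a causal mask with a padding mask.
batch, seq_len, d_k = 1, 4, 8
q = k = v = torch.randn(batch, seq_len, d_k)

causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # (seq_len, seq_len)
padding = torch.tensor([[True, True, True, False]])                  # last token is padding
mask = causal.unsqueeze(0) & padding.unsqueeze(1)                    # (batch, seq_len, seq_len)

output = masked_attention(q, k, v, mask)
print(output.shape)  # torch.Size([1, 4, 8])
```

Because the blocked scores are set to negative infinity before softmax, their attention weights come out as exactly zero, so padded and future positions contribute nothing to the output.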


