Encoder and Decoder Blocks Explained
Both encoder and decoder blocks are built from the same basic ingredients – self-attention, feed-forward layers, residual connections, and layer normalization – but they arrange them differently to serve distinct roles.
Encoder Block
The encoder reads the full input sequence and builds a rich contextual representation of it. Each block applies the following sublayers in order:
- Multi-head self-attention – every token attends to all others, gathering context from the entire sequence;
- Add & layer norm – the attention output is added to the input (residual connection) and normalized;
- Feed-forward network – a two-layer MLP applied independently to each position;
- Add & layer norm – residual connection and normalization after the feed-forward sublayer.
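The four sublayers above can be sketched in a few lines of numpy. This is a minimal single-head illustration with the post-norm ordering of the original transformer; the Q/K/V projections are omitted and the feed-forward weights are random placeholders, so it shows the data flow, not a trained model.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each position's feature vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x):
    # every token attends to every other token (no mask)
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

def encoder_block(x, W1, b1, W2, b2):
    # 1) multi-head self-attention (here: one head), 2) add & layer norm
    x = layer_norm(x + self_attention(x))
    # 3) position-wise two-layer MLP with ReLU, 4) add & layer norm
    ff = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ff)

rng = np.random.default_rng(0)
T, d = 5, 16                                  # sequence length, model dim
x = rng.standard_normal((T, d))
W1, b1 = rng.standard_normal((d, 4 * d)) * 0.1, np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)) * 0.1, np.zeros(d)

out = encoder_block(x, W1, b1, W2, b2)
print(out.shape)  # (5, 16): every sublayer preserves the sequence shape
```

Note that each sublayer maps a `(T, d)` matrix to a `(T, d)` matrix, which is what makes the residual additions and block stacking possible.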
Decoder Block
The decoder generates the output sequence one token at a time, conditioned on the encoder's output. Its block makes two key changes to the encoder recipe, applying the following sublayers in order:
- Masked multi-head self-attention – each token can only attend to earlier positions in the output, preventing the model from seeing future tokens during training;
- Add & layer norm;
- Cross-attention – the decoder attends to the encoder's output, letting it focus on relevant parts of the input sequence;
- Add & layer norm;
- Feed-forward network;
- Add & layer norm.
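A decoder block can be sketched the same way. Again this is a hedged single-head illustration with placeholder weights: the causal mask enforces left-to-right information flow, and cross-attention takes its queries from the decoder but its keys and values from the encoder output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores) @ v

def decoder_block(y, enc_out, W1, W2):
    T = y.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))      # True = may attend
    # masked self-attention: token t sees only positions <= t
    y = layer_norm(y + attention(y, y, y, causal))
    # cross-attention: queries from decoder, keys/values from encoder
    y = layer_norm(y + attention(y, enc_out, enc_out))
    # position-wise feed-forward (ReLU, biases omitted for brevity)
    ff = np.maximum(0.0, y @ W1) @ W2
    return layer_norm(y + ff)

rng = np.random.default_rng(1)
T, S, d = 4, 6, 16                 # target length, source length, model dim
y = rng.standard_normal((T, d))
enc_out = rng.standard_normal((S, d))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1

out = decoder_block(y, enc_out, W1, W2)

# Perturbing the last target token leaves earlier positions unchanged,
# because the causal mask hides future tokens from them.
y2 = y.copy()
y2[-1] += 1.0
out2 = decoder_block(y2, enc_out, W1, W2)
print(np.allclose(out[:-1], out2[:-1]))  # True
```

The final check is the whole point of the mask: at training time the model can process all target positions in parallel without any position peeking at tokens it will later have to predict.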
Key Differences
The encoder uses only self-attention over the full input, so every token can see every other token; this unrestricted view is what lets each position aggregate context from the entire sequence. The decoder adds two mechanisms on top: masked self-attention, which restricts information flow so that during training a prediction at one position cannot depend on future positions, and cross-attention, which lets the decoder focus on relevant parts of the encoder's output, bridging the input and output sequences.
This structure is what makes the transformer effective for tasks like translation: the encoder builds a full understanding of the source sentence, and the decoder generates the target sentence while attending to that understanding at each step.
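The difference between the two attention patterns is easy to see in the attention-weight matrices themselves. A small sketch (random inputs, no learned projections, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.standard_normal((T, d))

def attn_weights(x, mask=None):
    # raw attention weights: rows are queries, columns are keys
    scores = x @ x.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # masked entries get weight 0
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

full = attn_weights(x)                                     # encoder-style
masked = attn_weights(x, np.tril(np.ones((T, T), bool)))   # decoder-style

print(np.round(masked, 2))
# the upper triangle is exactly zero: token t attends only to positions <= t,
# while in `full` every entry is positive and each row still sums to 1
```

Row 0 of the masked matrix is `[1, 0, 0, 0]`: the first token can only attend to itself, exactly the left-to-right constraint described above.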