Encoder and Decoder Blocks Explained
Both encoder and decoder blocks are built from the same basic ingredients – self-attention, feed-forward layers, residual connections, and layer normalization – but they arrange them differently to serve distinct roles.
Encoder Block
The encoder reads the full input sequence and builds a rich contextual representation of it. Each block applies the following sublayers in order:
- Multi-head self-attention – every token attends to all others, gathering context from the entire sequence;
- Add & layer norm – the attention output is added to the input (residual connection) and normalized;
- Feed-forward network – a two-layer MLP applied independently to each position;
- Add & layer norm – residual connection and normalization after the feed-forward sublayer.
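The four sublayers above can be sketched in a minimal single-head NumPy implementation. This is an illustrative sketch, not a production transformer: multi-head attention is reduced to a single head, and all weight matrices and dimensions are made-up placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each position's features to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    # every token attends to every other token (no mask)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # scaled dot-product
    return softmax(scores) @ v

def encoder_block(x, Wq, Wk, Wv, W1, W2):
    # sublayers 1-2: self-attention, then residual add + layer norm
    x = layer_norm(x + self_attention(x, Wq, Wk, Wv))
    # sublayers 3-4: position-wise two-layer MLP, then residual add + layer norm
    ff = np.maximum(x @ W1, 0) @ W2           # ReLU between the two layers
    return layer_norm(x + ff)

# toy run: 5 tokens, model dimension 8 (all values are random placeholders)
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, 4 * d)) * 0.1       # hidden layer is 4x wider, as in the original paper
W2 = rng.normal(size=(4 * d, d)) * 0.1
out = encoder_block(x, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (5, 8) – same shape in and out, so blocks can be stacked
```

Note that the output has the same shape as the input, which is what lets encoder blocks be stacked one after another.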
Decoder Block
The decoder generates the output sequence one token at a time, conditioned on the encoder's output. It extends the encoder block with two key changes:
- Masked multi-head self-attention – each token can only attend to earlier positions in the output, preventing the model from seeing future tokens during training;
- Add & layer norm;
- Cross-attention – the decoder attends to the encoder's output, letting it focus on relevant parts of the input sequence;
- Add & layer norm;
- Feed-forward network;
- Add & layer norm.
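Extending the same sketch to a decoder block shows the two additions concretely: a lower-triangular causal mask in self-attention, and a cross-attention sublayer whose queries come from the decoder while keys and values come from the encoder output. As before, this is a single-head illustrative sketch with placeholder weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(q_in, kv_in, Wq, Wk, Wv, mask=None):
    # generic attention: queries from q_in, keys/values from kv_in
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    return softmax(scores) @ v

def decoder_block(y, enc_out, p):
    # 1) masked self-attention: position i may only attend to positions <= i
    T = y.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))
    y = layer_norm(y + attention(y, y, *p["self"], mask=causal))
    # 2) cross-attention: decoder queries attend over the encoder's output
    y = layer_norm(y + attention(y, enc_out, *p["cross"]))
    # 3) position-wise feed-forward
    ff = np.maximum(y @ p["ff"][0], 0) @ p["ff"][1]
    return layer_norm(y + ff)

# toy run: 6 source tokens from the encoder, 4 target tokens generated so far
rng = np.random.default_rng(0)
d = 8
w = lambda *s: rng.normal(size=s) * 0.1
p = {"self":  (w(d, d), w(d, d), w(d, d)),
     "cross": (w(d, d), w(d, d), w(d, d)),
     "ff":    (w(d, 4 * d), w(4 * d, d))}
enc_out = rng.normal(size=(6, d))
y = rng.normal(size=(4, d))
out = decoder_block(y, enc_out, p)
print(out.shape)  # (4, 8) – one vector per target position
```

Cross-attention is the only place where the source and target sequences meet: its attention matrix has one row per target position and one column per source position.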
Key Differences
The encoder uses only self-attention over the full input: every token can see every other token, which is essential for building a contextual representation. The decoder adds two mechanisms on top. Masked self-attention restricts the flow of information so that, during training, predictions for a position cannot depend on future positions, enforcing left-to-right generation. Cross-attention lets the decoder focus on relevant parts of the encoder's output, bridging the input and output sequences.
This structure is what makes the transformer effective for tasks like translation: the encoder builds a full understanding of the source sentence, and the decoder generates the target sentence while attending to that understanding at each step.
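The causality guarantee can be checked directly: with a causal mask, perturbing a later token leaves the outputs at earlier positions untouched, while unmasked self-attention lets the change leak everywhere. A minimal demo (projections omitted so the mask itself is the only variable):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, causal=True):
    # no learned projections here – the demo isolates the effect of the mask
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if causal:
        scores = np.where(np.tril(np.ones(scores.shape, dtype=bool)), scores, -1e9)
    return softmax(scores) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
x2 = x.copy()
x2[4] += 10.0                        # perturb only the LAST token

causal_a, causal_b = self_attention(x), self_attention(x2)
full_a, full_b = self_attention(x, causal=False), self_attention(x2, causal=False)

# with the mask, positions 0..3 are unaffected by the change at position 4
print(np.allclose(causal_a[:4], causal_b[:4]))   # True
# without it, the perturbation leaks into every position
print(np.allclose(full_a[:4], full_b[:4]))       # False
```

This is exactly the property that makes teacher-forced training valid: the prediction at each position provably never depends on the tokens it is supposed to predict.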