Encoder and Decoder Blocks Explained
Both encoder and decoder blocks are built from the same basic ingredients – self-attention, feed-forward layers, residual connections, and layer normalization – but they arrange them differently to serve distinct roles.
Encoder Block
The encoder reads the full input sequence and builds a rich contextual representation of it. Each block applies the following sublayers in order:
- Multi-head self-attention – every token attends to all others, gathering context from the entire sequence;
- Add & layer norm – the attention output is added to the input (residual connection) and normalized;
- Feed-forward network – a two-layer MLP applied independently to each position;
- Add & layer norm – residual connection and normalization after the feed-forward sublayer.
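The four sublayers above can be sketched in a few lines of numpy. This is a minimal single-head illustration with the post-norm ordering of the original transformer; the Q/K/V projections are omitted and the feed-forward weights are random placeholders, so it shows the data flow, not a trained model.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each position's feature vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x):
    # every token attends to every other token (no mask)
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

def encoder_block(x, W1, b1, W2, b2):
    # 1) multi-head self-attention (here: one head), 2) add & layer norm
    x = layer_norm(x + self_attention(x))
    # 3) position-wise two-layer MLP with ReLU, 4) add & layer norm
    ff = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ff)

rng = np.random.default_rng(0)
T, d = 5, 16                                  # sequence length, model dim
x = rng.standard_normal((T, d))
W1, b1 = rng.standard_normal((d, 4 * d)) * 0.1, np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)) * 0.1, np.zeros(d)

out = encoder_block(x, W1, b1, W2, b2)
print(out.shape)  # (5, 16): every sublayer preserves the sequence shape
```

Note that each sublayer maps a `(T, d)` matrix to a `(T, d)` matrix, which is what makes the residual additions and block stacking possible.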
Decoder Block
The decoder generates the output sequence one token at a time, conditioned on the encoder's output. Its block makes two key changes to the encoder recipe, applying the following sublayers in order:
- Masked multi-head self-attention – each token can only attend to earlier positions in the output, preventing the model from seeing future tokens during training;
- Add & layer norm;
- Cross-attention – the decoder attends to the encoder's output, letting it focus on relevant parts of the input sequence;
- Add & layer norm;
- Feed-forward network;
- Add & layer norm.
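A decoder block can be sketched the same way. Again this is a hedged single-head illustration with placeholder weights: the causal mask enforces left-to-right information flow, and cross-attention takes its queries from the decoder but its keys and values from the encoder output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def attention(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores) @ v

def decoder_block(y, enc_out, W1, W2):
    T = y.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))      # True = may attend
    # masked self-attention: token t sees only positions <= t
    y = layer_norm(y + attention(y, y, y, causal))
    # cross-attention: queries from decoder, keys/values from encoder
    y = layer_norm(y + attention(y, enc_out, enc_out))
    # position-wise feed-forward (ReLU, biases omitted for brevity)
    ff = np.maximum(0.0, y @ W1) @ W2
    return layer_norm(y + ff)

rng = np.random.default_rng(1)
T, S, d = 4, 6, 16                 # target length, source length, model dim
y = rng.standard_normal((T, d))
enc_out = rng.standard_normal((S, d))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1

out = decoder_block(y, enc_out, W1, W2)

# Perturbing the last target token leaves earlier positions unchanged,
# because the causal mask hides future tokens from them.
y2 = y.copy()
y2[-1] += 1.0
out2 = decoder_block(y2, enc_out, W1, W2)
print(np.allclose(out[:-1], out2[:-1]))  # True
```

The final check is the whole point of the mask: at training time the model can process all target positions in parallel without any position peeking at tokens it will later have to predict.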
Key Differences
The encoder uses only self-attention over the full input, so every token can see every other token; this unrestricted view is what lets each position aggregate context from the entire sequence. The decoder adds two mechanisms on top: masked self-attention, which restricts information flow so that during training a prediction at one position cannot depend on future positions, and cross-attention, which lets the decoder focus on relevant parts of the encoder's output, bridging the input and output sequences.
This structure is what makes the transformer effective for tasks like translation: the encoder builds a full understanding of the source sentence, and the decoder generates the target sentence while attending to that understanding at each step.
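The difference between the two attention patterns is easy to see in the attention-weight matrices themselves. A small sketch (random inputs, no learned projections, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.standard_normal((T, d))

def attn_weights(x, mask=None):
    # raw attention weights: rows are queries, columns are keys
    scores = x @ x.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # masked entries get weight 0
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

full = attn_weights(x)                                     # encoder-style
masked = attn_weights(x, np.tril(np.ones((T, T), bool)))   # decoder-style

print(np.round(masked, 2))
# the upper triangle is exactly zero: token t attends only to positions <= t,
# while in `full` every entry is positive and each row still sums to 1
```

Row 0 of the masked matrix is `[1, 0, 0, 0]`: the first token can only attend to itself, exactly the left-to-right constraint described above.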