Transformer Architecture

Layer Normalization and Feed-Forward Sublayers

Layer Normalization

Layer normalization stabilizes training by normalizing activations across the feature dimension of each individual sample — not across the batch. This keeps the distribution of activations consistent as data flows through stacked transformer layers, which is critical for stable gradients.
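A minimal sketch of this behavior: `nn.LayerNorm` normalizes over the last (feature) dimension, so each position of each sample ends up with roughly zero mean and unit variance across its features, independent of the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4, 8)  # (batch, seq_len, d_model)

# LayerNorm over the feature dimension (d_model = 8).
ln = nn.LayerNorm(8)
y = ln(x)

# Each (sample, position) vector is normalized across its 8 features:
# mean ~ 0 and variance ~ 1, regardless of what the rest of the batch looks like.
print(y.mean(dim=-1).abs().max())            # close to 0
print(y.var(dim=-1, unbiased=False).mean())  # close to 1
```

At initialization, `nn.LayerNorm`'s learnable scale and shift are 1 and 0, so the output is exactly the normalized input.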

In a transformer block, layer norm is applied twice: once around the self-attention sublayer and once around the feed-forward sublayer. Depending on the implementation, it can be placed before the sublayer (pre-norm) or after it (post-norm). Modern transformers typically use pre-norm.
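The two placements can be sketched as follows. Here `sublayer` is a stand-in for either self-attention or the feed-forward MLP; a plain `nn.Linear` is used only as a placeholder so the snippet runs.

```python
import torch
import torch.nn as nn

def pre_norm_block(x, sublayer, norm):
    # Pre-norm: normalize the input, apply the sublayer, then add the residual.
    return x + sublayer(norm(x))

def post_norm_block(x, sublayer, norm):
    # Post-norm (original Transformer): apply the sublayer, add the residual,
    # then normalize the sum.
    return norm(x + sublayer(x))

d_model = 16
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # placeholder for attention or FFN
x = torch.rand(2, 5, d_model)

print(pre_norm_block(x, sublayer, norm).shape)   # torch.Size([2, 5, 16])
print(post_norm_block(x, sublayer, norm).shape)  # torch.Size([2, 5, 16])
```

Pre-norm keeps the residual path free of normalization, which tends to make deep stacks easier to train; post-norm matches the original Transformer paper.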

Feed-Forward Sublayer

The feed-forward sublayer is a two-layer MLP applied independently to each position in the sequence:

  1. A linear projection from d_model to d_ff (typically d_ff = 4 × d_model);
  2. A non-linear activation (ReLU or GELU);
  3. A linear projection back to d_model.

The residual connection wraps the entire sublayer — the input is added to the output before passing to the next sublayer.

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm: normalize before the sublayer
        norm_x = self.norm(x)
        ff_output = self.linear2(self.relu(self.linear1(norm_x)))

        # Residual connection: add input to sublayer output
        return x + ff_output


ff = FeedForward(d_model=512, d_ff=2048)
x = torch.rand(2, 10, 512)
print(ff(x).shape)  # Expected: torch.Size([2, 10, 512])

Run this locally and experiment with different d_ff values — notice the output shape stays (batch, seq_len, d_model) regardless.
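One thing worth checking while experimenting: changing d_ff changes the parameter count linearly but never the output shape, since the second projection always maps back to d_model.

```python
import torch.nn as nn

# Sketch: parameter count of the two FFN projections for several d_ff values.
# Each projection contributes d_model * d_ff weights plus its bias.
d_model = 512
for d_ff in (1024, 2048, 4096):
    ffn = nn.Sequential(
        nn.Linear(d_model, d_ff),
        nn.ReLU(),
        nn.Linear(d_ff, d_model),
    )
    n_params = sum(p.numel() for p in ffn.parameters())
    print(f"d_ff={d_ff}: {n_params} parameters")
```

For d_ff = 2048 this gives 2 × 512 × 2048 + 2048 + 512 = 2,099,712 parameters, and the feed-forward sublayers account for a large share of a transformer's total parameters.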

