Layer Normalization and Feed-Forward Sublayers
Layer Normalization
Layer normalization stabilizes training by normalizing activations across the feature dimension of each individual sample — not across the batch. This keeps the distribution of activations consistent as data flows through stacked transformer layers, which is critical for stable gradients.
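This per-sample behavior is easy to verify directly. The sketch below (with arbitrary illustrative sizes: batch 2, sequence length 4, 8 features) checks that `nn.LayerNorm` gives every position roughly zero mean and unit standard deviation across its feature dimension, without ever touching batch statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 4, 8)  # (batch, seq_len, d_model) with toy sizes

# nn.LayerNorm(8) normalizes over the last (feature) dimension of each
# position independently -- other samples in the batch play no role
norm = nn.LayerNorm(8)
y = norm(x)

# Each position's features now have ~0 mean and ~unit standard deviation
print(y.mean(dim=-1))                  # close to 0 everywhere
print(y.std(dim=-1, unbiased=False))   # close to 1 everywhere
```

Because the statistics are computed per position, the result is identical whether the batch holds one sample or a thousand, which is one reason layer norm suits variable-length sequence models better than batch norm.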
In a transformer block, layer norm is applied twice: once around the self-attention sublayer and once around the feed-forward sublayer. Depending on the implementation, it can be placed before the sublayer (pre-norm) or after it (post-norm). Modern transformers typically use pre-norm.
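The difference between the two placements is just the order of operations around the residual connection. A minimal sketch, using a plain `nn.Linear` as a stand-in for either sublayer (the names `pre_norm_block` and `post_norm_block` are illustrative, not from any library):

```python
import torch
import torch.nn as nn

d_model = 16
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or the FFN

def pre_norm_block(x):
    # Pre-norm: normalize the input first, run the sublayer on the
    # normalized copy, then add the *unnormalized* input as the residual
    return x + sublayer(norm(x))

def post_norm_block(x):
    # Post-norm (original transformer): run the sublayer on the raw input,
    # then normalize the sum of input and sublayer output
    return norm(x + sublayer(x))

x = torch.randn(2, 10, d_model)
print(pre_norm_block(x).shape)   # torch.Size([2, 10, 16])
print(post_norm_block(x).shape)  # torch.Size([2, 10, 16])
```

In pre-norm, the residual path carries the raw input untouched through every layer, which tends to make deep stacks easier to train; in post-norm, every output passes through a normalization, which is the arrangement the original transformer paper used.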
Feed-Forward Sublayer
The feed-forward sublayer is a two-layer MLP applied independently to each position in the sequence:
- A linear projection from `d_model` to `d_ff` (typically `d_ff = 4 × d_model`);
- A non-linear activation (`ReLU` or `GELU`);
- A linear projection back to `d_model`.
The residual connection wraps the entire sublayer — the input is added to the output before passing to the next sublayer.
```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(d_ff, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm: normalize before the sublayer
        norm_x = self.norm(x)
        ff_output = self.linear2(self.relu(self.linear1(norm_x)))
        # Residual connection: add input to sublayer output
        return x + ff_output

ff = FeedForward(d_model=512, d_ff=2048)
x = torch.rand(2, 10, 512)
print(ff(x).shape)  # Expected: torch.Size([2, 10, 512])
```
Run this locally and experiment with different d_ff values — notice the output shape stays (batch, seq_len, d_model) regardless.