Vanishing and Exploding Gradients
Before you can train deep neural networks effectively, you must understand the challenges posed by vanishing and exploding gradients. These phenomena arise during backpropagation, when gradients are calculated layer by layer from the output back to the input. If the gradients become extremely small, they are said to vanish; if they grow uncontrollably large, they explode. Both situations can severely hinder the training of deep models. Vanishing gradients make weight updates negligible, causing earlier layers to learn very slowly or not at all. Exploding gradients, on the other hand, can cause unstable updates that lead to numerical overflow or erratic training dynamics.
import numpy as np
import matplotlib.pyplot as plt

# Simulate a deep network: a stack of linear layers with a tanh activation
np.random.seed(42)

depth = 20             # Number of layers
input_dim = 1
output_dim = 1
x = np.array([[1.0]])  # Single input

scales = [0.5, 1.0, 1.5]
gradients_per_scale = []

for scale in scales:
    w = [np.random.randn(input_dim, output_dim) * scale for _ in range(depth)]
    a = x.copy()
    activations = [a]

    # Forward pass
    for i in range(depth):
        a = np.tanh(a @ w[i])
        activations.append(a)

    # Backward pass (track gradient magnitude at each layer)
    grad = np.ones_like(activations[-1])
    grads = [np.linalg.norm(grad)]  # Gradient magnitude at the output
    for i in reversed(range(depth)):
        grad = grad * (1 - activations[i + 1] ** 2)  # Derivative of tanh at layer i
        grad = grad @ w[i].T
        grads.append(np.linalg.norm(grad))
    gradients_per_scale.append(grads[::-1])  # Reverse to match layer order

# Plot gradient magnitudes for each weight scale
plt.figure(figsize=(8, 5))
for grads, scale in zip(gradients_per_scale, scales):
    plt.plot(range(depth + 1), grads, marker='o', label=f'Weight scale: {scale}')
plt.yscale('log')
plt.xlabel('Layer')
plt.ylabel('Gradient magnitude (log scale)')
plt.title('Gradient Magnitude Across Layers')
plt.legend()
plt.grid(True, which='both', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
The simulation above demonstrates how the choice of weight scale affects gradient propagation in a deep network. When weights are too small, repeated multiplication and activation derivatives tend to shrink the gradients exponentially as they move backward, leading to vanishing gradients. When weights are too large, gradients can grow rapidly and explode. Both effects become more pronounced as the network depth increases.
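The core mechanism is easy to see with plain numbers: each backward step multiplies the gradient by roughly the same per-layer factor (weight times activation derivative), so the total effect grows or shrinks exponentially with depth. The snippet below is a minimal sketch of this idea, using assumed per-layer factors rather than values measured from the simulation above.

depth = 20
for factor in [0.5, 1.0, 1.5]:  # Assumed effective per-layer gradient factors
    print(f"Per-layer factor {factor}: total gradient scaling over {depth} layers = {factor ** depth:.2e}")

A factor of 0.5 shrinks the gradient by roughly six orders of magnitude over twenty layers, while a factor of 1.5 inflates it by more than three; only factors close to 1 keep the gradient in a usable range.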
To address these issues, you can use several mitigation strategies:
- Choose careful weight initialization schemes, such as Xavier or He initialization, to keep gradients in a healthy range (a short sketch follows this list);
- Apply normalization techniques like batch normalization to stabilize activations and gradients;
- Use activation functions less prone to extreme derivatives, such as ReLU, which can help maintain gradient flow;
- Limit network depth or use architectural innovations such as residual connections to ease gradient propagation.
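To illustrate the first strategy, the sketch below reruns a tanh stack like the one above, but with wider layers (a width of 64 is an assumed value) so that the fan-in matters. It compares an arbitrarily small weight scale against the Xavier/Glorot scale sqrt(1 / fan_in); the helper gradient_norms is hypothetical, written just for this comparison.

import numpy as np

np.random.seed(0)
depth, width = 20, 64
x = np.random.randn(1, width)

def gradient_norms(init_scale):
    # Forward and backward through a stack of tanh layers; return the
    # gradient magnitude recorded at every layer, input layer first.
    w = [np.random.randn(width, width) * init_scale for _ in range(depth)]
    a, activations = x.copy(), [x.copy()]
    for i in range(depth):  # Forward pass
        a = np.tanh(a @ w[i])
        activations.append(a)
    grad = np.ones_like(a)
    norms = [np.linalg.norm(grad)]
    for i in reversed(range(depth)):  # Backward pass
        grad = grad * (1 - activations[i + 1] ** 2)  # Derivative of tanh
        grad = grad @ w[i].T
        norms.append(np.linalg.norm(grad))
    return norms[::-1]

naive = gradient_norms(0.05)                   # Too small: gradients vanish
xavier = gradient_norms(np.sqrt(1.0 / width))  # Xavier/Glorot scale for tanh
print(f"First-layer gradient with naive init:  {naive[0]:.3e}")
print(f"First-layer gradient with Xavier init: {xavier[0]:.3e}")

With these assumed settings, the poorly scaled network's first-layer gradient is many orders of magnitude smaller than with the Xavier scale. Batch normalization, ReLU activations, and residual connections attack the same problem from different angles rather than through the initialization itself.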
Understanding and addressing vanishing and exploding gradients is essential for building and training deep neural networks that learn effectively.