Optimization and Regularization in Neural Networks with Python

Vanishing and Exploding Gradients

Before you can train deep neural networks effectively, you must understand the challenges posed by vanishing and exploding gradients. These phenomena arise during backpropagation, when gradients are calculated layer by layer from the output back to the input. If the gradients become extremely small, they are said to vanish; if they grow uncontrollably large, they explode. Both situations can severely hinder the training of deep models. Vanishing gradients make weight updates negligible, causing earlier layers to learn very slowly or not at all. Exploding gradients, on the other hand, can cause unstable updates that lead to numerical overflow or erratic training dynamics.
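
To see why depth makes these effects exponential, imagine each layer scaling the gradient by roughly the same constant factor. A quick back-of-the-envelope check (a simplified illustration, not the full simulation below) shows how fast twenty such multiplications shrink or grow a value:

depth = 20
for factor in (0.5, 1.0, 1.5):
    # Repeated multiplication by a per-layer factor compounds exponentially with depth
    print(f"factor {factor}: gradient scaled by {factor ** depth:.2e} after {depth} layers")
# factor 0.5 -> about 9.5e-07 (vanishing); factor 1.5 -> about 3.3e+03 (exploding)

The simulation below makes the same point with an actual forward and backward pass through a 20-layer tanh network.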

import numpy as np
import matplotlib.pyplot as plt

# Simulate a deep network: a stack of linear layers with a simple activation
np.random.seed(42)

depth = 20  # Number of layers
input_dim = 1
output_dim = 1
x = np.array([[1.0]])  # Single input

scales = [0.5, 1.0, 1.5]
gradients_per_scale = []

for scale in scales:
    w = [np.random.randn(input_dim, output_dim) * scale for _ in range(depth)]
    a = x.copy()
    activations = [a]

    # Forward pass
    for i in range(depth):
        a = np.tanh(a @ w[i])
        activations.append(a)

    # Backward pass (track gradient magnitude at each layer)
    grad = np.ones_like(activations[-1])
    grads = [np.linalg.norm(grad)]  # Gradient magnitude at the output
    for i in reversed(range(depth)):
        grad = grad * (1 - activations[i + 1] ** 2)  # Derivative of tanh at this layer's output
        grad = grad @ w[i].T
        grads.append(np.linalg.norm(grad))
    gradients_per_scale.append(grads[::-1])  # Reverse to match layer order

# Plot gradient magnitudes for each scale
plt.figure(figsize=(8, 5))
for grads, scale in zip(gradients_per_scale, scales):
    plt.plot(range(depth + 1), grads, marker='o', label=f'Weight scale: {scale}')
plt.yscale('log')
plt.xlabel('Layer')
plt.ylabel('Gradient magnitude (log scale)')
plt.title('Gradient Magnitude Across Layers')
plt.legend()
plt.grid(True, which='both', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

The simulation above demonstrates how the choice of weight scale affects gradient propagation in a deep network. When weights are too small, repeated multiplication and activation derivatives tend to shrink the gradients exponentially as they move backward, leading to vanishing gradients. When weights are too large, gradients can grow rapidly and explode. Both effects become more pronounced as the network depth increases.

To address these issues, you can use several mitigation strategies:

  • Choose careful weight initialization schemes, such as Xavier or He initialization, to keep gradients in a healthy range (a sketch of this idea follows the list);
  • Apply normalization techniques like batch normalization to stabilize activations and gradients;
  • Use activation functions less prone to extreme derivatives, such as ReLU, which can help maintain gradient flow;
  • Limit network depth or use architectural innovations such as residual connections to ease gradient propagation (a sketch of this appears at the end of the chapter).
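
As a concrete illustration of the first point, the sketch below reruns a deeper, wider version of the tanh stack and compares a fixed weight scale against Xavier-style scaling. The 64-unit layer width and the first_layer_grad_norm helper are illustrative choices for this example, not part of the lesson code:

import numpy as np

np.random.seed(0)
depth, dim = 20, 64
x = np.random.randn(1, dim)

def first_layer_grad_norm(scale):
    """Gradient norm reaching the first layer of a depth-layer tanh stack."""
    weights = [np.random.randn(dim, dim) * scale for _ in range(depth)]
    a = x
    activations = [a]
    for w in weights:                    # forward pass
        a = np.tanh(a @ w)
        activations.append(a)
    grad = np.ones_like(a)
    for i in reversed(range(depth)):     # backward pass
        grad = (grad * (1 - activations[i + 1] ** 2)) @ weights[i].T
    return np.linalg.norm(grad)

print("Fixed scale 1.0:       ", first_layer_grad_norm(1.0))                 # saturates tanh, so the gradient vanishes
print("Xavier-style 1/sqrt(n):", first_layer_grad_norm(np.sqrt(1.0 / dim)))  # stays in a healthier range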

Understanding and addressing vanishing and exploding gradients is essential for building and training deep neural networks that learn effectively.
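
To illustrate the last bullet above, the following sketch (again an illustrative setup with assumed dimensions, not the lesson's code) compares the gradient reaching the first layer of a plain tanh stack with the same stack wrapped in residual connections, where each layer computes a + tanh(a @ w) instead of tanh(a @ w):

import numpy as np

np.random.seed(1)
depth, dim = 20, 64
x = np.random.randn(1, dim)
weights = [np.random.randn(dim, dim) / np.sqrt(dim) for _ in range(depth)]

def first_layer_grad_norm(use_residual):
    a = x
    branch_outputs = []                  # tanh outputs, needed for the tanh derivative
    for w in weights:                    # forward pass
        h = np.tanh(a @ w)
        branch_outputs.append(h)
        a = a + h if use_residual else h
    grad = np.ones_like(a)
    for i in reversed(range(depth)):     # backward pass
        branch_grad = (grad * (1 - branch_outputs[i] ** 2)) @ weights[i].T
        # The identity (skip) path passes the incoming gradient through unchanged
        grad = grad + branch_grad if use_residual else branch_grad
    return np.linalg.norm(grad)

print("Plain stack:   ", first_layer_grad_norm(False))
print("Residual stack:", first_layer_grad_norm(True))

Because the identity path gives every earlier layer a direct route for the gradient, residual stacks remain trainable at depths where a plain stack's gradients have already shrunk dramatically.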


