
Vanishing and Exploding Gradients

Before you can train deep neural networks effectively, you must understand the challenges posed by vanishing and exploding gradients. These phenomena arise during backpropagation, when gradients are calculated layer by layer from the output back to the input. If the gradients become extremely small, they are said to vanish; if they grow uncontrollably large, they explode. Both situations can severely hinder the training of deep models. Vanishing gradients make weight updates negligible, causing earlier layers to learn very slowly or not at all. Exploding gradients, on the other hand, can cause unstable updates that lead to numerical overflow or erratic training dynamics.

import numpy as np
import matplotlib.pyplot as plt

# Simulate a deep network: a stack of linear layers with a simple activation
np.random.seed(42)
depth = 20  # Number of layers
input_dim = 1
output_dim = 1
x = np.array([[1.0]])  # Single input

scales = [0.5, 1.0, 1.5]
gradients_per_scale = []

for scale in scales:
    w = [np.random.randn(input_dim, output_dim) * scale for _ in range(depth)]
    a = x.copy()
    activations = [a]
    # Forward pass
    for i in range(depth):
        a = np.tanh(a @ w[i])
        activations.append(a)
    # Backward pass (track gradient at each layer)
    grad = np.ones_like(activations[-1])
    grads = [np.linalg.norm(grad)]  # Store gradient magnitude at output
    for i in reversed(range(depth)):
        grad = grad * (1 - activations[i + 1] ** 2)  # Derivative of tanh, evaluated at this layer's output
        grad = grad @ w[i].T
        grads.append(np.linalg.norm(grad))
    gradients_per_scale.append(grads[::-1])  # Reverse to match layer order

# Plot gradient magnitudes for each scale
plt.figure(figsize=(8, 5))
for grads, scale in zip(gradients_per_scale, scales):
    plt.plot(range(depth + 1), grads, marker='o', label=f'Weight scale: {scale}')
plt.yscale('log')
plt.xlabel('Layer')
plt.ylabel('Gradient magnitude (log scale)')
plt.title('Gradient Magnitude Across Layers')
plt.legend()
plt.grid(True, which='both', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

The simulation above demonstrates how the choice of weight scale affects gradient propagation in a deep network. When weights are too small, repeated multiplication and activation derivatives tend to shrink the gradients exponentially as they move backward, leading to vanishing gradients. When weights are too large, gradients can grow rapidly and explode. Both effects become more pronounced as the network depth increases.
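
To see why depth matters, here is a back-of-envelope sketch (an illustration, not part of the simulation above): if each backward step scales the gradient magnitude by a roughly constant factor c, then after n layers the magnitude behaves like c raised to the power n.

# Illustrative only: assume every one of the 20 layers scales the
# gradient by the same constant factor c during backpropagation.
for c in [0.5, 1.0, 1.5]:
    print(f"c = {c}: gradient scaled by ~{c ** 20:.2e} after 20 layers")

With c = 0.5 the gradient shrinks to roughly a millionth of its original magnitude, while c = 1.5 inflates it by a factor of more than three thousand; this mirrors the vanishing and exploding curves in the plot.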

To address these issues, you can use several mitigation strategies:

  • Choose careful weight initialization schemes, such as Xavier or He initialization, to keep gradients in a healthy range (a short initialization sketch follows this list);
  • Apply normalization techniques like batch normalization to stabilize activations and gradients;
  • Use activation functions less prone to extreme derivatives, such as ReLU, which can help maintain gradient flow;
  • Limit network depth or use architectural innovations such as residual connections to ease gradient propagation (a residual-connection sketch closes this chapter).
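
To make the initialization strategy concrete, here is a minimal sketch that widens the toy stack from the simulation above so that fan-in matters. The hidden width of 64, the helper backward_gradient_norm, and the specific variance choices are illustrative assumptions, not part of this lesson's code: a common Xavier (Glorot) variant draws weights with variance 1/fan_in for tanh layers, while He initialization uses variance 2/fan_in for ReLU layers.

import numpy as np

np.random.seed(42)
depth, dim = 20, 64  # illustrative width; the simulation above uses dim = 1

def backward_gradient_norm(w, activation="tanh"):
    # Forward pass through the stack, then backpropagate a gradient of
    # ones and return its magnitude at the earliest layer.
    a = np.random.randn(1, dim)
    acts = [a]
    for wi in w:
        z = a @ wi
        a = np.tanh(z) if activation == "tanh" else np.maximum(z, 0.0)
        acts.append(a)
    grad = np.ones_like(a)
    for i in reversed(range(len(w))):
        if activation == "tanh":
            grad = grad * (1 - acts[i + 1] ** 2)  # tanh'(z) = 1 - tanh(z)^2
        else:
            grad = grad * (acts[i + 1] > 0)       # ReLU'(z) = 1 where z > 0
        grad = grad @ w[i].T
    return np.linalg.norm(grad)

# Fixed per-element scale (as in the simulation above) vs. fan-in-aware schemes.
fixed = [np.random.randn(dim, dim) * 0.5 for _ in range(depth)]
xavier = [np.random.randn(dim, dim) * np.sqrt(1.0 / dim) for _ in range(depth)]
he = [np.random.randn(dim, dim) * np.sqrt(2.0 / dim) for _ in range(depth)]

print("Fixed scale 0.5, tanh:", backward_gradient_norm(fixed, "tanh"))
print("Xavier init,     tanh:", backward_gradient_norm(xavier, "tanh"))
print("He init,         relu:", backward_gradient_norm(he, "relu"))

On a typical run, the fixed per-element scale of 0.5 saturates the tanh units once the layers are this wide and the gradient collapses toward zero, while the fan-in-aware schemes keep the gradient norm within a few orders of magnitude of its starting value across all 20 layers.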

Understanding and addressing vanishing and exploding gradients is essential for building and training deep neural networks that learn effectively.
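
Finally, to illustrate how residual connections ease gradient propagation, here is a hedged sketch that compares a plain tanh stack against one with skip connections of the form a + tanh(a @ w), the simplest residual variant. The small weight scale of 0.05 is chosen deliberately so that the plain stack vanishes.

import numpy as np

np.random.seed(42)
depth, dim = 20, 64
# Deliberately small weights, so the plain stack suffers vanishing gradients.
w = [np.random.randn(dim, dim) * 0.05 for _ in range(depth)]

def gradient_norm(residual):
    a = np.random.randn(1, dim)
    hs = []  # tanh branch outputs, needed for the backward pass
    for wi in w:
        h = np.tanh(a @ wi)
        hs.append(h)
        a = a + h if residual else h  # residual: keep an identity path
    grad = np.ones((1, dim))
    for i in reversed(range(depth)):
        branch = (grad * (1 - hs[i] ** 2)) @ w[i].T  # gradient through the tanh branch
        grad = grad + branch if residual else branch  # identity path passes gradient unchanged
    return np.linalg.norm(grad)

print("Plain stack:   ", gradient_norm(residual=False))
print("Residual stack:", gradient_norm(residual=True))

Because each residual block's Jacobian is the identity plus the branch term, the backward pass always keeps a direct path for the gradient, so it cannot be squashed to zero by the tanh branch alone.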

Quiz: Which of the following best describes the effect of vanishing gradients during deep neural network training? (Select the correct answer.)
