Vanishing and Exploding Gradients
Before you can train deep neural networks effectively, you must understand the challenges posed by vanishing and exploding gradients. These phenomena arise during backpropagation, when gradients are calculated layer by layer from the output back to the input. If the gradients become extremely small, they are said to vanish; if they grow uncontrollably large, they explode. Both situations can severely hinder the training of deep models. Vanishing gradients make weight updates negligible, causing earlier layers to learn very slowly or not at all. Exploding gradients, on the other hand, can cause unstable updates that lead to numerical overflow or erratic training dynamics.
```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate a deep network: a stack of linear layers with a tanh activation
np.random.seed(42)
depth = 20              # Number of layers
input_dim = 1
output_dim = 1
x = np.array([[1.0]])   # Single input
scales = [0.5, 1.0, 1.5]
gradients_per_scale = []

for scale in scales:
    w = [np.random.randn(input_dim, output_dim) * scale for _ in range(depth)]
    a = x.copy()
    activations = [a]

    # Forward pass
    for i in range(depth):
        a = np.tanh(a @ w[i])
        activations.append(a)

    # Backward pass (track gradient magnitude at each layer)
    grad = np.ones_like(activations[-1])
    grads = [np.linalg.norm(grad)]  # Gradient magnitude at the output
    for i in reversed(range(depth)):
        grad = grad * (1 - activations[i + 1] ** 2)  # Derivative of tanh at layer i's output
        grad = grad @ w[i].T                         # Propagate through the layer's weights
        grads.append(np.linalg.norm(grad))
    gradients_per_scale.append(grads[::-1])  # Reverse to match layer order

# Plot gradient magnitudes for each scale
plt.figure(figsize=(8, 5))
for grads, scale in zip(gradients_per_scale, scales):
    plt.plot(range(depth + 1), grads, marker='o', label=f'Weight scale: {scale}')
plt.yscale('log')
plt.xlabel('Layer')
plt.ylabel('Gradient magnitude (log scale)')
plt.title('Gradient Magnitude Across Layers')
plt.legend()
plt.grid(True, which='both', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
```
The simulation above demonstrates how the choice of weight scale affects gradient propagation in a deep network. When weights are too small, repeated multiplication and activation derivatives tend to shrink the gradients exponentially as they move backward, leading to vanishing gradients. When weights are too large, gradients can grow rapidly and explode. Both effects become more pronounced as the network depth increases.
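To see the compounding effect in isolation, the short snippet below (a minimal illustration, separate from the simulation above) multiplies a unit gradient by a fixed per-layer factor twenty times:

```python
# Repeated multiplication by a per-layer factor compounds exponentially with depth
depth = 20
for factor in [0.5, 1.0, 1.5]:
    print(f"factor {factor}: after {depth} layers -> {factor ** depth:.2e}")
# factor 0.5: after 20 layers -> 9.54e-07   (vanishing)
# factor 1.0: after 20 layers -> 1.00e+00   (stable)
# factor 1.5: after 20 layers -> 3.33e+03   (exploding)
```

A per-layer factor even slightly below or above 1 is enough to shrink or inflate the gradient by several orders of magnitude over 20 layers.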
To address these issues, you can use several mitigation strategies:
- Choose careful weight initialization schemes, such as Xavier or He initialization, to keep gradients in a healthy range (see the sketch after this list);
- Apply normalization techniques like batch normalization to stabilize activations and gradients;
- Use activation functions less prone to extreme derivatives, such as ReLU, which can help maintain gradient flow;
- Limit network depth or use architectural innovations such as residual connections to ease gradient propagation.
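As a concrete illustration of the first strategy, here is a minimal sketch of Xavier (Glorot) and He initialization for a fully connected layer. The helper functions and layer sizes are illustrative assumptions, not part of the simulation above; in that simulation they would replace the `np.random.randn(...) * scale` line.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot: weight variance 2 / (fan_in + fan_out), suited to tanh/sigmoid layers
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / (fan_in + fan_out))

def he_init(fan_in, fan_out):
    # He: weight variance 2 / fan_in, compensating for ReLU zeroing roughly half the activations
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

w_tanh = xavier_init(256, 256)
w_relu = he_init(256, 256)
print(w_tanh.std(), w_relu.std())  # roughly sqrt(2/512) ≈ 0.062 and sqrt(2/256) ≈ 0.088
```

The idea behind both schemes is to choose the weight variance so that the variance of activations and gradients stays roughly constant from layer to layer, rather than being scaled by a constant factor at every layer.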
Understanding and addressing vanishing and exploding gradients is essential for building and training deep neural networks that learn effectively.