Gradient Descent: The Core Optimizer | Foundations of Neural Network Optimization
Optimization and Regularization in Neural Networks with Python

Gradient Descent: The Core Optimizer

At the heart of neural network optimization lies gradient descent, a method designed to minimize a loss function by iteratively updating the network's parameters. The mathematical foundation of gradient descent is rooted in calculus: you compute the gradient (the vector of partial derivatives) of the loss with respect to each parameter, which points in the direction of steepest ascent. By moving in the opposite direction of this gradient, you descend toward a local minimum. The update rule for each parameter w at iteration t is:

w_{t+1} = w_t - \eta \cdot \frac{\partial L}{\partial w_t}

Here, w_t is the current parameter value, η is the learning rate, and ∂L/∂w_t is the gradient of the loss L with respect to w_t. The learning rate controls the step size for each update, determining how far you move in the direction opposite the gradient. This process is repeated until the loss converges to a minimum, ideally resulting in a well-trained neural network.
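To make the update rule concrete, here is a minimal sketch (not part of the lesson's code) that applies it to a single parameter on the toy function f(w) = (w - 3)^2, whose gradient is 2(w - 3), so the minimum sits at w = 3:

```python
eta = 0.1   # learning rate (step size)
w = 0.0     # initial parameter value

for t in range(50):
    grad = 2 * (w - 3)   # dL/dw for L(w) = (w - 3)^2
    w = w - eta * grad   # the update rule: w_{t+1} = w_t - eta * dL/dw_t

print(w)  # approaches the minimum at w = 3
```

Each step moves w a fraction of the way toward the minimum; with this learning rate the distance to w = 3 shrinks by a constant factor per iteration.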

import numpy as np
import matplotlib.pyplot as plt

# Generate toy data: y = 2x + 1 with noise
np.random.seed(42)
X = np.linspace(0, 1, 100)
y = 2 * X + 1 + np.random.normal(0, 0.1, size=X.shape)

learning_rates = [0.01, 0.1, 0.5]
epochs = 100
loss_histories = []
labels = []

for lr in learning_rates:
    # Initialize weights and bias for each run
    w = np.random.randn()
    b = np.random.randn()
    loss_history = []
    for epoch in range(epochs):
        y_pred = w * X + b
        loss = np.mean((y - y_pred) ** 2)
        loss_history.append(loss)
        dw = -2 * np.mean((y - y_pred) * X)
        db = -2 * np.mean(y - y_pred)
        w = w - lr * dw
        b = b - lr * db
    loss_histories.append(loss_history)
    labels.append(f"learning rate = {lr}")

# Plot training loss for each learning rate
plt.figure(figsize=(8, 4))
for loss_history, label in zip(loss_histories, labels):
    plt.plot(loss_history, label=label)
plt.title('Training Loss for Different Learning Rates')
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error Loss')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

The learning rate is a critical hyperparameter in gradient descent. If you set it too high, you risk overshooting the minimum, causing the loss to diverge or oscillate without settling. If it is too low, convergence becomes painfully slow, and the optimizer may stall on plateaus or in shallow local minima. By experimenting with the code sample above, you can observe how changing the values in the learning_rates list affects the speed and stability of optimization: a moderate learning rate gives smooth, efficient convergence, while extreme values lead to suboptimal or unstable training.
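As a quick check of the divergence claim, the sketch below (not part of the lesson's code; final_loss is a helper defined here for illustration) reruns the toy problem with a learning rate far above the range used in the example. Instead of shrinking, the loss explodes:

```python
import numpy as np

# Same toy data as the lesson: y = 2x + 1 with noise
np.random.seed(42)
X = np.linspace(0, 1, 100)
y = 2 * X + 1 + np.random.normal(0, 0.1, size=X.shape)

def final_loss(lr, epochs=100):
    """Run gradient descent and return the final MSE loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        y_pred = w * X + b
        dw = -2 * np.mean((y - y_pred) * X)
        db = -2 * np.mean(y - y_pred)
        w -= lr * dw
        b -= lr * db
    return np.mean((y - (w * X + b)) ** 2)

print(final_loss(0.5))  # within the example's range: loss stays small
print(final_loss(2.0))  # far too large: each step overshoots and the loss blows up
```

For this quadratic loss, divergence sets in once the learning rate exceeds roughly 2 divided by the largest curvature of the loss surface; 0.5 is below that threshold and 2.0 is well above it.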

Section 1. Chapter 2

Demandez à l'IA

expand

Demandez à l'IA

ChatGPT

Posez n'importe quelle question ou essayez l'une des questions suggérées pour commencer notre discussion

Suggested prompts:

Can you explain how the gradients are calculated in the code sample?

What happens if I use a learning rate outside the range shown in the example?

How do I choose an appropriate learning rate for my own neural network?
