Learn Gradient Descent: The Core Optimizer | Foundations of Neural Network Optimization
Optimization and Regularization in Neural Networks with Python

Gradient Descent: The Core Optimizer

At the heart of neural network optimization lies gradient descent, a method designed to minimize a loss function by iteratively updating the network's parameters. The mathematical foundation of gradient descent is rooted in calculus: you compute the gradient (or derivative) of the loss with respect to each parameter, which points in the direction of steepest ascent. By moving in the opposite direction of this gradient, you descend toward a local minimum. The update rule for each parameter $w$ at iteration $t$ is:

$$w_{t+1} = w_t - \eta \cdot \frac{\partial L}{\partial w_t}$$

Here, $w_t$ is the current parameter value, $\eta$ is the learning rate, and $\frac{\partial L}{\partial w_t}$ is the gradient of the loss $L$ with respect to $w_t$. The learning rate controls the step size for each update, determining how far you move in the direction opposite the gradient. This process is repeated until the loss function converges to a minimum, ideally resulting in a well-trained neural network.
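To make the update rule concrete, here is a minimal sketch (the loss function and all values are illustrative, not from the lesson's code) that applies the rule to a single weight for the loss $L(w) = (w - 3)^2$, whose gradient is $2(w - 3)$:

```python
def grad(w):
    # dL/dw for L(w) = (w - 3)**2; the minimum is at w = 3
    return 2 * (w - 3)

w = 0.0    # initial parameter value
eta = 0.1  # learning rate

for step in range(50):
    w = w - eta * grad(w)  # move against the gradient

print(round(w, 4))  # w approaches the minimum at 3
```

Each iteration shrinks the distance to the minimum by a constant factor of $1 - 2\eta = 0.8$, so the weight converges geometrically toward 3.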

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate toy data: y = 2x + 1 with noise
np.random.seed(42)
X = np.linspace(0, 1, 100)
y = 2 * X + 1 + np.random.normal(0, 0.1, size=X.shape)

learning_rates = [0.01, 0.1, 0.5]
epochs = 100
loss_histories = []
labels = []

for lr in learning_rates:
    # Initialize weights and bias for each run
    w = np.random.randn()
    b = np.random.randn()
    loss_history = []
    for epoch in range(epochs):
        y_pred = w * X + b
        loss = np.mean((y - y_pred) ** 2)
        loss_history.append(loss)
        dw = -2 * np.mean((y - y_pred) * X)
        db = -2 * np.mean(y - y_pred)
        w = w - lr * dw
        b = b - lr * db
    loss_histories.append(loss_history)
    labels.append(f"learning rate = {lr}")

# Plot training loss for each learning rate
plt.figure(figsize=(8, 4))
for loss_history, label in zip(loss_histories, labels):
    plt.plot(loss_history, label=label)
plt.title('Training Loss for Different Learning Rates')
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error Loss')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
```

The learning rate is a critical hyperparameter in gradient descent. If you set it too high, you risk overshooting the minimum, causing the loss to diverge or oscillate without settling. If it is too low, convergence becomes painfully slow, and training may stall on plateaus or in shallow local minima. By experimenting with the code sample above, you can observe how changing the values in the learning_rates list affects the speed and stability of optimization. A moderate learning rate ensures smooth and efficient convergence, while extreme values can lead to suboptimal or unstable training outcomes.
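As a quick numerical check of the divergence claim (a sketch; the quadratic loss and specific rates are chosen purely for illustration), compare a moderate and an overly large learning rate on $L(w) = w^2$. The gradient is $2w$, so each update multiplies $w$ by $1 - 2\eta$, and any $\eta > 1$ makes the iterates grow instead of shrink:

```python
def descend(lr, w=1.0, steps=30):
    # Gradient descent on L(w) = w**2, with gradient dL/dw = 2*w
    for _ in range(steps):
        w = w - lr * (2 * w)
    return w

small = descend(0.1)   # |1 - 2*0.1| = 0.8 < 1: converges toward 0
large = descend(1.1)   # |1 - 2*1.1| = 1.2 > 1: magnitude grows each step

print(f"lr=0.1 -> w = {small:.6f}")
print(f"lr=1.1 -> w = {large:.2f}")
```

After 30 steps the moderate rate has shrunk $w$ by a factor of $0.8^{30}$, while the large rate has blown it up by $1.2^{30}$, mirroring the diverging curves you would see in the plot above.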


Section 1. Chapter 2
