Learn Gradient Descent: The Core Optimizer | Foundations of Neural Network Optimization
Optimization and Regularization in Neural Networks with Python

Gradient Descent: The Core Optimizer

At the heart of neural network optimization lies gradient descent, a method designed to minimize a loss function by iteratively updating the network's parameters. The mathematical foundation of gradient descent is rooted in calculus: you compute the gradient (or derivative) of the loss with respect to each parameter, which points in the direction of steepest ascent. By moving in the opposite direction of this gradient, you descend toward a local minimum. The update rule for each parameter $w$ at iteration $t$ is:

$$w_{t+1} = w_t - \eta \cdot \frac{\partial L}{\partial w_t}$$

Here, $w_t$ is the current parameter value, $\eta$ is the learning rate, and $\frac{\partial L}{\partial w_t}$ is the gradient of the loss $L$ with respect to $w_t$. The learning rate controls the step size for each update, determining how far you move in the direction opposite the gradient. This process is repeated until the loss function converges to a minimum, ideally resulting in a well-trained neural network.
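To make the update rule concrete, here is a minimal sketch (the loss function and all values are illustrative, not from the lesson's code) that applies the rule to a single weight for the loss $L(w) = (w - 3)^2$, whose gradient is $2(w - 3)$:

```python
def grad(w):
    # dL/dw for L(w) = (w - 3)**2; the minimum is at w = 3
    return 2 * (w - 3)

w = 0.0    # initial parameter value
eta = 0.1  # learning rate

for step in range(50):
    w = w - eta * grad(w)  # move against the gradient

print(round(w, 4))  # w approaches the minimum at 3
```

Each iteration shrinks the distance to the minimum by a constant factor of $1 - 2\eta = 0.8$, so the weight converges geometrically toward 3.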

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate toy data: y = 2x + 1 with noise
np.random.seed(42)
X = np.linspace(0, 1, 100)
y = 2 * X + 1 + np.random.normal(0, 0.1, size=X.shape)

learning_rates = [0.01, 0.1, 0.5]
epochs = 100
loss_histories = []
labels = []

for lr in learning_rates:
    # Initialize weights and bias for each run
    w = np.random.randn()
    b = np.random.randn()
    loss_history = []
    for epoch in range(epochs):
        y_pred = w * X + b
        loss = np.mean((y - y_pred) ** 2)
        loss_history.append(loss)
        dw = -2 * np.mean((y - y_pred) * X)
        db = -2 * np.mean(y - y_pred)
        w = w - lr * dw
        b = b - lr * db
    loss_histories.append(loss_history)
    labels.append(f"learning rate = {lr}")

# Plot training loss for each learning rate
plt.figure(figsize=(8, 4))
for loss_history, label in zip(loss_histories, labels):
    plt.plot(loss_history, label=label)
plt.title('Training Loss for Different Learning Rates')
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error Loss')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
```

The learning rate is a critical hyperparameter in gradient descent. If you set it too high, you risk overshooting the minimum, causing the loss to diverge or oscillate without settling. If it is too low, convergence becomes painfully slow, and training may stall on plateaus or in shallow local minima. By experimenting with the code sample above, you can observe how changing the values in the learning_rates list affects the speed and stability of optimization. A moderate learning rate ensures smooth and efficient convergence, while extreme values can lead to suboptimal or unstable training outcomes.
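As a quick numerical check of the divergence claim (a sketch; the quadratic loss and specific rates are chosen purely for illustration), compare a moderate and an overly large learning rate on $L(w) = w^2$. The gradient is $2w$, so each update multiplies $w$ by $1 - 2\eta$, and any $\eta > 1$ makes the iterates grow instead of shrink:

```python
def descend(lr, w=1.0, steps=30):
    # Gradient descent on L(w) = w**2, with gradient dL/dw = 2*w
    for _ in range(steps):
        w = w - lr * (2 * w)
    return w

small = descend(0.1)   # |1 - 2*0.1| = 0.8 < 1: converges toward 0
large = descend(1.1)   # |1 - 2*1.1| = 1.2 > 1: magnitude grows each step

print(f"lr=0.1 -> w = {small:.6f}")
print(f"lr=1.1 -> w = {large:.2f}")
```

After 30 steps the moderate rate has shrunk $w$ by a factor of $0.8^{30}$, while the large rate has blown it up by $1.2^{30}$, mirroring the diverging curves you would see in the plot above.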


Section 1. Chapter 2
