Learn Stochastic Gradient Descent and Momentum | Optimization Algorithms in Practice


Stochastic gradient descent (SGD) is a widely used method for optimizing neural networks, building upon the foundation of basic gradient descent. In standard gradient descent, you update the model parameters by computing the gradient of the loss function with respect to all training data, then moving in the direction that reduces the loss. However, this can be computationally expensive for large datasets. SGD addresses this by updating parameters using the gradient from a single data point or a small batch, introducing randomness into each update. This allows for faster iterations and can help the optimizer escape shallow local minima.
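As a rough illustration of the difference between a full-data gradient and a mini-batch estimate, the sketch below fits a single parameter $\theta$ by minimizing the mean squared distance to a small synthetic dataset. The dataset, the function names (`full_gradient`, `minibatch_gradient`, `run_sgd`), and the hyperparameters are assumptions made for this example and are not part of the lesson's own code.

```python
import random

def full_gradient(theta, data):
    # Gradient of L(theta) = mean((theta - x)^2), computed over every data point
    return sum(2 * (theta - x) for x in data) / len(data)

def minibatch_gradient(theta, data, batch_size, rng):
    # Noisy estimate of the same gradient from a random subset of the data
    batch = rng.sample(data, batch_size)
    return sum(2 * (theta - x) for x in batch) / batch_size

def run_sgd(data, lr=0.05, steps=500, batch_size=4, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        theta -= lr * minibatch_gradient(theta, data, batch_size, rng)
    return theta

# Synthetic data (assumed for illustration): 200 samples scattered around 3.0
data_rng = random.Random(42)
data = [data_rng.gauss(3.0, 1.0) for _ in range(200)]
data_mean = sum(data) / len(data)  # exact minimizer of this particular loss

theta_sgd = run_sgd(data)
print(f"exact minimizer: {data_mean:.3f}")
print(f"SGD estimate:    {theta_sgd:.3f}")
print(f"full gradient at SGD estimate: {full_gradient(theta_sgd, data):.3f}")
```

Each SGD step here touches 4 points instead of all 200. The estimate hovers near the exact minimizer rather than settling on it, which is the randomness described above.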

Momentum is an extension of SGD that introduces a memory term to the optimization process. Instead of updating parameters solely based on the current gradient, momentum accumulates an exponentially decaying moving average of past gradients. Mathematically, while vanilla SGD updates parameters as:

$$\theta = \theta - \eta \cdot \nabla L(\theta)$$

where $\theta$ is the parameter vector, $\eta$ is the learning rate, and $\nabla L(\theta)$ is the gradient of the loss, SGD with momentum updates as:

$$\begin{aligned} v &= \mu \cdot v - \eta \cdot \nabla L(\theta) \\ \theta &= \theta + v \end{aligned}$$

Here, $v$ is the velocity (an exponentially decaying accumulation of past gradient steps), and $\mu$ is the momentum coefficient (common values range from 0.5 to 0.99, with 0.9 a frequent default). This approach helps the optimizer maintain direction along valleys and dampen oscillations across them, which often results in faster and more stable convergence, especially on complex loss surfaces.
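One way to see why momentum speeds up movement along a consistent direction: if the gradient stayed constant at some value $g$, the velocity would converge to $-\eta g / (1 - \mu)$, so the effective step would be $1 / (1 - \mu)$ times a plain step. The sketch below checks this numerically. The constant gradient and the specific values are assumptions chosen for illustration.

```python
lr = 0.1   # learning rate (eta)
mu = 0.9   # momentum coefficient
g = 2.0    # assumed constant gradient, for illustration only

v = 0.0
for _ in range(200):
    v = mu * v - lr * g  # same velocity update as in the formula above

plain_step = -lr * g
limit = -lr * g / (1 - mu)  # geometric-series limit of the velocity

print(f"plain SGD step:    {plain_step:.4f}")
print(f"momentum velocity: {v:.4f}")
print(f"predicted limit:   {limit:.4f}")
```

With $\mu = 0.9$ the velocity settles at ten times the plain step. Real gradients change from step to step, so this limit is an idealization rather than something a training run reaches exactly.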


import torch
import matplotlib.pyplot as plt

# Simple quadratic function: f(x) = (x - 3)^2 + 4
def func(x):
    return (x - 3) ** 2 + 4

def grad(x):
    return 2 * (x - 3)

# Plain SGD update rule (the gradient here is exact, so there is no sampling noise)
def sgd(start, lr, steps):
    x = torch.tensor([start], dtype=torch.float32)
    history = [x.item()]
    for _ in range(steps):
        g = grad(x)
        x = x - lr * g
        history.append(x.item())
    return history

# SGD with Momentum optimizer
def sgd_momentum(start, lr, steps, momentum=0.9):
    x = torch.tensor([start], dtype=torch.float32)
    v = torch.tensor([0.0], dtype=torch.float32)
    history = [x.item()]
    for _ in range(steps):
        g = grad(x)
        v = momentum * v - lr * g
        x = x + v
        history.append(x.item())
    return history

# Run both optimizers
start_point = -1.0
lr = 0.07
steps = 15

sgd_hist = sgd(start_point, lr, steps)
momentum_hist = sgd_momentum(start_point, lr, steps, momentum=0.8)

# Prepare data for plotting
x_vals = torch.linspace(-1, 5, 100)
y_vals = func(x_vals)
# The histories already hold Python floats, so func can be applied directly
sgd_y = [func(x) for x in sgd_hist]
momentum_y = [func(x) for x in momentum_hist]

# Create two subplots for better visibility
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# First subplot: SGD
axes[0].plot(x_vals, y_vals, label="Loss surface: f(x)", color="gray")
axes[0].plot(sgd_hist, sgd_y, marker='o', color="blue", label="SGD Path")
axes[0].set_xlabel("x")
axes[0].set_ylabel("f(x)")
axes[0].set_title("SGD on Quadratic Function")
axes[0].legend()

# Second subplot: SGD with Momentum
axes[1].plot(x_vals, y_vals, label="Loss surface: f(x)", color="gray")
axes[1].plot(momentum_hist, momentum_y, marker='x', color="red", label="SGD with Momentum Path")
axes[1].set_xlabel("x")
axes[1].set_ylabel("f(x)")
axes[1].set_title("SGD with Momentum on Quadratic Function")
axes[1].legend()

plt.tight_layout()
plt.show()
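
The two runs above can also be compared numerically rather than by eye. The following sketch re-implements the same update rules in plain Python (no torch or matplotlib) with the same start point, learning rate, step count, and momentum value. The helper names `gd_path` and `momentum_path` are hypothetical and exist only for this check.

```python
def grad(x):
    # Derivative of f(x) = (x - 3)^2 + 4
    return 2 * (x - 3)

def gd_path(start, lr, steps):
    # Plain update: x <- x - lr * grad(x)
    x, path = start, [start]
    for _ in range(steps):
        x = x - lr * grad(x)
        path.append(x)
    return path

def momentum_path(start, lr, steps, momentum):
    # Momentum update: v <- momentum * v - lr * grad(x); x <- x + v
    x, v, path = start, 0.0, [start]
    for _ in range(steps):
        v = momentum * v - lr * grad(x)
        x = x + v
        path.append(x)
    return path

plain = gd_path(-1.0, 0.07, 15)
with_momentum = momentum_path(-1.0, 0.07, 15, momentum=0.8)

print("step  plain |x-3|  momentum |x-3|")
for k, (p, m) in enumerate(zip(plain, with_momentum)):
    print(f"{k:4d}  {abs(p - 3):11.4f}  {abs(m - 3):14.4f}")
```

The printed distances make it easy to see at which step each run comes closest to the minimum at $x = 3$, and whether a run crosses it.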

The quadratic example uses exact gradients, so both runs are deterministic, and a one-dimensional bowl has no ravines in which to zigzag. With the settings above (learning rate 0.07, momentum 0.8, 15 steps), the plain update approaches the minimum at $x = 3$ steadily from one side, shrinking the remaining distance by a constant factor of 0.86 per step and ending near $x \approx 2.58$. The momentum run "remembers" previous gradients and builds up velocity, so it covers the same distance much sooner and passes the minimum at around step 5. It then overshoots to roughly $x \approx 4.6$ and swings back, ending near $x \approx 2.4$. The plots therefore show both sides of momentum: faster initial progress, and oscillation around the minimum when the accumulated velocity is large relative to the curvature. The stabilizing benefits, such as damping the side-to-side zigzag of plain SGD in narrow ravines and averaging out noisy mini-batch gradients, appear on multi-dimensional or noisy loss surfaces rather than in this example.
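The ravine behavior is easier to see in two dimensions. The sketch below is a minimal example, assuming an elongated quadratic bowl $f(x, y) = \tfrac{1}{2}(x^2 + 25y^2)$ chosen purely for illustration. The functions `grad`, `loss`, and `run`, the start point, and the step count are hypothetical and are not taken from the lesson's code.

```python
def grad(x, y):
    # Gradient of f(x, y) = 0.5 * (x^2 + 25 * y^2): shallow along x, steep along y
    return x, 25.0 * y

def loss(x, y):
    return 0.5 * (x ** 2 + 25.0 * y ** 2)

def run(lr, steps, momentum=0.0, start=(10.0, 1.0)):
    # With momentum=0.0 this reduces to the plain gradient update
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = momentum * vx - lr * gx
        vy = momentum * vy - lr * gy
        x, y = x + vx, y + vy
    return loss(x, y)

loss_plain = run(lr=0.07, steps=100)
loss_momentum = run(lr=0.07, steps=100, momentum=0.8)

print(f"final loss, plain update:  {loss_plain:.3e}")
print(f"final loss, with momentum: {loss_momentum:.3e}")
```

The learning rate is capped by the steep $y$ direction, which leaves the plain update slow along the shallow $x$ direction. Momentum accumulates velocity along $x$ under the same learning rate, which is the kind of surface where its advantage tends to show.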


Section 2. Chapter 1
