Learn Stochastic Gradient Descent and Momentum | Optimization Algorithms in Practice
Optimization and Regularization in Neural Networks with Python

Stochastic Gradient Descent and Momentum

Stochastic gradient descent (SGD) is a widely used method for optimizing neural networks that builds on basic gradient descent. In standard (full-batch) gradient descent, you compute the gradient of the loss with respect to the model parameters over the entire training set, then move the parameters in the direction that reduces the loss. This is computationally expensive for large datasets, because every update requires a pass over all the data. SGD instead updates the parameters using the gradient from a single data point or a small batch, which makes each update a noisy estimate of the full gradient. Iterations are much cheaper, and the noise can help the optimizer escape shallow local minima.
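To make the difference between a full-batch update and a stochastic update concrete, here is a minimal sketch that fits a one-parameter model `y ≈ w * x` with a mean-squared-error loss. The toy data, the helper names (`gradient`, `gradient_descent`, `minibatch_sgd`), and the batch size of 8 are illustrative assumptions, not part of the lesson.

```python
import random

# Hypothetical toy data: y = 2 * x exactly, so the best weight is w = 2.
xs = [0.1 * i for i in range(1, 101)]
ys = [2.0 * x for x in xs]

def gradient(w, indices):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over the selected samples."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)

def gradient_descent(w, lr, steps):
    """Full-batch gradient descent: every step uses all samples."""
    all_indices = range(len(xs))
    for _ in range(steps):
        w -= lr * gradient(w, all_indices)
    return w

def minibatch_sgd(w, lr, steps, batch_size, seed=0):
    """Mini-batch SGD: every step uses a random subset of the samples."""
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(range(len(xs)), batch_size)
        w -= lr * gradient(w, batch)
    return w

# At the same w, a mini-batch gradient is only an estimate of the full gradient.
probe = random.Random(1).sample(range(len(xs)), 8)
print("full gradient at w=0:      ", gradient(0.0, range(len(xs))))
print("mini-batch gradient at w=0:", gradient(0.0, probe))

w_full = gradient_descent(0.0, lr=0.01, steps=200)
w_sgd = minibatch_sgd(0.0, lr=0.01, steps=200, batch_size=8)
print("full-batch result:", w_full)
print("mini-batch result:", w_sgd)
```

Each mini-batch step here touches 8 samples instead of 100. Because the toy data contain no noise, both variants end up at essentially the same weight; on real data the mini-batch trajectory would keep fluctuating around the minimum.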

Momentum is an extension of SGD that adds a memory term to the update. Instead of updating parameters based solely on the current gradient, momentum accumulates an exponentially decaying sum of past gradients. Mathematically, while vanilla SGD updates parameters as:

$$\theta = \theta - \eta \cdot \nabla L(\theta)$$

where $\theta$ is the parameter vector, $\eta$ is the learning rate, and $\nabla L(\theta)$ is the gradient of the loss, SGD with momentum updates as:

$$v = \mu \cdot v - \eta \cdot \nabla L(\theta)$$
$$\theta = \theta + v$$

Here, $v$ is the velocity (the accumulated gradient), and $\mu$ is the momentum coefficient (commonly set between 0.5 and 0.9). This approach helps the optimizer maintain direction in valleys and dampen oscillations, resulting in faster and more stable convergence, especially on complex loss surfaces.
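The phrase "accumulated gradient" can be made precise. Starting from $v = 0$ and unrolling the velocity update over $t$ steps gives $v_t = -\eta \sum_{k=0}^{t-1} \mu^{t-1-k} g_k$, where $g_k$ is the gradient seen at step $k$. If the gradient stays at a constant value $g$, the velocity approaches $-\eta g / (1 - \mu)$, so momentum acts like a learning rate enlarged by a factor of $1 / (1 - \mu)$ along directions where the gradient is consistent. The sketch below checks both statements numerically; the function names and the gradient values are made up for illustration.

```python
def velocity_by_recurrence(grads, lr, mu):
    """Apply v = mu * v - lr * g once per gradient, starting from v = 0."""
    v = 0.0
    for g in grads:
        v = mu * v - lr * g
    return v

def velocity_by_sum(grads, lr, mu):
    """Closed form: exponentially decaying sum of all past gradients."""
    t = len(grads)
    return -lr * sum(mu ** (t - 1 - k) * g for k, g in enumerate(grads))

lr, mu = 0.1, 0.9

# Arbitrary illustrative gradient values.
grads = [1.0, -0.5, 2.0, 0.25]
v_rec = velocity_by_recurrence(grads, lr, mu)
v_sum = velocity_by_sum(grads, lr, mu)
print("recurrence:", v_rec, " closed form:", v_sum)

# With a constant gradient g = 1, the velocity approaches -lr * g / (1 - mu).
v_limit = velocity_by_recurrence([1.0] * 200, lr, mu)
print("velocity after 200 steps:", v_limit, " limit:", -lr * 1.0 / (1 - mu))
```

With `mu = 0.9`, the factor $1 / (1 - \mu)$ is 10, which is why a learning rate that works well without momentum may need to be reduced when momentum is added.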

```python
import torch
import matplotlib.pyplot as plt

# Simple quadratic function: f(x) = (x - 3)^2 + 4, with its minimum at x = 3
def func(x):
    return (x - 3) ** 2 + 4

# Exact derivative of f; no sampling noise is involved in this example
def grad(x):
    return 2 * (x - 3)

# SGD optimizer
def sgd(start, lr, steps):
    x = torch.tensor([start], dtype=torch.float32)
    history = [x.item()]
    for _ in range(steps):
        g = grad(x)
        x = x - lr * g
        history.append(x.item())
    return history

# SGD with Momentum optimizer
def sgd_momentum(start, lr, steps, momentum=0.9):
    x = torch.tensor([start], dtype=torch.float32)
    v = torch.tensor([0.0], dtype=torch.float32)
    history = [x.item()]
    for _ in range(steps):
        g = grad(x)
        v = momentum * v - lr * g  # update the velocity
        x = x + v                  # move along the velocity
        history.append(x.item())
    return history

# Run both optimizers
start_point = -1.0
lr = 0.07
steps = 15

sgd_hist = sgd(start_point, lr, steps)
momentum_hist = sgd_momentum(start_point, lr, steps, momentum=0.8)

# Prepare data for plotting (histories are plain floats, so func returns floats)
x_vals = torch.linspace(-1, 5, 100)
y_vals = func(x_vals)
sgd_y = [func(x) for x in sgd_hist]
momentum_y = [func(x) for x in momentum_hist]

# Create two subplots for better visibility
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# First subplot: SGD
axes[0].plot(x_vals, y_vals, label="Loss surface: f(x)", color="gray")
axes[0].plot(sgd_hist, sgd_y, marker='o', color="blue", label="SGD Path")
axes[0].set_xlabel("x")
axes[0].set_ylabel("f(x)")
axes[0].set_title("SGD on Quadratic Function")
axes[0].legend()

# Second subplot: SGD with Momentum
axes[1].plot(x_vals, y_vals, label="Loss surface: f(x)", color="gray")
axes[1].plot(momentum_hist, momentum_y, marker='x', color="red", label="SGD with Momentum Path")
axes[1].set_xlabel("x")
axes[1].set_ylabel("f(x)")
axes[1].set_title("SGD with Momentum on Quadratic Function")
axes[1].legend()

plt.tight_layout()
plt.show()
```
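In practice you would usually not write the update by hand. PyTorch's `torch.optim.SGD` accepts a `momentum` argument, although it organizes the arithmetic slightly differently: its buffer accumulates raw gradients ($b = \mu b + g$) and the learning rate is applied when the parameter is updated ($\theta = \theta - \eta b$). For a constant learning rate this yields the same trajectory as the velocity form used in this lesson. The sketch below compares the two on the lesson's quadratic with the same hyperparameters; the helper names `run_torch_sgd` and `run_manual_momentum` are made up for this comparison.

```python
import torch

def run_torch_sgd(start, lr, steps, momentum):
    """Minimize f(x) = (x - 3)^2 + 4 with PyTorch's built-in optimizer."""
    x = torch.tensor([start], dtype=torch.float32, requires_grad=True)
    optimizer = torch.optim.SGD([x], lr=lr, momentum=momentum)
    for _ in range(steps):
        optimizer.zero_grad()            # clear the gradient from the previous step
        loss = ((x - 3) ** 2 + 4).sum()  # scalar loss
        loss.backward()                  # autograd fills x.grad with 2 * (x - 3)
        optimizer.step()                 # momentum update
    return x.item()

def run_manual_momentum(start, lr, steps, momentum):
    """The velocity form from the lesson, written with plain floats."""
    x, v = start, 0.0
    for _ in range(steps):
        g = 2 * (x - 3)
        v = momentum * v - lr * g
        x = x + v
    return x

x_torch = run_torch_sgd(-1.0, lr=0.07, steps=15, momentum=0.8)
x_manual = run_manual_momentum(-1.0, lr=0.07, steps=15, momentum=0.8)
print("torch.optim.SGD:", x_torch)
print("manual update:  ", x_manual)
```

The two results should agree up to float32 rounding. The formulations stop being interchangeable if the learning rate changes during training, because the lesson's velocity has past learning rates baked into it while PyTorch's buffer does not.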

In the quadratic example above, both optimizers use the exact gradient, so the comparison isolates the effect of momentum rather than of sampling noise. Plain SGD moves steadily toward the minimum at x = 3: with a learning rate of 0.07, each step shrinks the remaining distance by a constant factor of 1 - 2 × 0.07 = 0.86, so the steps get smaller as the gradient shrinks. Momentum lets the optimizer "remember" previous gradients and build up velocity in directions of consistent descent, so it covers the initial distance in fewer steps. With a momentum coefficient of 0.8, however, the accumulated velocity carries it past the minimum, and the trajectory oscillates around x = 3 with shrinking amplitude before settling. The benefit of momentum is clearest on loss surfaces with ravines or plateaus, where plain SGD either zigzags across the steep direction or crawls along the flat one. There, momentum damps gradient components that keep changing sign and reinforces those that stay consistent.
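A one-dimensional quadratic cannot show the ravine effect, so here is a small two-dimensional sketch. It assumes a hypothetical loss $f(x, y) = \frac{1}{2}(x^2 + 25y^2)$, which is 25 times steeper along $y$ than along $x$. The learning rate of 0.03 is an assumed value, chosen small enough to keep the steep direction stable, which in turn makes plain gradient descent slow along the shallow direction. The function names and the starting point are also illustrative.

```python
import math

# Hypothetical ravine-shaped loss: f(x, y) = 0.5 * (x**2 + 25 * y**2).
# The minimum is at (0, 0); curvature is 25x larger along y than along x.
def grad(x, y):
    return x, 25.0 * y

def distance_after(lr, steps, mu=0.0, start=(10.0, 1.0)):
    """Run gradient descent with momentum coefficient mu (mu=0 is plain
    gradient descent) and return the final distance to the minimum."""
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = mu * vx - lr * gx
        vy = mu * vy - lr * gy
        x, y = x + vx, y + vy
    return math.hypot(x, y)

dist_plain = distance_after(lr=0.03, steps=100)
dist_momentum = distance_after(lr=0.03, steps=100, mu=0.9)
print("distance to minimum without momentum:", dist_plain)
print("distance to minimum with momentum:   ", dist_momentum)
```

Under these assumptions the run with momentum should end noticeably closer to the minimum, because the velocity keeps growing along the shallow $x$ direction where the gradient is small but consistent.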


How does adding momentum to stochastic gradient descent change the optimization trajectory compared to vanilla SGD?
