Stochastic Gradient Descent and Momentum
Stochastic gradient descent (SGD) is a widely used method for optimizing neural networks, building upon the foundation of basic gradient descent. In standard gradient descent, you update the model parameters by computing the gradient of the loss function with respect to all training data, then moving in the direction that reduces the loss. However, this can be computationally expensive for large datasets. SGD addresses this by updating parameters using the gradient from a single data point or a small batch, introducing randomness into each update. This allows for faster iterations and can help the optimizer escape shallow local minima.
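To make the distinction concrete, the sketch below contrasts one full-batch gradient step with several mini-batch SGD steps on a toy linear-regression problem; the data, weights, and loss here are made up purely for illustration:

import torch

# Hypothetical toy data: 1,000 samples, 3 features, known linear relationship
torch.manual_seed(0)
X = torch.randn(1000, 3)
y = X @ torch.tensor([2.0, -1.0, 0.5]) + 0.1 * torch.randn(1000)

w = torch.zeros(3, requires_grad=True)
lr = 0.1

def mse(w, X, y):
    return ((X @ w - y) ** 2).mean()

# Full-batch gradient descent: one update looks at all 1,000 samples
loss = mse(w, X, y)
loss.backward()
with torch.no_grad():
    w -= lr * w.grad
w.grad.zero_()

# Mini-batch SGD: each update looks at only 32 randomly chosen samples,
# so updates are cheaper but noisier
for _ in range(5):
    idx = torch.randint(0, len(X), (32,))
    loss = mse(w, X[idx], y[idx])
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
    w.grad.zero_()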
Momentum is an extension of SGD that introduces a memory term to the optimization process. Instead of updating parameters solely based on the current gradient, momentum accumulates an exponentially decaying moving average of past gradients. Mathematically, while vanilla SGD updates parameters as:
θ = θ − η·∇L(θ)

where θ is the parameter vector, η is the learning rate, and ∇L(θ) is the gradient of the loss. SGD with momentum updates as:

v = μ·v − η·∇L(θ)
θ = θ + v

Here, v is the velocity (the accumulated gradient), and μ is the momentum coefficient (commonly set between 0.5 and 0.9). This approach helps the optimizer maintain direction in valleys and dampen oscillations, resulting in faster and more stable convergence, especially on complex loss surfaces.
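In practice you rarely implement this update by hand: PyTorch's built-in torch.optim.SGD accepts a momentum argument. Its documented form keeps the velocity as v ← μ·v + ∇L(θ) and then applies θ ← θ − η·v, which differs from the equations above only in where the learning rate is folded in. The snippet below is a minimal sketch with an arbitrary toy model, just to show where the argument goes:

import torch
import torch.nn as nn

# Arbitrary toy model and data, purely for illustration
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(32, 10)
target = torch.randn(32, 1)

optimizer.zero_grad()              # clear gradients from the previous step
loss = loss_fn(model(x), target)
loss.backward()                    # compute gradients for every parameter
optimizer.step()                   # apply the momentum update

The script below instead implements both update rules by hand on a simple one-dimensional quadratic, so the two trajectories can be compared directly.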
import torch
import matplotlib.pyplot as plt

# Simple quadratic function: f(x) = (x - 3)^2 + 4
def func(x):
    return (x - 3) ** 2 + 4

def grad(x):
    return 2 * (x - 3)

# SGD optimizer
def sgd(start, lr, steps):
    x = torch.tensor([start], dtype=torch.float32)
    history = [x.item()]
    for _ in range(steps):
        g = grad(x)
        x = x - lr * g
        history.append(x.item())
    return history

# SGD with Momentum optimizer
def sgd_momentum(start, lr, steps, momentum=0.9):
    x = torch.tensor([start], dtype=torch.float32)
    v = torch.tensor([0.0], dtype=torch.float32)
    history = [x.item()]
    for _ in range(steps):
        g = grad(x)
        v = momentum * v - lr * g
        x = x + v
        history.append(x.item())
    return history

# Run both optimizers
start_point = -1.0
lr = 0.07
steps = 15

sgd_hist = sgd(start_point, lr, steps)
momentum_hist = sgd_momentum(start_point, lr, steps, momentum=0.8)

# Prepare data for plotting
x_vals = torch.linspace(-1, 5, 100)
y_vals = func(x_vals)
sgd_y = [func(torch.tensor([x])) for x in sgd_hist]
momentum_y = [func(torch.tensor([x])) for x in momentum_hist]

# Create two subplots for better visibility
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# First subplot: SGD
axes[0].plot(x_vals, y_vals, label="Loss surface: f(x)", color="gray")
axes[0].plot(sgd_hist, sgd_y, marker='o', color="blue", label="SGD Path")
axes[0].set_xlabel("x")
axes[0].set_ylabel("f(x)")
axes[0].set_title("SGD on Quadratic Function")
axes[0].legend()

# Second subplot: SGD with Momentum
axes[1].plot(x_vals, y_vals, label="Loss surface: f(x)", color="gray")
axes[1].plot(momentum_hist, momentum_y, marker='x', color="red", label="SGD with Momentum Path")
axes[1].set_xlabel("x")
axes[1].set_ylabel("f(x)")
axes[1].set_title("SGD with Momentum on Quadratic Function")
axes[1].legend()

plt.tight_layout()
plt.show()
When comparing the convergence behavior of SGD and SGD with momentum on this quadratic example, you can observe that vanilla SGD takes small, steady steps directly against the gradient, so its progress toward the minimum at x = 3 is monotonic but slow. Introducing momentum lets the optimizer "remember" previous gradients and build up velocity in directions of consistent descent: it covers the initial distance much faster, although with a coefficient of 0.8 it also overshoots the minimum and oscillates around it for a few steps before settling. On harder loss surfaces, with ravines or plateaus where gradients point in varying directions, this same mechanism amplifies the consistent component of the gradient and dampens the oscillating one, which is why momentum typically delivers faster and smoother convergence than plain SGD.
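If you want to quantify the comparison rather than judge it from the plots alone, you can reuse sgd_hist and momentum_hist from the script above and print how far each method is from the minimum at x = 3 after every step:

# Assumes sgd_hist and momentum_hist from the script above are in scope
for step, (xs, xm) in enumerate(zip(sgd_hist, momentum_hist)):
    print(f"step {step:2d}: |SGD - 3| = {abs(xs - 3):.4f}, "
          f"|momentum - 3| = {abs(xm - 3):.4f}")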