Momentum in Optimization
Momentum is a key technique in optimization that helps you accelerate gradient descent by incorporating information from previous steps. The update rule for gradient descent with momentum introduces a velocity term, which accumulates an exponentially decaying sum of past gradients. The standard momentum update can be written as:
$$
\begin{aligned}
v_{t+1} &= \beta v_t - \alpha \nabla f(\theta_t) \\
\theta_{t+1} &= \theta_t + v_{t+1}
\end{aligned}
$$

Here, $\theta_t$ represents the current parameter vector at iteration $t$, $\nabla f(\theta_t)$ is the gradient of the loss function with respect to $\theta_t$, $\alpha$ is the learning rate, $v_t$ is the velocity (initialized to zero), and $\beta$ is the momentum coefficient, typically set between 0.5 and 0.99. The hyperparameter $\beta$ determines how much past gradients influence the current update: higher $\beta$ means more memory of previous directions, while lower $\beta$ focuses more on the current gradient. This formulation allows the optimizer to build up speed in consistent directions and dampen oscillations, especially in ravines or along steep, narrow valleys in the loss surface.
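As a quick sanity check on what $\beta$ does, the following minimal sketch (not part of the lesson code; the gradient value, learning rate, and printed step counts are arbitrary choices for illustration) applies the velocity update above to a constant gradient. The velocity accumulates geometrically toward $-\alpha g / (1 - \beta)$, so $\beta = 0.9$ yields an effective step up to ten times larger than a single plain gradient step.

import numpy as np

# Illustrative sketch: momentum velocity under a constant gradient g.
# With v_0 = 0, v_t = -lr * g * (1 - beta**t) / (1 - beta),
# which approaches -lr * g / (1 - beta) as t grows.
g = np.array([1.0, 1.0])   # assumed constant gradient, for illustration only
lr, beta = 0.1, 0.9
v = np.zeros(2)

for t in range(1, 51):
    v = beta * v - lr * g
    if t in (1, 5, 20, 50):
        print(f"step {t:2d}: velocity = {v}")

print("geometric limit -lr*g/(1-beta):", -lr * g / (1 - beta))

This accumulation is why momentum builds up speed along directions where the gradient keeps pointing the same way, while contributions from directions that keep flipping sign largely cancel out.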
You can think of momentum as giving the optimizer "inertia," much like a ball rolling down a hill. Instead of only reacting to the current slope, the optimizer remembers the direction it has been moving and continues in that direction unless the gradients strongly suggest otherwise. This smoothing effect helps the optimizer move faster through shallow regions and reduces erratic zig-zagging across steep slopes, leading to more stable and efficient convergence.
import numpy as np
import matplotlib.pyplot as plt

# Define a simple quadratic loss surface
def loss(x, y):
    return 0.5 * (4 * x**2 + y**2)

# Gradient of the loss: d/dx = 4x, d/dy = y
def grad(x, y):
    return np.array([4 * x, y])

# Gradient descent with and without momentum
def optimize(momentum=False, beta=0.9, lr=0.1, steps=30):
    x, y = 2.0, 2.0
    v = np.zeros(2)
    trajectory = [(x, y)]
    for _ in range(steps):
        g = grad(x, y)
        if momentum:
            # Velocity update: accumulate past gradients, then move along the velocity
            v = beta * v - lr * g
            x, y = x + v[0], y + v[1]
        else:
            # Plain gradient descent step
            x, y = x - lr * g[0], y - lr * g[1]
        trajectory.append((x, y))
    return np.array(trajectory)

traj_gd = optimize(momentum=False)
traj_mom = optimize(momentum=True, beta=0.9)

# Plot the loss contours and both trajectories
x_vals = np.linspace(-2.5, 2.5, 100)
y_vals = np.linspace(-2.5, 2.5, 100)
X, Y = np.meshgrid(x_vals, y_vals)
Z = loss(X, Y)

plt.figure(figsize=(8, 6))
plt.contour(X, Y, Z, levels=30, cmap='Blues', alpha=0.7)
plt.plot(traj_gd[:, 0], traj_gd[:, 1], 'o-', label='Gradient Descent', color='red')
plt.plot(traj_mom[:, 0], traj_mom[:, 1], 'o-', label='Momentum', color='green')
plt.legend()
plt.title('Optimization Trajectories With and Without Momentum')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
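If you want a numeric summary to go with the plot, a small addition along these lines (assuming the traj_gd and traj_mom arrays from the script above are in scope) prints how far each run ends from the minimum at (0, 0). The exact values depend on the learning rate, beta, and the number of steps, so treat them as a diagnostic for your particular settings rather than a general verdict.

# Assumes traj_gd and traj_mom from the script above.
for name, traj in [("Gradient Descent", traj_gd), ("Momentum", traj_mom)]:
    final = traj[-1]
    print(f"{name:18s} final point: ({final[0]:+.4f}, {final[1]:+.4f}), "
          f"distance to optimum: {np.linalg.norm(final):.4f}")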