Mathematics of Optimization in ML | Momentum and Acceleration

Momentum in Optimization

Momentum is a key technique in optimization that helps you accelerate gradient descent by incorporating information from previous steps. The mathematical update rule for gradient descent with momentum introduces a velocity term, which accumulates a moving average of past gradients. The standard momentum update can be written as:

\begin{align*}
v_{t+1} &= \beta v_t - \alpha \nabla f(\theta_t) \\
\theta_{t+1} &= \theta_t + v_{t+1}
\end{align*}

Here, θ_t represents the current parameter vector at iteration t, ∇f(θ_t) is the gradient of the loss function with respect to θ_t, α is the learning rate, v_t is the velocity (initialized to zero), and β is the momentum coefficient, typically set between 0.5 and 0.99. The hyperparameter β determines how much past gradients influence the current update: a higher β means more memory of previous directions, while a lower β focuses more on the current gradient. This formulation allows the optimizer to build up speed in consistent directions and dampen oscillations, especially in ravines (steep, narrow valleys) of the loss surface.
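Unrolling the recursion with v_0 = 0 makes the "moving average of past gradients" interpretation explicit:

v_{t+1} = -\alpha \sum_{k=0}^{t} \beta^{k} \nabla f(\theta_{t-k})

Each past gradient is weighted by a power of β, so the velocity remembers roughly the last 1/(1 - β) gradients; with β = 0.9 that is on the order of the last ten steps.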

Note

You can think of momentum as giving the optimizer "inertia," much like a ball rolling down a hill. Instead of only reacting to the current slope, the optimizer remembers the direction it has been moving and continues in that direction unless the gradients strongly suggest otherwise. This smoothing effect helps the optimizer move faster through shallow regions and reduces erratic zig-zagging across steep slopes, leading to more stable and efficient convergence.

import numpy as np
import matplotlib.pyplot as plt

# Define a simple quadratic loss surface
def loss(x, y):
    return 0.5 * (4 * x**2 + y**2)

# Gradient of the loss above: partial derivatives with respect to x and y
def grad(x, y):
    return np.array([4 * x, y])

# Gradient descent with and without momentum
def optimize(momentum=False, beta=0.9, lr=0.1, steps=30):
    x, y = 2.0, 2.0
    v = np.zeros(2)
    trajectory = [(x, y)]
    for _ in range(steps):
        g = grad(x, y)
        if momentum:
            # Momentum update: accumulate velocity, then move the parameters
            v = beta * v - lr * g
            x, y = x + v[0], y + v[1]
        else:
            # Plain gradient descent step
            x, y = x - lr * g[0], y - lr * g[1]
        trajectory.append((x, y))
    return np.array(trajectory)

traj_gd = optimize(momentum=False)
traj_mom = optimize(momentum=True, beta=0.9)

# Plot the loss contours and both trajectories
x_vals = np.linspace(-2.5, 2.5, 100)
y_vals = np.linspace(-2.5, 2.5, 100)
X, Y = np.meshgrid(x_vals, y_vals)
Z = loss(X, Y)

plt.figure(figsize=(8, 6))
plt.contour(X, Y, Z, levels=30, cmap='Blues', alpha=0.7)
plt.plot(traj_gd[:, 0], traj_gd[:, 1], 'o-', label='Gradient Descent', color='red')
plt.plot(traj_mom[:, 0], traj_mom[:, 1], 'o-', label='Momentum', color='green')
plt.legend()
plt.title('Optimization Trajectories With and Without Momentum')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
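As a small, hypothetical extension of the example above (not part of the original lesson), you can reuse the optimize and loss functions to see how the choice of β affects progress within the same step budget. Note that β = 0 recovers plain gradient descent; on this particular surface you may find that a moderate β makes the most progress in 30 steps, while a very large β overshoots and keeps oscillating around the minimum.

# Hypothetical extension: reuse optimize() and loss() from above to compare
# how far each setting of beta gets in the same number of steps.
for beta in (0.0, 0.5, 0.9, 0.99):
    traj = optimize(momentum=True, beta=beta)        # beta = 0.0 is plain GD
    final_loss = loss(traj[-1, 0], traj[-1, 1])
    print(f"beta = {beta:4.2f} -> loss after 30 steps: {final_loss:.6f}")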


