Mathematics of Optimization in ML | Momentum and Acceleration

Nesterov Acceleration

Nesterov acceleration, also known as Nesterov Accelerated Gradient (NAG), is a refinement of the standard momentum method used in gradient-based optimization. To understand its derivation, start by recalling the standard momentum update:

  • The parameter update uses both the current gradient and a velocity term;
  • The velocity accumulates gradients with exponential decay, smoothing the trajectory;
  • The parameter is then updated by the velocity.

Mathematically, standard momentum is expressed as:

  • Update the velocity: v_{t+1} = \mu v_t - \eta \nabla f(x_t);
  • Update the parameters: x_{t+1} = x_t + v_{t+1}.

Here, \mu is the momentum coefficient (typically between 0.5 and 0.99) and \eta is the learning rate.
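A minimal single-step sketch of standard momentum in NumPy (the quadratic objective here is a hypothetical example, separate from the demo further below):

import numpy as np

def grad_f(x):
    # Hypothetical quadratic objective f(x) = 0.5 * ||x||^2, so grad f(x) = x
    return x

mu, eta = 0.9, 0.1            # momentum coefficient and learning rate
x = np.array([2.0, 2.0])      # current parameters x_t
v = np.zeros_like(x)          # current velocity v_t

# Standard momentum: the gradient is evaluated at the current position x_t
v = mu * v - eta * grad_f(x)  # v_{t+1} = mu * v_t - eta * grad f(x_t)
x = x + v                     # x_{t+1} = x_t + v_{t+1}
print(x, v)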

Nesterov acceleration modifies this by calculating the gradient not at the current position, but at a lookahead position. This subtle change gives NAG a predictive quality:

  • Compute a lookahead position: x_t + \mu v_t;
  • Evaluate the gradient at this lookahead point;
  • Update the velocity and parameters accordingly.

The Nesterov update equations are:

  • v_{t+1} = \mu v_t - \eta \nabla f(x_t + \mu v_t)
  • x_{t+1} = x_t + v_{t+1}
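A minimal single-step sketch of the Nesterov update, under the same hypothetical quadratic objective as the momentum sketch above; the only change is the point at which the gradient is evaluated:

import numpy as np

def grad_f(x):
    # Hypothetical quadratic objective f(x) = 0.5 * ||x||^2, so grad f(x) = x
    return x

mu, eta = 0.9, 0.1            # momentum coefficient and learning rate
x = np.array([2.0, 2.0])      # current parameters x_t
v = np.zeros_like(x)          # current velocity v_t

# Nesterov: the gradient is evaluated at the lookahead point x_t + mu * v_t
lookahead = x + mu * v
v = mu * v - eta * grad_f(lookahead)  # v_{t+1} = mu * v_t - eta * grad f(x_t + mu * v_t)
x = x + v                             # x_{t+1} = x_t + v_{t+1}
print(x, v)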

This modification means NAG anticipates the future position, leading to better-informed updates, especially in regions where the loss function curves sharply. Compared to standard momentum, NAG tends to provide:

  • Faster convergence in many convex and non-convex problems;
  • More stable optimization paths, reducing overshooting near minima;
  • Enhanced responsiveness to changes in the gradient's direction.

The main difference lies in where the gradient is evaluated: standard momentum uses the current parameters, while Nesterov momentum looks ahead by the momentum term before computing the gradient. This leads to more accurate and adaptive updates.
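One way to make this concrete is a change of variables (a reformulation added here for reference, not part of the equations above). Writing \theta_t = x_t + \mu v_t for the lookahead point, the Nesterov update can be expressed entirely in terms of \theta:

  • v_{t+1} = \mu v_t - \eta \nabla f(\theta_t)
  • \theta_{t+1} = \theta_t + \mu v_{t+1} - \eta \nabla f(\theta_t)

In this form the gradient is computed at the stored parameters \theta_t, which is close to how some deep-learning libraries implement Nesterov momentum.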

Note

Nesterov acceleration's key innovation is the lookahead property: before updating, it evaluates the gradient at a predicted future position instead of the current one. This anticipatory step allows the optimizer to "see" further along the optimization path, often resulting in smoother, more stable convergence and reducing the risk of overshooting minima.

import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    # Quadratic bowl, steeper along y than along x
    return 0.5 * x**2 + 2 * y**2

def grad_f(x, y):
    return np.array([x, 4 * y])

def optimize_path(optimizer, steps=30, lr=0.2, mu=0.9):
    x, y = 2.0, 2.0
    v_x, v_y = 0.0, 0.0
    path = [(x, y)]
    for _ in range(steps):
        if optimizer == 'momentum':
            # Standard momentum: gradient at the current position
            grad = grad_f(x, y)
            v_x = mu * v_x - lr * grad[0]
            v_y = mu * v_y - lr * grad[1]
            x += v_x
            y += v_y
        elif optimizer == 'nesterov':
            # Nesterov: gradient at the lookahead position
            lookahead_x = x + mu * v_x
            lookahead_y = y + mu * v_y
            grad = grad_f(lookahead_x, lookahead_y)
            v_x = mu * v_x - lr * grad[0]
            v_y = mu * v_y - lr * grad[1]
            x += v_x
            y += v_y
        path.append((x, y))
    return np.array(path)

path_momentum = optimize_path('momentum')
path_nesterov = optimize_path('nesterov')

# Plot contours of the function
x_grid = np.linspace(-2, 2, 100)
y_grid = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x_grid, y_grid)
Z = f(X, Y)

plt.figure(figsize=(7, 6))
plt.contour(X, Y, Z, levels=20, cmap='gray')
plt.plot(path_momentum[:, 0], path_momentum[:, 1], 'o-', label='Momentum', color='blue')
plt.plot(path_nesterov[:, 0], path_nesterov[:, 1], 'o-', label='Nesterov', color='orange')
plt.scatter([0], [0], color='red', label='Minimum', zorder=10)
plt.legend()
plt.title('Optimization Paths: Momentum vs Nesterov Acceleration')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
