Mathematics of Optimization in ML | Adaptive Methods

RMSProp and Adagrad

To understand how adaptive learning rates improve optimization in machine learning, you will derive and compare two influential algorithms: Adagrad and RMSProp. Both methods modify the basic gradient descent update by adapting the learning rate for each parameter based on historical gradient information, but they do so in different ways, which leads to distinct behaviors during training.

Mathematical Derivation of Adagrad

Adagrad adjusts the learning rate for each parameter according to the sum of the squares of all previous gradients. The update rule for a parameter θ at step t is:

g_t = \nabla_\theta L(\theta_t)
G_t = G_{t-1} + g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t

Here, G_t is the running sum of squared gradients (element-wise), η is the initial learning rate, and ε is a small constant that avoids division by zero. As G_t accumulates, the effective learning rate for each parameter decreases, especially for parameters with large gradients.
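To make the element-wise behavior concrete, here is a minimal NumPy sketch of a few Adagrad steps on a two-parameter vector; the gradients and variable names are made up for illustration and do not come from any particular library.

import numpy as np

# Illustrative only: hand-picked gradients, one coordinate repeatedly large.
eta, epsilon = 0.1, 1e-8
theta = np.array([1.0, -2.0])
G = np.zeros_like(theta)            # per-parameter accumulator

for step in range(3):
    g = np.array([0.1, 5.0])        # pretend gradient: small for theta[0], large for theta[1]
    G += g**2                       # element-wise accumulation of squared gradients
    theta -= eta / np.sqrt(G + epsilon) * g
    print(step, eta / np.sqrt(G + epsilon))   # per-parameter effective learning rate

The printed effective learning rates shrink for both coordinates, but far faster for the coordinate that keeps receiving large gradients.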

Mathematical Derivation of RMSProp

RMSProp modifies Adagrad by using an exponentially decaying average of squared gradients instead of a cumulative sum. This prevents the learning rate from shrinking too quickly. The RMSProp update is:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

Here, γ is the decay rate (typical values are around 0.9). By using a moving average, RMSProp maintains a more stable and responsive adaptation of learning rates than Adagrad.
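To see why the decaying average keeps the step size from collapsing, the short sketch below feeds the same constant gradient to both accumulators and prints the resulting effective learning rates; the constant gradient of 1.0 is an assumption made purely for illustration.

import numpy as np

eta, gamma, epsilon = 0.1, 0.9, 1e-8
G, Eg2 = 0.0, 0.0                            # Adagrad sum vs. RMSProp moving average

for t in range(1, 101):
    g = 1.0                                  # assumed constant gradient
    G += g**2                                # grows linearly: G_t = t
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2   # converges toward g^2 = 1
    if t in (1, 10, 100):
        print(t, eta / np.sqrt(G + epsilon), eta / np.sqrt(Eg2 + epsilon))

Adagrad's effective rate decays like η/√t, while RMSProp's settles near η, which is exactly the difference the formulas above predict.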

Differences Between Adagrad and RMSProp

  • Adagrad accumulates all past squared gradients, causing the learning rate to shrink and potentially stop learning;
  • RMSProp uses a moving average, so the learning rate adapts but does not vanish;
  • Both methods adjust learning rates per-parameter, but RMSProp is typically preferred for non-convex problems and deep learning due to its stability.
Note

Adaptive learning rates help optimization by scaling each parameter's update according to how frequently or infrequently it receives large gradients. Parameters with consistently large gradients get smaller updates, while those with small or infrequent gradients get larger updates. This allows the optimizer to progress quickly along flat directions and more cautiously along steep ones, improving convergence and stability, especially in high-dimensional or sparse problems.

import numpy as np
import matplotlib.pyplot as plt

# Simple quadratic objective: f(x) = 0.5 * x^2, so the gradient is just x.
def grad(x):
    return x

def adagrad_update(x, lr, G, epsilon=1e-8):
    g = grad(x)
    G += g**2                                   # accumulate squared gradients
    x_new = x - lr / np.sqrt(G + epsilon) * g   # epsilon inside the sqrt, matching the update rule above
    return x_new, G

def rmsprop_update(x, lr, Eg2, gamma=0.9, epsilon=1e-8):
    g = grad(x)
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2      # exponentially decaying average of squared gradients
    x_new = x - lr / np.sqrt(Eg2 + epsilon) * g
    return x_new, Eg2

steps = 50
x0 = 5.0
lr = 1.0

# Adagrad trajectory
x_a = x0
G = 0.0
adagrad_traj = [x_a]
for _ in range(steps):
    x_a, G = adagrad_update(x_a, lr, G)
    adagrad_traj.append(x_a)

# RMSProp trajectory
x_r = x0
Eg2 = 0.0
rmsprop_traj = [x_r]
for _ in range(steps):
    x_r, Eg2 = rmsprop_update(x_r, lr, Eg2)
    rmsprop_traj.append(x_r)

plt.plot(adagrad_traj, label="Adagrad")
plt.plot(rmsprop_traj, label="RMSProp")
plt.xlabel("Step")
plt.ylabel("Parameter value")
plt.title("Parameter updates: Adagrad vs RMSProp")
plt.legend()
plt.show()
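The run above tracks a single parameter. To connect with the note on sparse or infrequent gradients, here is one more hedged sketch (synthetic gradients, illustrative names) in which one coordinate receives a gradient only every tenth step, so its accumulator stays small and its effective learning rate stays comparatively large.

import numpy as np

eta, epsilon = 0.1, 1e-8
G = np.zeros(2)                     # Adagrad accumulator for a [dense, sparse] pair of coordinates

for t in range(100):
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])   # sparse coordinate updated every 10th step
    G += g**2

print(eta / np.sqrt(G + epsilon))   # the rarely-updated coordinate keeps a larger effective rate

This is the mechanism the note describes: infrequently updated parameters retain relatively large steps, which is one reason adaptive methods work well on sparse problems.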

Which statement best describes the key difference between Adagrad and RMSProp?

