Mathematics of Optimization in ML | Adaptive Methods

RMSProp and Adagrad

To understand how adaptive learning rates improve optimization in machine learning, you will derive and compare two influential algorithms: Adagrad and RMSProp. Both methods modify the basic gradient descent update by adapting the learning rate for each parameter based on historical gradient information, but they do so in different ways, which leads to distinct behaviors during training.

Mathematical Derivation of Adagrad

Adagrad adjusts the learning rate for each parameter according to the sum of the squares of all previous gradients. The update rule for a parameter θ at step t is:

g_t = \nabla_\theta L(\theta_t)
G_t = G_{t-1} + g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t

Here, G_t is the running sum of squared gradients (element-wise), η is the initial learning rate, and ε is a small constant that avoids division by zero. As G_t accumulates, the effective learning rate for each parameter decreases, especially for parameters with large gradients.
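To make the element-wise behavior concrete, here is a minimal NumPy sketch of a few Adagrad steps on a two-parameter vector; the gradients and variable names are made up for illustration and do not come from any particular library.

import numpy as np

# Illustrative only: hand-picked gradients, one coordinate repeatedly large.
eta, epsilon = 0.1, 1e-8
theta = np.array([1.0, -2.0])
G = np.zeros_like(theta)            # per-parameter accumulator

for step in range(3):
    g = np.array([0.1, 5.0])        # pretend gradient: small for theta[0], large for theta[1]
    G += g**2                       # element-wise accumulation of squared gradients
    theta -= eta / np.sqrt(G + epsilon) * g
    print(step, eta / np.sqrt(G + epsilon))   # per-parameter effective learning rate

The printed effective learning rates shrink for both coordinates, but far faster for the coordinate that keeps receiving large gradients.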

Mathematical Derivation of RMSProp

RMSProp modifies Adagrad by using an exponentially decaying average of squared gradients instead of a cumulative sum. This prevents the learning rate from shrinking too quickly. The RMSProp update is:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

Here, γ is the decay rate (typical values are around 0.9). By using a moving average, RMSProp maintains a more stable and responsive adaptation of learning rates than Adagrad.
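To see why the decaying average keeps the step size from collapsing, the short sketch below feeds the same constant gradient to both accumulators and prints the resulting effective learning rates; the constant gradient of 1.0 is an assumption made purely for illustration.

import numpy as np

eta, gamma, epsilon = 0.1, 0.9, 1e-8
G, Eg2 = 0.0, 0.0                            # Adagrad sum vs. RMSProp moving average

for t in range(1, 101):
    g = 1.0                                  # assumed constant gradient
    G += g**2                                # grows linearly: G_t = t
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2   # converges toward g^2 = 1
    if t in (1, 10, 100):
        print(t, eta / np.sqrt(G + epsilon), eta / np.sqrt(Eg2 + epsilon))

Adagrad's effective rate decays like η/√t, while RMSProp's settles near η, which is exactly the difference the formulas above predict.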

Differences Between Adagrad and RMSProp

  • Adagrad accumulates all past squared gradients, causing the learning rate to shrink and potentially stop learning;
  • RMSProp uses a moving average, so the learning rate adapts but does not vanish;
  • Both methods adjust learning rates per-parameter, but RMSProp is typically preferred for non-convex problems and deep learning due to its stability.
Note

Adaptive learning rates help optimization by scaling each parameter's update according to how frequently or infrequently it receives large gradients. Parameters with consistently large gradients get smaller updates, while those with small or infrequent gradients get larger updates. This allows the optimizer to progress quickly along flat directions and more cautiously along steep ones, improving convergence and stability, especially in high-dimensional or sparse problems.

import numpy as np
import matplotlib.pyplot as plt

# Simple quadratic objective: f(x) = 0.5 * x^2, so the gradient is just x.
def grad(x):
    return x

def adagrad_update(x, lr, G, epsilon=1e-8):
    g = grad(x)
    G += g**2                                   # accumulate squared gradients
    x_new = x - lr / np.sqrt(G + epsilon) * g   # epsilon inside the sqrt, matching the update rule above
    return x_new, G

def rmsprop_update(x, lr, Eg2, gamma=0.9, epsilon=1e-8):
    g = grad(x)
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2      # exponentially decaying average of squared gradients
    x_new = x - lr / np.sqrt(Eg2 + epsilon) * g
    return x_new, Eg2

steps = 50
x0 = 5.0
lr = 1.0

# Adagrad trajectory
x_a = x0
G = 0.0
adagrad_traj = [x_a]
for _ in range(steps):
    x_a, G = adagrad_update(x_a, lr, G)
    adagrad_traj.append(x_a)

# RMSProp trajectory
x_r = x0
Eg2 = 0.0
rmsprop_traj = [x_r]
for _ in range(steps):
    x_r, Eg2 = rmsprop_update(x_r, lr, Eg2)
    rmsprop_traj.append(x_r)

plt.plot(adagrad_traj, label="Adagrad")
plt.plot(rmsprop_traj, label="RMSProp")
plt.xlabel("Step")
plt.ylabel("Parameter value")
plt.title("Parameter updates: Adagrad vs RMSProp")
plt.legend()
plt.show()
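The run above tracks a single parameter. To connect with the note on sparse or infrequent gradients, here is one more hedged sketch (synthetic gradients, illustrative names) in which one coordinate receives a gradient only every tenth step, so its accumulator stays small and its effective learning rate stays comparatively large.

import numpy as np

eta, epsilon = 0.1, 1e-8
G = np.zeros(2)                     # Adagrad accumulator for a [dense, sparse] pair of coordinates

for t in range(100):
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])   # sparse coordinate updated every 10th step
    G += g**2

print(eta / np.sqrt(G + epsilon))   # the rarely-updated coordinate keeps a larger effective rate

This is the mechanism the note describes: infrequently updated parameters retain relatively large steps, which is one reason adaptive methods work well on sparse problems.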

Which statement best describes the key difference between Adagrad and RMSProp?

