RMSProp and Adagrad
To understand how adaptive learning rates improve optimization in machine learning, you will derive and compare two influential algorithms: Adagrad and RMSProp. Both methods modify the basic gradient descent update by adapting the learning rate for each parameter based on historical gradient information, but they do so in different ways, which leads to distinct behaviors during training.
Mathematical Derivation of Adagrad
Adagrad adjusts the learning rate for each parameter according to the sum of the squares of all previous gradients. The update rule for a parameter θ at step t is:
$$g_t = \nabla_\theta L(\theta_t)$$

$$G_t = G_{t-1} + g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t$$

Here, $G_t$ is the running sum of squared gradients (element-wise), $\eta$ is the initial learning rate, and $\epsilon$ is a small value that avoids division by zero. As $G_t$ accumulates, the effective learning rate for each parameter decreases, especially for parameters with large gradients.
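To see why the effective learning rate shrinks, consider a simplified, purely illustrative case in which every step produces the same gradient $g$:

$$G_t = t\, g^2, \qquad \frac{\eta}{\sqrt{G_t} + \epsilon}\, g \approx \frac{\eta}{\sqrt{t}\, |g|}\, g = \frac{\eta}{\sqrt{t}}\, \operatorname{sign}(g),$$

so the step size decays like $1/\sqrt{t}$ even though the gradient itself never changes.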
Mathematical Derivation of RMSProp
RMSProp modifies Adagrad by using an exponentially decaying average of squared gradients instead of a cumulative sum. This prevents the learning rate from shrinking too quickly. The RMSProp update is:
$$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$$

Here, $\gamma$ is the decay rate (typical values are around 0.9). By using a moving average, RMSProp maintains a more stable and responsive adaptation of learning rates compared to Adagrad.
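Applying the same constant-gradient thought experiment to RMSProp (again, only as an illustration) shows why its step size levels off rather than vanishing: starting from $E[g^2]_0 = 0$,

$$E[g^2]_t = (1 - \gamma^t)\, g^2 \longrightarrow g^2 \quad \text{as } t \to \infty,$$

so the effective step approaches $\frac{\eta}{|g| + \epsilon}\, g \approx \eta\, \operatorname{sign}(g)$: a constant scale set by $\eta$, not a quantity that decays toward zero.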
Differences Between Adagrad and RMSProp
- Adagrad accumulates all past squared gradients, so the effective learning rate shrinks monotonically and learning can eventually stall;
- RMSProp uses a moving average, so the learning rate adapts but does not vanish;
- Both methods adjust learning rates per-parameter, but RMSProp is typically preferred for non-convex problems and deep learning due to its stability.
Adaptive learning rates help optimization by scaling each parameter's update according to how frequently or infrequently it receives large gradients. Parameters with consistently large gradients get smaller updates, while those with small or infrequent gradients get larger updates. This allows the optimizer to progress quickly along flat directions and more cautiously along steep ones, improving convergence and stability, especially in high-dimensional or sparse problems.
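As a small sketch of this per-parameter behavior (not part of the lesson's main example; the two-parameter setup and sparsity pattern below are invented purely for illustration), the following snippet accumulates Adagrad-style squared gradients for one parameter that receives a gradient every step and one that receives a gradient only occasionally, then compares their effective learning rates:

```python
import numpy as np

# Illustrative sketch: compare Adagrad's effective learning rate for a
# frequently updated parameter versus a sparsely updated one.
rng = np.random.default_rng(0)

eta, eps = 0.1, 1e-8
G = np.zeros(2)  # per-parameter sum of squared gradients

for t in range(1, 101):
    g_frequent = rng.normal()                        # gradient signal every step
    g_sparse = rng.normal() if t % 10 == 0 else 0.0  # signal only every 10th step
    G += np.array([g_frequent, g_sparse]) ** 2

effective_lr = eta / (np.sqrt(G) + eps)  # eta / (sqrt(G_t) + eps), per parameter
print("Effective learning rates [frequent, sparse]:", effective_lr)
# The sparsely updated parameter retains a much larger effective step,
# because its accumulator G has grown far less.
```

Under an RMSProp-style moving average, the accumulator would instead forget old gradients, so each effective rate would be governed by recent gradient magnitudes rather than the entire history.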
```python
import numpy as np
import matplotlib.pyplot as plt

# Simple quadratic function: f(x) = 0.5 * x^2
def grad(x):
    return x

def adagrad_update(x, lr, G, epsilon=1e-8):
    g = grad(x)
    G += g**2
    x_new = x - lr / (np.sqrt(G) + epsilon) * g
    return x_new, G

def rmsprop_update(x, lr, Eg2, gamma=0.9, epsilon=1e-8):
    g = grad(x)
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2
    x_new = x - lr / (np.sqrt(Eg2) + epsilon) * g
    return x_new, Eg2

steps = 50
x0 = 5.0
lr = 1.0

# Adagrad
x_a = x0
G = 0
adagrad_traj = [x_a]
for _ in range(steps):
    x_a, G = adagrad_update(x_a, lr, G)
    adagrad_traj.append(x_a)

# RMSProp
x_r = x0
Eg2 = 0
rmsprop_traj = [x_r]
for _ in range(steps):
    x_r, Eg2 = rmsprop_update(x_r, lr, Eg2)
    rmsprop_traj.append(x_r)

plt.plot(adagrad_traj, label="Adagrad")
plt.plot(rmsprop_traj, label="RMSProp")
plt.xlabel("Step")
plt.ylabel("Parameter value")
plt.title("Parameter updates: Adagrad vs RMSProp")
plt.legend()
plt.show()
```