
RMSProp and Adam Optimizers

RMSProp and Adam are two of the most widely used adaptive optimization algorithms in deep learning. Both optimizers address the limitations of basic stochastic gradient descent (SGD) and momentum, especially when training deep neural networks on complex, noisy, or non-stationary data. The key innovation in both RMSProp and Adam is their ability to adapt the learning rate for each parameter individually, based on the history of gradients.

RMSProp

RMSProp (Root Mean Square Propagation) maintains a moving average of the squared gradients for each parameter. The moving average of the squared gradient at time step t is calculated as:

E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1 - \rho) \cdot g_t^2

where:

  • E[g^2]_t is the moving average of the squared gradients at time t;
  • \rho is the decay rate (commonly set to 0.9);
  • g_t is the gradient at time t.

The parameter update is then scaled by this moving average:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t

where:

  • \theta_t is the parameter value at time t;
  • \eta is the learning rate;
  • \epsilon is a small constant to prevent division by zero.

This adaptive scaling helps stabilize training by normalizing updates: parameters with consistently large gradients receive smaller updates, while those with smaller or infrequent gradients receive relatively larger updates.
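
To make the update rule concrete, here is a minimal from-scratch sketch of RMSProp acting on a single parameter vector. It uses NumPy and a toy quadratic loss purely for illustration; the variable names and values are not part of any library API.

import numpy as np

eta = 0.001   # learning rate
rho = 0.9     # decay rate of the squared-gradient average
eps = 1e-8    # small constant for numerical stability

theta = np.array([0.5, -1.2, 2.0])
avg_sq_grad = np.zeros_like(theta)  # E[g^2], one entry per parameter

for t in range(1, 4):                     # a few illustrative steps
    g = 2 * theta                         # toy gradient of f(theta) = ||theta||^2
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g**2
    theta = theta - eta * g / np.sqrt(avg_sq_grad + eps)
    print(f"step {t}: theta = {theta}")

Note how each parameter is divided by the square root of its own squared-gradient average, so coordinates with large recent gradients take smaller steps.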

Adam

Adam (Adaptive Moment Estimation) builds upon RMSProp by maintaining two moving averages:

  • The first moment (mean of gradients):
m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t
  • The second moment (uncentered variance of gradients):
v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2

where:

  • m_t is the first moment estimate at time t;
  • v_t is the second moment estimate at time t;
  • \beta_1 and \beta_2 are the decay rates for the moment estimates (commonly 0.9 and 0.999).

Adam also includes bias correction terms to counteract initialization effects:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

The parameter update step for Adam is:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t

Adam combines the benefits of momentum (smoother updates) and adaptive learning rates (per-parameter scaling), often leading to faster and more robust convergence.
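
As with RMSProp, a minimal from-scratch sketch of a single-parameter Adam update helps connect the formulas above; the toy gradient and values below are illustrative only, not library code.

import numpy as np

eta = 0.001               # learning rate
beta1, beta2 = 0.9, 0.999 # decay rates for the moment estimates
eps = 1e-8                # numerical stability constant

theta = np.array([0.5, -1.2, 2.0])
m = np.zeros_like(theta)  # first moment estimate
v = np.zeros_like(theta)  # second moment estimate

for t in range(1, 4):                    # a few illustrative steps
    g = 2 * theta                        # toy gradient of f(theta) = ||theta||^2
    m = beta1 * m + (1 - beta1) * g      # update biased first moment
    v = beta2 * v + (1 - beta2) * g**2   # update biased second moment
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    print(f"step {t}: theta = {theta}")

The example below configures both RMSProp and Adam in PyTorch and runs a few training steps on random data so you can compare the reported loss values.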

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

# Example model
# Note: nn.CrossEntropyLoss expects raw logits, so no Softmax layer is added;
# the loss applies log-softmax internally.
def make_model():
    return nn.Sequential(
        nn.Linear(32, 64),
        nn.ReLU(),
        nn.Linear(64, 10)
    )

model_rmsprop = make_model()
model_adam = make_model()

# Configuring RMSProp optimizer
rmsprop_optimizer = optim.RMSprop(
    model_rmsprop.parameters(),
    lr=0.001,      # Learning rate
    alpha=0.9,     # Smoothing constant (decay rate of the squared-gradient average)
    momentum=0.0,  # Momentum parameter
    eps=1e-7       # Small value to avoid division by zero
)

# Configuring Adam optimizer
adam_optimizer = optim.Adam(
    model_adam.parameters(),
    lr=0.001,            # Learning rate
    betas=(0.9, 0.999),  # Exponential decay rates for the moment estimates
    eps=1e-7             # Small value to avoid division by zero
)

loss_fn = nn.CrossEntropyLoss()

# Store loss values for plotting
losses_rmsprop = []
losses_adam = []

for i in range(2):
    # New random inputs and labels per iteration
    inputs = torch.randn(16, 32)
    labels = torch.randint(0, 10, (16,))

    # RMSProp step
    outputs_rmsprop = model_rmsprop(inputs)
    loss_rmsprop = loss_fn(outputs_rmsprop, labels)
    loss_rmsprop.backward()
    rmsprop_optimizer.step()
    rmsprop_optimizer.zero_grad()
    losses_rmsprop.append(loss_rmsprop.item())
    print(f"[RMSProp] Iteration {i+1}, Loss: {loss_rmsprop.item():.4f}")

    # Adam step
    outputs_adam = model_adam(inputs)
    loss_adam = loss_fn(outputs_adam, labels)
    loss_adam.backward()
    adam_optimizer.step()
    adam_optimizer.zero_grad()
    losses_adam.append(loss_adam.item())
    print(f"[Adam] Iteration {i+1}, Loss: {loss_adam.item():.4f}")

# Plot the recorded losses
plt.plot(losses_rmsprop, label="RMSProp")
plt.plot(losses_adam, label="Adam")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.show()

Adaptive optimizers such as RMSProp and Adam often outperform SGD in specific situations:

  • When your data is noisy;
  • When the loss surface is highly non-convex with many local minima and saddle points;
  • When gradients vary significantly across parameters.

In these cases, adaptive optimizers help your model escape poor regions of the loss surface more efficiently. They are also valuable when you have limited time for manual learning rate tuning, or when different parameters require different learning rates due to varying input scales or network depth.

However, while adaptive optimizers usually provide faster convergence and are more forgiving with hyperparameter choices, they can sometimes result in worse generalization compared to SGD with momentum, especially on large-scale supervised learning tasks.
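
If generalization matters more than raw convergence speed, it is common to compare against plain SGD with momentum. A minimal PyTorch configuration, using the same architecture as make_model() above and illustrative hyperparameters, might look like this:

import torch.nn as nn
import torch.optim as optim

# Same architecture as make_model() in the example above
model_sgd = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)

sgd_optimizer = optim.SGD(
    model_sgd.parameters(),
    lr=0.01,        # SGD usually needs a larger learning rate than Adam or RMSProp
    momentum=0.9,   # classical momentum term
    nesterov=True   # Nesterov momentum, a common variant
)

Training with this optimizer uses exactly the same step/zero_grad loop as in the earlier example, so swapping optimizers for a comparison requires no other code changes.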
