
RMSProp and Adam Optimizers

RMSProp and Adam are two of the most widely used adaptive optimization algorithms in deep learning. Both optimizers address the limitations of basic stochastic gradient descent (SGD) and momentum, especially when training deep neural networks on complex, noisy, or non-stationary data. The key innovation in both RMSProp and Adam is their ability to adapt the learning rate for each parameter individually, based on the history of gradients.

RMSProp

RMSProp (Root Mean Square Propagation) maintains a moving average of the squared gradients for each parameter. The moving average of the squared gradient at time step t is calculated as:

E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1 - \rho) \cdot g_t^2

where:

  • E[g^2]_t is the moving average of the squared gradients at time t;
  • \rho is the decay rate (commonly set to 0.9);
  • g_t is the gradient at time t.

The parameter update is then scaled by this moving average:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t

where:

  • \theta_t is the parameter value at time t;
  • \eta is the learning rate;
  • \epsilon is a small constant to prevent division by zero.

This adaptive scaling helps stabilize training by normalizing updates: parameters with consistently large gradients receive smaller updates, while those with smaller or infrequent gradients receive relatively larger updates.
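
To make the update rule concrete, here is a minimal from-scratch sketch of RMSProp acting on a single parameter vector. It uses NumPy and a toy quadratic loss purely for illustration; the variable names and values are not part of any library API.

import numpy as np

eta = 0.001   # learning rate
rho = 0.9     # decay rate of the squared-gradient average
eps = 1e-8    # small constant for numerical stability

theta = np.array([0.5, -1.2, 2.0])
avg_sq_grad = np.zeros_like(theta)  # E[g^2], one entry per parameter

for t in range(1, 4):                     # a few illustrative steps
    g = 2 * theta                         # toy gradient of f(theta) = ||theta||^2
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g**2
    theta = theta - eta * g / np.sqrt(avg_sq_grad + eps)
    print(f"step {t}: theta = {theta}")

Note how each parameter is divided by the square root of its own squared-gradient average, so coordinates with large recent gradients take smaller steps.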

Adam

Adam (Adaptive Moment Estimation) builds upon RMSProp by maintaining two moving averages:

  • The first moment (mean of gradients):
m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t
  • The second moment (uncentered variance of gradients):
v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2

where:

  • m_t is the first moment estimate at time t;
  • v_t is the second moment estimate at time t;
  • \beta_1 and \beta_2 are the decay rates for the moment estimates (commonly 0.9 and 0.999).

Adam also includes bias correction terms to counteract initialization effects:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

The parameter update step for Adam is:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t

Adam combines the benefits of momentum (smoother updates) and adaptive learning rates (per-parameter scaling), often leading to faster and more robust convergence.
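
As with RMSProp, a minimal from-scratch sketch of a single-parameter Adam update helps connect the formulas above; the toy gradient and values below are illustrative only, not library code.

import numpy as np

eta = 0.001               # learning rate
beta1, beta2 = 0.9, 0.999 # decay rates for the moment estimates
eps = 1e-8                # numerical stability constant

theta = np.array([0.5, -1.2, 2.0])
m = np.zeros_like(theta)  # first moment estimate
v = np.zeros_like(theta)  # second moment estimate

for t in range(1, 4):                    # a few illustrative steps
    g = 2 * theta                        # toy gradient of f(theta) = ||theta||^2
    m = beta1 * m + (1 - beta1) * g      # update biased first moment
    v = beta2 * v + (1 - beta2) * g**2   # update biased second moment
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    print(f"step {t}: theta = {theta}")

The example below configures both RMSProp and Adam in PyTorch and runs a few training steps on random data so you can compare the reported loss values.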

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

# Example model
# Note: nn.CrossEntropyLoss expects raw logits, so no Softmax layer is added;
# the loss applies log-softmax internally.
def make_model():
    return nn.Sequential(
        nn.Linear(32, 64),
        nn.ReLU(),
        nn.Linear(64, 10)
    )

model_rmsprop = make_model()
model_adam = make_model()

# Configuring RMSProp optimizer
rmsprop_optimizer = optim.RMSprop(
    model_rmsprop.parameters(),
    lr=0.001,      # Learning rate
    alpha=0.9,     # Smoothing constant (decay rate of the squared-gradient average)
    momentum=0.0,  # Momentum parameter
    eps=1e-7       # Small value to avoid division by zero
)

# Configuring Adam optimizer
adam_optimizer = optim.Adam(
    model_adam.parameters(),
    lr=0.001,            # Learning rate
    betas=(0.9, 0.999),  # Exponential decay rates for the moment estimates
    eps=1e-7             # Small value to avoid division by zero
)

loss_fn = nn.CrossEntropyLoss()

# Store loss values for plotting
losses_rmsprop = []
losses_adam = []

for i in range(2):
    # New random inputs and labels per iteration
    inputs = torch.randn(16, 32)
    labels = torch.randint(0, 10, (16,))

    # RMSProp step
    outputs_rmsprop = model_rmsprop(inputs)
    loss_rmsprop = loss_fn(outputs_rmsprop, labels)
    loss_rmsprop.backward()
    rmsprop_optimizer.step()
    rmsprop_optimizer.zero_grad()
    losses_rmsprop.append(loss_rmsprop.item())
    print(f"[RMSProp] Iteration {i+1}, Loss: {loss_rmsprop.item():.4f}")

    # Adam step
    outputs_adam = model_adam(inputs)
    loss_adam = loss_fn(outputs_adam, labels)
    loss_adam.backward()
    adam_optimizer.step()
    adam_optimizer.zero_grad()
    losses_adam.append(loss_adam.item())
    print(f"[Adam] Iteration {i+1}, Loss: {loss_adam.item():.4f}")

# Plot the recorded losses
plt.plot(losses_rmsprop, label="RMSProp")
plt.plot(losses_adam, label="Adam")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.show()

Adaptive optimizers such as RMSProp and Adam often outperform SGD in specific situations:

  • When your data is noisy;
  • When the loss surface is highly non-convex with many local minima and saddle points;
  • When gradients vary significantly across parameters.

In these cases, adaptive optimizers help your model escape poor regions of the loss surface more efficiently. They are also valuable when you have limited time for manual learning rate tuning, or when different parameters require different learning rates due to varying input scales or network depth.

However, while adaptive optimizers usually provide faster convergence and are more forgiving with hyperparameter choices, they can sometimes result in worse generalization compared to SGD with momentum, especially on large-scale supervised learning tasks.
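
If generalization matters more than raw convergence speed, it is common to compare against plain SGD with momentum. A minimal PyTorch configuration, using the same architecture as make_model() above and illustrative hyperparameters, might look like this:

import torch.nn as nn
import torch.optim as optim

# Same architecture as make_model() in the example above
model_sgd = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)

sgd_optimizer = optim.SGD(
    model_sgd.parameters(),
    lr=0.01,        # SGD usually needs a larger learning rate than Adam or RMSProp
    momentum=0.9,   # classical momentum term
    nesterov=True   # Nesterov momentum, a common variant
)

Training with this optimizer uses exactly the same step/zero_grad loop as in the earlier example, so swapping optimizers for a comparison requires no other code changes.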
