Adam Optimizer and Bias Correction
Understanding the Adam optimizer begins with recognizing its role as an adaptive learning rate method that builds on the strengths of both momentum and RMSProp. Adam stands for Adaptive Moment Estimation: it maintains exponentially decaying averages of past gradients (the momentum component) and of past squared gradients (the adaptive scaling component). The derivation of Adam proceeds step by step, making it clear how each component contributes to its performance.
Suppose you are optimizing parameters $\theta$ of a model, and at each iteration $t$ you compute the gradient $g_t$ of the loss with respect to $\theta$. Adam maintains two moving averages (illustrated in the short sketch after this list):
- The first moment estimate (mean of gradients): $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$;
- The second moment estimate (uncentered variance): $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$.
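To make the two recursions concrete, here is a minimal sketch that applies both updates for a few steps; the constant gradient of 1.0 is an assumption chosen purely for illustration.

```python
beta1, beta2 = 0.9, 0.999   # typical decay rates
m, v = 0.0, 0.0             # both moments start at zero
g = 1.0                     # assumed constant gradient, for illustration only

for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g        # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment: mean of squared gradients
    print(f"t={t}: m={m:.4f}, v={v:.6f}")
```

Even though the gradient is always 1.0, the raw estimates stay far below 1 in the first iterations; this is the initialization bias addressed next.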
Here, $\beta_1$ and $\beta_2$ are decay rates for the moving averages, typically set to 0.9 and 0.999 respectively. Both $m_0$ and $v_0$ are initialized to zero. However, since these moving averages start at zero, they are biased toward zero, especially during the initial steps. To correct this, Adam introduces bias-corrected estimates (a quick numerical check follows the list below):
- Bias-corrected first moment: $\hat{m}_t = m_t / (1 - \beta_1^t)$;
- Bias-corrected second moment: $\hat{v}_t = v_t / (1 - \beta_2^t)$.
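The following minimal sketch, again assuming a constant gradient of 1.0, shows what the correction buys: the raw first moment creeps up slowly, while the bias-corrected estimate equals the true gradient from the very first step.

```python
beta1 = 0.9
m = 0.0
g = 1.0  # assumed constant gradient, for illustration only

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    print(f"t={t}: m={m:.3f}, m_hat={m_hat:.3f}")
# t=1: m=0.100, m_hat=1.000
# t=2: m=0.190, m_hat=1.000
# t=3: m=0.271, m_hat=1.000
```

With a constant gradient $g$, the recursion gives exactly $m_t = (1 - \beta_1^t)\, g$, so dividing by $1 - \beta_1^t$ recovers $g$ at every step.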
The actual parameter update rule then becomes:
$$\theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$
where $\alpha$ is the step size (learning rate) and $\varepsilon$ is a small constant to prevent division by zero (commonly $10^{-8}$). This approach ensures that each parameter has its own adaptive learning rate, scaled by the historical magnitude of its gradients.
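Putting the pieces together, here is a minimal sketch of a single Adam step applied to a whole parameter vector. The function name `adam_step` and the example values are assumptions for illustration; real frameworks expose this logic through their optimizer APIs. All operations are element-wise, which is what gives each parameter its own effective learning rate.

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array theta given its gradient g.
    Hypothetical helper for illustration; m and v are the running moments, t >= 1."""
    m = beta1 * m + (1 - beta1) * g            # update first moment
    v = beta2 * v + (1 - beta2) * g ** 2       # update second moment
    m_hat = m / (1 - beta1 ** t)               # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # element-wise update
    return theta, m, v

# Example usage with a hypothetical two-parameter model:
theta = np.array([0.5, -1.5])
m, v = np.zeros_like(theta), np.zeros_like(theta)
g = np.array([0.2, -0.8])  # gradient from some loss, assumed for illustration
theta, m, v = adam_step(theta, g, m, v, t=1)
```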
Adam is widely favored for its robustness to noisy gradients, fast convergence, and minimal hyperparameter tuning. It adapts the learning rate for each parameter, making it effective for problems with sparse gradients or varying feature scales. Adam is commonly used in training deep neural networks, natural language processing models, and any scenario where fast, reliable convergence is needed with minimal manual adjustment.
```python
import numpy as np
import matplotlib.pyplot as plt

# Define a simple quadratic loss: f(x) = (x - 2)^2
def loss(x):
    return (x - 2) ** 2

def grad(x):
    return 2 * (x - 2)

# Adam parameters
alpha = 0.1
beta1 = 0.9
beta2 = 0.999
eps = 1e-8
steps = 50

# SGD initialization
x_sgd = 8.0
sgd_traj = [x_sgd]

# Adam initialization
x_adam = 8.0
m = 0.0
v = 0.0
adam_traj = [x_adam]

for t in range(1, steps + 1):
    # SGD update
    g = grad(x_sgd)
    x_sgd -= alpha * g
    sgd_traj.append(x_sgd)

    # Adam update
    g = grad(x_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g ** 2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x_adam -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    adam_traj.append(x_adam)

# Plot the loss surface and both optimization trajectories
x_vals = np.linspace(-2, 8, 100)
y_vals = loss(x_vals)

plt.figure(figsize=(8, 5))
plt.plot(x_vals, y_vals, label="Loss Surface f(x)")
plt.plot(sgd_traj, [loss(x) for x in sgd_traj], "o-", label="SGD Path")
plt.plot(adam_traj, [loss(x) for x in adam_traj], "s-", label="Adam Path")
plt.xlabel("x")
plt.ylabel("Loss")
plt.title("Adam vs. SGD Parameter Updates")
plt.legend()
plt.grid(True)
plt.show()
```