Adam Optimizer and Bias Correction | Adaptive Methods
Mathematics of Optimization in ML

Adam Optimizer and Bias Correction

Understanding the Adam optimizer begins with recognizing its role as an adaptive learning rate method that builds on the strengths of both momentum and RMSProp. Adam stands for Adaptive Moment Estimation: it maintains an exponentially decaying average of past gradients (momentum) and an exponentially decaying average of past squared gradients (adaptive scaling). The derivation below proceeds step by step, making clear how each component contributes to the optimizer's performance.

Suppose you are optimizing parameters θ of a model, and at each iteration t you compute the gradient g_t of the loss with respect to θ. Adam maintains two moving averages (a short numerical sketch of both recursions follows the list):

  • The first moment estimate (mean of gradients): m_t = β₁ * m_{t-1} + (1 - β₁) * g_t;
  • The second moment estimate (uncentered variance): v_t = β₂ * v_{t-1} + (1 - β₂) * (g_t)^2.
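
To see this in action, here is a short numerical sketch (illustrative, not part of the lesson's code) that feeds a constant gradient of 1.0 into both recursions. Notice how slowly m_t and v_t climb toward the values they are meant to estimate:

beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0
for t in range(1, 6):
    g = 1.0                               # constant gradient
    m = beta1 * m + (1 - beta1) * g       # first moment update
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment update
    print(t, round(m, 4), round(v, 6))
# t=1: m=0.1,    v=0.001     -- far below the true values 1.0 and 1.0
# t=5: m=0.4095, v=0.00499   -- still strongly pulled toward zero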

Here, β₁ and β₂ are decay rates for the moving averages, typically set to values like 0.9 and 0.999 respectively. Both m_0 and v_0 are initialized at zero. However, since these moving averages start at zero, they are biased towards zero, especially during the initial steps. To correct this, Adam introduces bias-corrected estimates (a worked example for t = 1 follows the list):

  • Bias-corrected first moment: m̂_t = m_t / (1 - β₁^t);
  • Bias-corrected second moment: v̂_t = v_t / (1 - β₂^t).
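
For example, at t = 1 with gradient g_1, the raw averages are m_1 = (1 - β₁) * g_1 = 0.1 * g_1 and v_1 = (1 - β₂) * (g_1)^2 = 0.001 * (g_1)^2, far smaller than the quantities they estimate. Dividing by 1 - β₁^1 = 0.1 and 1 - β₂^1 = 0.001 gives m̂_1 = g_1 and v̂_1 = (g_1)^2, which are no longer pulled toward zero. As t grows, β₁^t and β₂^t shrink toward zero, so the correction gradually fades away.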

The actual parameter update rule then becomes:

θ_{t+1} = θ_t - α * m̂_t / (sqrt(v̂_t) + ε)

where α is the step size (learning rate), and ε is a small constant to prevent division by zero (commonly 1e-8). This approach ensures that each parameter has its own adaptive learning rate, scaled by the historical magnitude of its gradients.
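
The update rule translates almost line for line into code. Below is a minimal sketch of a single Adam step for a NumPy parameter vector; the function name adam_step and its argument layout are illustrative, not a library API:

import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to parameters theta given gradient g at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g            # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

Note that the effective per-parameter step is α / (sqrt(v̂_t) + ε) applied to m̂_t: parameters with consistently large gradients get scaled-down steps, while parameters with small or sparse gradients keep relatively larger ones.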

Note

Adam is widely favored for its robustness to noisy gradients, fast convergence, and minimal hyperparameter tuning. It adapts the learning rate for each parameter, making it effective for problems with sparse gradients or varying feature scales. Adam is commonly used in training deep neural networks, natural language processing models, and any scenario where fast, reliable convergence is needed with minimal manual adjustment.
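
In practice you rarely implement Adam by hand; deep learning frameworks ship it ready to use. As a hedged illustration (assuming PyTorch is installed; the model and data here are placeholders), a typical training loop looks like this:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)

x = torch.randn(32, 10)                       # placeholder batch
y = torch.randn(32, 1)

for epoch in range(100):
    optimizer.zero_grad()                     # clear accumulated gradients
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                           # compute gradients
    optimizer.step()                          # Adam update with bias correction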

import numpy as np
import matplotlib.pyplot as plt

# Define a simple quadratic loss: f(x) = (x - 2)^2
def loss(x):
    return (x - 2) ** 2

def grad(x):
    return 2 * (x - 2)

# Adam parameters
alpha = 0.1
beta1 = 0.9
beta2 = 0.999
eps = 1e-8
steps = 50

# SGD initialization
x_sgd = 8.0
sgd_traj = [x_sgd]

# Adam initialization
x_adam = 8.0
m = 0
v = 0
adam_traj = [x_adam]

for t in range(1, steps + 1):
    # SGD update
    g = grad(x_sgd)
    x_sgd -= alpha * g
    sgd_traj.append(x_sgd)

    # Adam update
    g = grad(x_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g ** 2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x_adam -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    adam_traj.append(x_adam)

# Plotting the trajectories
x_vals = np.linspace(-2, 8, 100)
y_vals = loss(x_vals)

plt.figure(figsize=(8, 5))
plt.plot(x_vals, y_vals, label="Loss Surface f(x)")
plt.plot(sgd_traj, [loss(x) for x in sgd_traj], "o-", label="SGD Path")
plt.plot(adam_traj, [loss(x) for x in adam_traj], "s-", label="Adam Path")
plt.xlabel("x")
plt.ylabel("Loss")
plt.title("Adam vs. SGD Parameter Updates")
plt.legend()
plt.grid(True)
plt.show()
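
When you run the script, SGD's step is proportional to the current gradient and therefore shrinks as x_sgd approaches the minimum at x = 2, while Adam's bias-corrected ratio m̂_t / sqrt(v̂_t) stays close to 1 as long as the gradient keeps its sign, so x_adam moves at a roughly constant rate of about α per step. On this smooth one-dimensional quadratic that makes plain SGD look faster; Adam's advantages appear on noisy, high-dimensional problems where per-parameter scaling and momentum matter.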

Which of the following statements about Adam's update rule and bias correction are true?

Select the correct answer

