Learn Adam Optimizer and Bias Correction | Adaptive Methods
Mathematics of Optimization in ML

Adam Optimizer and Bias Correction

Understanding the Adam optimizer begins with recognizing its role as an adaptive learning rate method that builds upon the strengths of both momentum and RMSProp. Adam stands for Adaptive Moment Estimation, and it combines the concept of maintaining exponentially decaying averages of past gradients (momentum) with squared gradients (adaptive scaling). The derivation of Adam proceeds step by step, making it clear how each component contributes to its performance.

Suppose you are optimizing the parameters θ of a model, and at each iteration t you compute the gradient g_t of the loss with respect to θ. Adam maintains two moving averages (a minimal code sketch of these updates follows the list):

  • The first moment estimate (mean of gradients): m_t = β₁ * m_{t-1} + (1 - β₁) * g_t;
  • The second moment estimate (uncentered variance): v_t = β₂ * v_{t-1} + (1 - β₂) * (g_t)^2.
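To make the recursion concrete, here is a minimal sketch of the two moment updates as a standalone helper; the function name and the default decay rates are illustrative, not part of any particular library:

# Minimal sketch of Adam's moving-average (moment) updates.
def update_moments(m_prev, v_prev, g, beta1=0.9, beta2=0.999):
    m = beta1 * m_prev + (1 - beta1) * g       # first moment: running mean of gradients
    v = beta2 * v_prev + (1 - beta2) * g ** 2  # second moment: running mean of squared gradients
    return m, v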

Here, β₁ and β₂ are decay rates for the moving averages, typically set to values like 0.9 and 0.999 respectively. Both m_0 and v_0 are initialized at zero. However, since these moving averages start at zero, they are biased towards zero, especially during the initial steps. To correct this, Adam introduces bias-corrected estimates (a short numeric check follows the list):

  • Bias-corrected first moment: m̂_t = m_t / (1 - β₁^t);
  • Bias-corrected second moment: v̂_t = v_t / (1 - β₂^t).
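A quick numeric check shows why this correction matters in the first few steps. Assuming a constant gradient g = 1.0 and the default β₁ = 0.9 (values chosen purely for illustration), the raw first moment after one step is far too small, while the corrected estimate recovers the true gradient:

# Effect of bias correction at t = 1 (illustrative values)
beta1 = 0.9
g = 1.0
m1 = (1 - beta1) * g             # raw estimate: 0.1, biased towards zero
m1_hat = m1 / (1 - beta1 ** 1)   # corrected estimate: 1.0, matching the true gradient
print(m1, m1_hat)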

The actual parameter update rule then becomes:

θ_{t+1} = θ_t - α * m̂_t / (sqrt(v̂_t) + ε)

where α is the step size (learning rate), and ε is a small constant to prevent division by zero (commonly 1e-8). This approach ensures that each parameter has its own adaptive learning rate, scaled by the historical magnitude of its gradients.
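Putting the pieces together, a single Adam update for one parameter can be written as a small function; this is only a sketch assembled from the formulas above (names and defaults are illustrative), not a library API:

import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update the biased moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias-correct them
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Apply the parameter update
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

The same sequence of steps appears inside the loop of the full comparison demo below.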

Note

Adam is widely favored for its robustness to noisy gradients, fast convergence, and minimal hyperparameter tuning. It adapts the learning rate for each parameter, making it effective for problems with sparse gradients or varying feature scales. Adam is commonly used in training deep neural networks, natural language processing models, and any scenario where fast, reliable convergence is needed with minimal manual adjustment.
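The demo below applies exactly this recipe to a one-dimensional quadratic loss f(x) = (x - 2)^2, running plain SGD and Adam side by side for 50 steps and plotting both optimization paths against the loss surface.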

import numpy as np
import matplotlib.pyplot as plt

# Define a simple quadratic loss: f(x) = (x - 2)^2
def loss(x):
    return (x - 2) ** 2

def grad(x):
    return 2 * (x - 2)

# Adam parameters
alpha = 0.1
beta1 = 0.9
beta2 = 0.999
eps = 1e-8
steps = 50

# SGD initialization
x_sgd = 8.0
sgd_traj = [x_sgd]

# Adam initialization
x_adam = 8.0
m = 0
v = 0
adam_traj = [x_adam]

for t in range(1, steps + 1):
    # SGD update
    g = grad(x_sgd)
    x_sgd -= alpha * g
    sgd_traj.append(x_sgd)

    # Adam update
    g = grad(x_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g ** 2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x_adam -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    adam_traj.append(x_adam)

# Plotting the trajectories
x_vals = np.linspace(-2, 8, 100)
y_vals = loss(x_vals)

plt.figure(figsize=(8, 5))
plt.plot(x_vals, y_vals, label="Loss Surface f(x)")
plt.plot(sgd_traj, [loss(x) for x in sgd_traj], "o-", label="SGD Path")
plt.plot(adam_traj, [loss(x) for x in adam_traj], "s-", label="Adam Path")
plt.xlabel("x")
plt.ylabel("Loss")
plt.title("Adam vs. SGD Parameter Updates")
plt.legend()
plt.grid(True)
plt.show()


