Adam Optimizer and Bias Correction
Understanding the Adam optimizer begins with recognizing its role as an adaptive learning rate method that builds on the strengths of both momentum and RMSProp. Adam stands for Adaptive Moment Estimation: it maintains exponentially decaying averages of past gradients (the momentum component) and of past squared gradients (the adaptive scaling component). The derivation of Adam proceeds step by step, making it clear how each component contributes to its performance.
Suppose you are optimizing parameters $\theta$ of a model, and at each iteration $t$ you compute the gradient $g_t$ of the loss with respect to $\theta$. Adam maintains two moving averages (illustrated in the short sketch after this list):
- The first moment estimate (mean of gradients): $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$;
- The second moment estimate (uncentered variance): $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$.
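To make the two recursions concrete, here is a minimal sketch that applies both updates for a few steps; the constant gradient of 1.0 is an assumption chosen purely for illustration.

```python
beta1, beta2 = 0.9, 0.999   # typical decay rates
m, v = 0.0, 0.0             # both moments start at zero
g = 1.0                     # assumed constant gradient, for illustration only

for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g        # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment: mean of squared gradients
    print(f"t={t}: m={m:.4f}, v={v:.6f}")
```

Even though the gradient is always 1.0, the raw estimates stay far below 1 in the first iterations; this is the initialization bias addressed next.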
Here, $\beta_1$ and $\beta_2$ are decay rates for the moving averages, typically set to 0.9 and 0.999 respectively. Both $m_0$ and $v_0$ are initialized to zero. However, since these moving averages start at zero, they are biased toward zero, especially during the initial steps. To correct this, Adam introduces bias-corrected estimates (a quick numerical check follows the list below):
- Bias-corrected first moment: $\hat{m}_t = m_t / (1 - \beta_1^t)$;
- Bias-corrected second moment: $\hat{v}_t = v_t / (1 - \beta_2^t)$.
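The following minimal sketch, again assuming a constant gradient of 1.0, shows what the correction buys: the raw first moment creeps up slowly, while the bias-corrected estimate equals the true gradient from the very first step.

```python
beta1 = 0.9
m = 0.0
g = 1.0  # assumed constant gradient, for illustration only

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    print(f"t={t}: m={m:.3f}, m_hat={m_hat:.3f}")
# t=1: m=0.100, m_hat=1.000
# t=2: m=0.190, m_hat=1.000
# t=3: m=0.271, m_hat=1.000
```

With a constant gradient $g$, the recursion gives exactly $m_t = (1 - \beta_1^t)\, g$, so dividing by $1 - \beta_1^t$ recovers $g$ at every step.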
The actual parameter update rule then becomes:
$$\theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$
where $\alpha$ is the step size (learning rate) and $\varepsilon$ is a small constant to prevent division by zero (commonly $10^{-8}$). This approach ensures that each parameter has its own adaptive learning rate, scaled by the historical magnitude of its gradients.
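Putting the pieces together, here is a minimal sketch of a single Adam step applied to a whole parameter vector. The function name `adam_step` and the example values are assumptions for illustration; real frameworks expose this logic through their optimizer APIs. All operations are element-wise, which is what gives each parameter its own effective learning rate.

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array theta given its gradient g.
    Hypothetical helper for illustration; m and v are the running moments, t >= 1."""
    m = beta1 * m + (1 - beta1) * g            # update first moment
    v = beta2 * v + (1 - beta2) * g ** 2       # update second moment
    m_hat = m / (1 - beta1 ** t)               # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # element-wise update
    return theta, m, v

# Example usage with a hypothetical two-parameter model:
theta = np.array([0.5, -1.5])
m, v = np.zeros_like(theta), np.zeros_like(theta)
g = np.array([0.2, -0.8])  # gradient from some loss, assumed for illustration
theta, m, v = adam_step(theta, g, m, v, t=1)
```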
Adam is widely favored for its robustness to noisy gradients, fast convergence, and minimal hyperparameter tuning. It adapts the learning rate for each parameter, making it effective for problems with sparse gradients or varying feature scales. Adam is commonly used in training deep neural networks, natural language processing models, and any scenario where fast, reliable convergence is needed with minimal manual adjustment.
```python
import numpy as np
import matplotlib.pyplot as plt

# Define a simple quadratic loss: f(x) = (x - 2)^2
def loss(x):
    return (x - 2) ** 2

def grad(x):
    return 2 * (x - 2)

# Adam parameters
alpha = 0.1
beta1 = 0.9
beta2 = 0.999
eps = 1e-8
steps = 50

# SGD initialization
x_sgd = 8.0
sgd_traj = [x_sgd]

# Adam initialization
x_adam = 8.0
m = 0.0
v = 0.0
adam_traj = [x_adam]

for t in range(1, steps + 1):
    # SGD update
    g = grad(x_sgd)
    x_sgd -= alpha * g
    sgd_traj.append(x_sgd)

    # Adam update
    g = grad(x_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g ** 2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x_adam -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    adam_traj.append(x_adam)

# Plot the loss surface and both optimization trajectories
x_vals = np.linspace(-2, 8, 100)
y_vals = loss(x_vals)

plt.figure(figsize=(8, 5))
plt.plot(x_vals, y_vals, label="Loss Surface f(x)")
plt.plot(sgd_traj, [loss(x) for x in sgd_traj], "o-", label="SGD Path")
plt.plot(adam_traj, [loss(x) for x in adam_traj], "s-", label="Adam Path")
plt.xlabel("x")
plt.ylabel("Loss")
plt.title("Adam vs. SGD Parameter Updates")
plt.legend()
plt.grid(True)
plt.show()
```