Adam Optimizer and Bias Correction
Understanding the Adam optimizer begins with recognizing its role as an adaptive learning rate method that builds on the strengths of both momentum and RMSProp. Adam stands for Adaptive Moment Estimation: it maintains an exponentially decaying average of past gradients (momentum) together with an exponentially decaying average of past squared gradients (adaptive scaling). Deriving Adam step by step makes clear how each component contributes to its performance.
Suppose you are optimizing parameters θ of a model, and at each iteration t you compute the gradient g_t of the loss with respect to θ. Adam maintains two moving averages:
- The first moment estimate (mean of gradients): m_t = β1 * m_{t−1} + (1 − β1) * g_t;
- The second moment estimate (uncentered variance): v_t = β2 * v_{t−1} + (1 − β2) * g_t^2.
Here, β1 and β2 are decay rates for the moving averages, typically set to values like 0.9 and 0.999, respectively. Both m_0 and v_0 are initialized to zero. However, since the moving averages start at zero, they are biased toward zero, especially during the initial steps. To correct this, Adam introduces bias-corrected estimates (illustrated numerically after the list below):
- Bias-corrected first moment: m̂_t = m_t / (1 − β1^t);
- Bias-corrected second moment: v̂_t = v_t / (1 − β2^t).
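To see the effect concretely, here is a minimal sketch (illustrative values only) that feeds a constant gradient of 1.0 through the updates above. With β1 = 0.9, the raw first moment after one step is only 0.1, but dividing by (1 − β1^1) = 0.1 recovers the true value of 1.0; the same holds for the second moment.

```python
# Minimal sketch: bias correction with a constant gradient g = 1.0 (illustrative values only)
beta1, beta2 = 0.9, 0.999
g = 1.0
m = v = 0.0

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g          # raw first moment, biased toward zero early on
    v = beta2 * v + (1 - beta2) * g ** 2     # raw second moment, biased toward zero early on
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    print(f"t={t}: m={m:.3f}, m_hat={m_hat:.3f}, v={v:.6f}, v_hat={v_hat:.6f}")
```

For a constant gradient, the corrected estimates equal the true gradient (and its square) at every step, while the raw estimates only approach them as t grows.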
The actual parameter update rule then becomes:
θ_{t+1} = θ_t − α * m̂_t / (sqrt(v̂_t) + ε), where α is the step size (learning rate) and ε is a small constant that prevents division by zero (commonly 1e−8). This approach gives each parameter its own adaptive learning rate, scaled by the historical magnitude of its gradients.
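Collecting the pieces, one Adam step for a single scalar parameter can be written as a small helper function. The sketch below is illustrative only; the name adam_step and its default hyperparameters are assumptions, not a library API.

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta, given gradient g at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g           # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```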
Adam is widely favored for its robustness to noisy gradients, fast convergence, and minimal hyperparameter tuning. It adapts the learning rate for each parameter, making it effective for problems with sparse gradients or varying feature scales. Adam is commonly used in training deep neural networks, natural language processing models, and any scenario where fast, reliable convergence is needed with minimal manual adjustment.
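The script below puts these updates into practice: it minimizes the simple quadratic loss f(x) = (x − 2)^2, running plain gradient descent (SGD) and Adam side by side from the same starting point and plotting both trajectories on the loss surface.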
```python
import numpy as np
import matplotlib.pyplot as plt

# Define a simple quadratic loss: f(x) = (x - 2)^2
def loss(x):
    return (x - 2) ** 2

def grad(x):
    return 2 * (x - 2)

# Adam parameters
alpha = 0.1
beta1 = 0.9
beta2 = 0.999
eps = 1e-8
steps = 50

# SGD initialization
x_sgd = 8.0
sgd_traj = [x_sgd]

# Adam initialization
x_adam = 8.0
m = 0
v = 0
adam_traj = [x_adam]

for t in range(1, steps + 1):
    # SGD update
    g = grad(x_sgd)
    x_sgd -= alpha * g
    sgd_traj.append(x_sgd)

    # Adam update
    g = grad(x_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g ** 2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x_adam -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    adam_traj.append(x_adam)

# Plotting the trajectories
x_vals = np.linspace(-2, 8, 100)
y_vals = loss(x_vals)

plt.figure(figsize=(8, 5))
plt.plot(x_vals, y_vals, label="Loss Surface f(x)")
plt.plot(sgd_traj, [loss(x) for x in sgd_traj], "o-", label="SGD Path")
plt.plot(adam_traj, [loss(x) for x in adam_traj], "s-", label="Adam Path")
plt.xlabel("x")
plt.ylabel("Loss")
plt.title("Adam vs. SGD Parameter Updates")
plt.legend()
plt.grid(True)
plt.show()
```