Adaptive Update Behaviors
Adaptive optimization methods such as RMSProp, Adagrad, and Adam adjust the learning rate dynamically for each parameter during training. Unlike standard gradient descent, which uses a fixed or globally scheduled learning rate, adaptive methods use information from past gradients to modify the step size. This lets them respond to the geometry of the loss surface, taking larger steps for parameters whose gradients are small or infrequent and smaller steps for parameters whose gradients are large or frequent. As a result, adaptive methods can converge faster and keep making progress through difficult regions such as plateaus and saddle points, where raw gradients are small.
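For reference, these are the standard per-parameter update rules that the code below implements; the notation here is my own choice: \(\theta_t\) is the parameter, \(g_t\) the gradient at step \(t\), \(\alpha\) the base learning rate, and \(\epsilon\) a small constant for numerical stability.

\[
\begin{aligned}
\text{Adagrad:}\quad & G_t = G_{t-1} + g_t^2, & \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{G_t} + \epsilon}\, g_t \\
\text{RMSProp:}\quad & E[g^2]_t = \rho\, E[g^2]_{t-1} + (1 - \rho)\, g_t^2, & \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t} + \epsilon}\, g_t \\
\text{Adam:}\quad & m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
& \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, & \theta_{t+1} &= \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
\end{aligned}
\]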
import numpy as np
import matplotlib.pyplot as plt

# Simulated gradient sequence: noisy gradients with a drifting mean
np.random.seed(0)
steps = 50
grads = np.random.randn(steps) * 0.2 + np.linspace(-0.5, 0.5, steps)

# Parameter and learning rate history
param_sgd = [0]
param_adagrad = [0]
param_rmsprop = [0]
param_adam = [0]
lr_sgd = [0.1] * steps
lr_adagrad = []
lr_rmsprop = []
lr_adam = []

# Adagrad state
eps = 1e-8
G = 0

# RMSProp state
decay = 0.9
E_grad2 = 0

# Adam state
m = 0
v = 0
beta1 = 0.9
beta2 = 0.999

# Base learning rate shared by all methods
base_lr = 0.1

for t in range(steps):
    g = grads[t]

    # SGD: fixed learning rate
    param_sgd.append(param_sgd[-1] - lr_sgd[t] * g)

    # Adagrad: accumulate all past squared gradients
    G += g ** 2
    lr_a = base_lr / (np.sqrt(G) + eps)
    lr_adagrad.append(lr_a)
    param_adagrad.append(param_adagrad[-1] - lr_a * g)

    # RMSProp: exponential moving average of squared gradients
    E_grad2 = decay * E_grad2 + (1 - decay) * g ** 2
    lr_r = base_lr / (np.sqrt(E_grad2) + eps)
    lr_rmsprop.append(lr_r)
    param_rmsprop.append(param_rmsprop[-1] - lr_r * g)

    # Adam: bias-corrected moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** (t + 1))
    v_hat = v / (1 - beta2 ** (t + 1))
    lr_adam_t = base_lr / (np.sqrt(v_hat) + eps)  # effective per-step learning rate
    lr_adam.append(lr_adam_t)
    param_adam.append(param_adam[-1] - lr_adam_t * m_hat)

fig, axs = plt.subplots(1, 2, figsize=(14, 5))

# Plot effective learning rates
axs[0].plot(lr_adagrad, label="Adagrad")
axs[0].plot(lr_rmsprop, label="RMSProp")
axs[0].plot(lr_adam, label="Adam")
axs[0].hlines(base_lr, 0, steps, colors='gray', linestyles='dashed', label="SGD (fixed)")
axs[0].set_title("Adaptive Learning Rates Over Steps")
axs[0].set_xlabel("Step")
axs[0].set_ylabel("Learning Rate")
axs[0].legend()

# Plot parameter trajectories
axs[1].plot(param_sgd, label="SGD")
axs[1].plot(param_adagrad, label="Adagrad")
axs[1].plot(param_rmsprop, label="RMSProp")
axs[1].plot(param_adam, label="Adam")
axs[1].set_title("Parameter Trajectories")
axs[1].set_xlabel("Step")
axs[1].set_ylabel("Parameter Value")
axs[1].legend()

plt.tight_layout()
plt.show()
Adaptive methods often outperform standard gradient descent when the data has sparse features or when gradient magnitudes vary widely across parameters, as the short sketch below illustrates for the sparse case. Because they tune per-parameter step sizes automatically, they are less sensitive to the choice of base learning rate and often converge faster, which is especially valuable in deep learning, where manual tuning is difficult.
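To make the sparse-feature claim concrete, here is a minimal toy sketch of my own, separate from the lesson's simulation above: two parameters share the same base learning rate, but one receives a gradient at every step while the other is updated only about 10% of the time. Adagrad's squared-gradient accumulator grows much faster for the dense parameter, so its effective learning rate shrinks, while the rarely updated parameter keeps taking comparatively large steps.

import numpy as np

np.random.seed(1)
steps = 100
base_lr = 0.1
eps = 1e-8

# Two parameters: index 0 gets a gradient every step,
# index 1 only on roughly 10% of steps (a "sparse" feature).
G = np.zeros(2)  # Adagrad accumulators of squared gradients
for t in range(steps):
    g = np.array([
        np.random.randn(),
        np.random.randn() if np.random.rand() < 0.1 else 0.0,
    ])
    G += g ** 2

# Adagrad's effective per-parameter learning rate after training
effective_lr = base_lr / (np.sqrt(G) + eps)
print("dense parameter effective lr :", effective_lr[0])
print("sparse parameter effective lr:", effective_lr[1])
# Typically the sparse parameter's effective learning rate stays
# several times larger than the dense parameter's.

This is the same qualitative effect shown in the left-hand plot of the simulation above, just separated into a frequently and an infrequently updated parameter.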