Learning Rate and Convergence
Understanding the role of the learning rate in gradient descent is central to mastering optimization in machine learning. The learning rate, often denoted as η (eta), determines the size of each update step as you descend the loss surface toward a minimum. Mathematically, the update rule for a parameter θ can be written as:
θ_{t+1} = θ_t − η ∇L(θ_t)

where ∇L(θ_t) is the gradient of the loss function at the current parameter value. The choice of η directly impacts both how quickly you approach a minimum and whether you actually converge at all.
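As a quick concrete illustration, the sketch below applies a single update step to a toy quadratic loss. The specific loss, learning rate, and starting point are arbitrary choices for this example, not part of the lesson.

```python
# Minimal sketch: one gradient descent update, theta_new = theta - eta * grad_L(theta).
# The loss here is an arbitrary toy example, L(theta) = (theta - 3)^2.

def grad_L(theta):
    return 2 * (theta - 3)   # derivative of (theta - 3)^2

eta = 0.1      # learning rate
theta = 0.0    # arbitrary starting point
theta = theta - eta * grad_L(theta)
print(theta)   # 0.0 - 0.1 * (-6) = 0.6, a small step toward the minimum at theta = 3
```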
If the learning rate is too small, updates are tiny and the algorithm converges very slowly. If it is too large, the updates may overshoot the minimum or even diverge entirely, causing the loss to oscillate or grow. For a simple quadratic loss L(θ) = aθ² + bθ + c (with a > 0), the stability of gradient descent can be analyzed through the second derivative 2a, or, in the multidimensional case, through the eigenvalues of the Hessian matrix. The update becomes unstable if the learning rate exceeds 2/L, where L here denotes the Lipschitz constant of the gradient (often the largest eigenvalue of the Hessian), not the loss itself. For the quadratic above, L = 2a, so gradient descent is stable only when η < 1/a. Staying below this threshold ensures that each update moves you closer to the minimum rather than away from it.
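To make the threshold concrete, here is a minimal sketch (not part of the lesson's code) that checks it empirically on the loss L(θ) = (θ − 3)² used in the example further below. Its second derivative is 2, so the critical learning rate is 2/2 = 1.0: rates below it should converge, rates above it should diverge.

```python
# Minimal sketch: empirical check of the 2/L stability threshold for
# L(theta) = (theta - 3)^2, where the second derivative (and Lipschitz
# constant of the gradient) is 2, so the critical learning rate is 1.0.

def grad(theta):
    return 2 * (theta - 3)

for lr in [0.5, 0.9, 1.1]:   # below, just below, and above the threshold
    theta = -1.0
    for _ in range(50):
        theta = theta - lr * grad(theta)
    print(f"lr={lr}: theta after 50 steps = {theta:.4f}")
# Expected: lr=0.5 and lr=0.9 end essentially at the minimum theta = 3,
# while lr=1.1 has blown up far away from it.
```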
In summary, the learning rate must stay below a critical value determined by the curvature of the loss surface. This balance is crucial: too low and convergence is slow; too high and convergence may not happen at all.
Practical intuition: when tuning learning rates, start with a small value (such as 0.01 or 0.001) and watch the learning curve. If convergence is slow, increase the learning rate gradually; if the loss spikes or oscillates, decrease it. Adaptive optimizers such as Adam or RMSprop can help, but understanding the basic effect of the learning rate lets you diagnose and fix optimization issues more effectively.
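The example below visualizes these effects on the quadratic loss L(θ) = (θ − 3)²: it runs 15 gradient descent steps from θ = −1 with a small, a moderate, and a large learning rate, and plots each path on the loss curve.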
```python
import numpy as np
import matplotlib.pyplot as plt

# Quadratic loss: L(theta) = (theta - 3)^2
def loss(theta):
    return (theta - 3) ** 2

def grad(theta):
    return 2 * (theta - 3)

thetas = np.linspace(-1, 7, 200)
losses = loss(thetas)

learning_rates = [0.05, 0.3, 0.8]
colors = ["blue", "orange", "red"]
labels = ["Small (0.05)", "Moderate (0.3)", "Large (0.8)"]

plt.figure(figsize=(10, 6))
plt.plot(thetas, losses, "k--", label="Loss surface")

for lr, color, label in zip(learning_rates, colors, labels):
    theta = -1
    path = [theta]
    for _ in range(15):
        theta = theta - lr * grad(theta)
        path.append(theta)
    path = np.array(path)
    plt.plot(path, loss(path), "o-", color=color, label=f"LR {label}")

plt.xlabel("Theta")
plt.ylabel("Loss")
plt.title("Gradient Descent Paths with Different Learning Rates")
plt.legend()
plt.grid(True)
plt.show()
```
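Building on the tuning advice above, the sketch below shows one crude way to automate the "decrease it if the loss gets worse" rule of thumb: run a short trial with the current learning rate, and halve it whenever the trial ends with a loss no better than where it started. The starting rate of 2.0, the halving factor, and the 20-step trial budget are arbitrary choices for illustration, not a recommended recipe.

```python
# Minimal sketch of a crude learning-rate halving heuristic (illustrative only):
# run a short trial; if the loss did not improve, halve the rate and retry.

def loss(theta):
    return (theta - 3) ** 2

def grad(theta):
    return 2 * (theta - 3)

lr = 2.0        # deliberately too large to start
theta0 = -1.0   # same starting point as the plotting example

while lr > 1e-4:
    theta = theta0
    for _ in range(20):                 # short trial run
        theta = theta - lr * grad(theta)
    if loss(theta) < loss(theta0):      # trial improved the loss: keep this rate
        print(f"Accepted learning rate: {lr}")
        break
    lr = lr / 2                         # loss got worse or stalled: halve and retry
```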