Learning Rate and Convergence
Understanding the role of the learning rate in gradient descent is central to mastering optimization in machine learning. The learning rate, often denoted as η (eta), determines the size of each update step as you descend the loss surface toward a minimum. Mathematically, the update rule for a parameter θ can be written as:
θ_{t+1} = θ_t − η ∇L(θ_t)

where ∇L(θ_t) is the gradient of the loss function at the current parameter value. The choice of η directly impacts both how quickly you approach a minimum and whether you actually converge at all.
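To make the update rule concrete, here is a minimal sketch in plain Python, using the same quadratic loss L(θ) = (θ − 3)² as the larger example later in this lesson; the starting point, learning rate, and step count are arbitrary illustrative choices:

# One-parameter gradient descent: repeatedly apply theta <- theta - eta * grad(theta)
def grad(theta):
    return 2 * (theta - 3)  # dL/dtheta for L(theta) = (theta - 3)^2

eta = 0.1      # learning rate (illustrative choice)
theta = 0.0    # arbitrary starting point
for t in range(50):
    theta = theta - eta * grad(theta)  # one update step per iteration

print(theta)   # approaches the minimizer theta = 3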
If the learning rate is too small, updates are tiny and the algorithm converges very slowly. If it is too large, the updates may overshoot the minimum or even diverge entirely, causing the loss to oscillate or increase. For a simple quadratic loss L(θ) = aθ² + bθ + c (with a > 0), the stability of gradient descent can be analyzed by looking at the second derivative in one dimension or, in the multidimensional case, at the eigenvalues of the Hessian. The update becomes unstable if the learning rate exceeds 2/ℓ, where ℓ is the Lipschitz constant of the gradient (often the largest eigenvalue of the Hessian). For the quadratic above, the second derivative is 2a, so the threshold is η < 1/a; staying below it ensures that each update moves you closer to the minimum rather than away from it.
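As a quick numerical check of this threshold (a small sketch, not part of the lesson's main example): for L(θ) = (θ − 3)², the gradient is 2(θ − 3), its Lipschitz constant is 2, and the critical learning rate is 2/2 = 1. The learning rates tried below are illustrative choices on either side of that value.

# Critical learning rate for L(theta) = (theta - 3)^2 is 2 / 2 = 1
def grad(theta):
    return 2 * (theta - 3)

for eta in [0.5, 0.9, 1.1]:          # below, just below, and above the threshold
    theta = -1.0
    for _ in range(20):
        theta = theta - eta * grad(theta)
    print(f"eta={eta}: theta after 20 steps = {theta:.3f}")
# eta=0.5 and eta=0.9 end near the minimizer 3; eta=1.1 moves farther away each step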
In summary, the learning rate must be chosen such that it is less than a critical value determined by the curvature of the loss surface. This balance is crucial: too low and convergence is slow; too high and convergence may not occur at all.
Practical intuition: when tuning learning rates in machine learning, start with a small value (such as 0.01 or 0.001) and observe the learning curve. If convergence is slow, increase the learning rate gradually. If the loss spikes or oscillates, decrease it. Adaptive optimizers can help, but understanding the basic effect of learning rate helps you diagnose and fix optimization issues more effectively.
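One simple way to act on this advice is a crude sweep over candidate learning rates, checking the loss each one reaches after a fixed budget of steps; the grid of values and the step count below are arbitrary illustrative choices, and the larger example that follows visualizes the resulting descent paths.

# Crude learning-rate sweep on the quadratic example
def loss(theta):
    return (theta - 3) ** 2

def grad(theta):
    return 2 * (theta - 3)

for eta in [0.001, 0.01, 0.1, 0.5]:   # small to moderate values
    theta = -1.0
    for _ in range(30):
        theta = theta - eta * grad(theta)
    print(f"eta={eta}: loss after 30 steps = {loss(theta):.6f}")
# The smallest rates leave the loss high after 30 steps; moderate rates drive it near zero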
import numpy as np
import matplotlib.pyplot as plt

# Quadratic loss: L(theta) = (theta - 3)^2, minimized at theta = 3
def loss(theta):
    return (theta - 3) ** 2

# Gradient dL/dtheta = 2 * (theta - 3)
def grad(theta):
    return 2 * (theta - 3)

# Dense grid of theta values for drawing the loss surface
thetas = np.linspace(-1, 7, 200)
losses = loss(thetas)

learning_rates = [0.05, 0.3, 0.8]
colors = ["blue", "orange", "red"]
labels = ["Small (0.05)", "Moderate (0.3)", "Large (0.8)"]

plt.figure(figsize=(10, 6))
plt.plot(thetas, losses, "k--", label="Loss surface")

# Run 15 gradient descent steps from theta = -1 for each learning rate
# and plot the resulting path on top of the loss surface
for lr, color, label in zip(learning_rates, colors, labels):
    theta = -1
    path = [theta]
    for _ in range(15):
        theta = theta - lr * grad(theta)
        path.append(theta)
    path = np.array(path)
    plt.plot(path, loss(path), "o-", color=color, label=f"LR {label}")

plt.xlabel("Theta")
plt.ylabel("Loss")
plt.title("Gradient Descent Paths with Different Learning Rates")
plt.legend()
plt.grid(True)
plt.show()