Convergence Proofs and Learning Rate Schedules
To rigorously understand why and when gradient descent converges, you need to examine the formal proof under convexity assumptions. Suppose you are minimizing a differentiable, convex function $f:\mathbb{R}^d \to \mathbb{R}$ with Lipschitz-continuous gradients. This means there exists a constant $L > 0$ such that for all $x, y$:
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|.$$

Assume you update parameters using the rule:
$$x_{k+1} = x_k - \eta\,\nabla f(x_k),$$

where $\eta$ is the learning rate. If you choose $\eta$ such that $0 < \eta < 2/L$, you can guarantee that the sequence $x_k$ converges to a global minimizer of $f$ (when one exists).
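To make the bound concrete, take the one-dimensional quadratic $f(x) = (x - 3)^2 + 2$ used in the demo below. Its gradient is $\nabla f(x) = 2(x - 3)$, so $\|\nabla f(x) - \nabla f(y)\| = 2\,|x - y|$ and $L = 2$, which makes the condition $\eta < 2/L = 1$. The following sketch checks this numerically; the step sizes 0.9 and 1.1 are arbitrary illustrative choices on either side of the threshold:

```python
# Minimal sketch: gradient descent on f(x) = (x - 3)**2 + 2, where L = 2,
# comparing a step size below the 2/L threshold with one above it.

def grad_f(x):
    return 2 * (x - 3)

def run_gd(eta, x0=10.0, num_steps=50):
    x = x0
    for _ in range(num_steps):
        x = x - eta * grad_f(x)
    return x

print(run_gd(eta=0.9))  # eta < 2/L = 1: iterates approach the minimizer x = 3
print(run_gd(eta=1.1))  # eta > 2/L = 1: iterates oscillate with growing amplitude
```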
Here is a sketch of the proof. The Lipschitz condition alone gives the so-called descent lemma:
$$f(x_{k+1}) \le f(x_k) - \left(\eta - \frac{L\eta^2}{2}\right)\|\nabla f(x_k)\|^2.$$

If $\eta$ is chosen as above, the coefficient $\eta - L\eta^2/2$ is positive, so the right side is strictly less than $f(x_k)$ whenever $\nabla f(x_k) \neq 0$. Thus, the function values decrease monotonically, and the gradients must approach zero as $k \to \infty$. By convexity, a vanishing gradient forces the iterates toward a minimizer of $f$. The rate of convergence is sublinear, typically $O(1/k)$ for general convex functions, and can be linear for strongly convex objectives.
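You can sanity-check the inequality numerically. The snippet below is a minimal sketch reusing the quadratic $f(x) = (x - 3)^2 + 2$ (so $L = 2$) with an arbitrary valid step size $\eta = 0.3$:

```python
# Numerically verify f(x_{k+1}) <= f(x_k) - (eta - L * eta**2 / 2) * ||grad f(x_k)||**2
# for the quadratic f(x) = (x - 3)**2 + 2, whose gradient is 2-Lipschitz (L = 2).

def f(x):
    return (x - 3) ** 2 + 2

def grad_f(x):
    return 2 * (x - 3)

L = 2.0
eta = 0.3  # satisfies 0 < eta < 2/L = 1
x = 10.0

for k in range(10):
    g = grad_f(x)
    x_next = x - eta * g
    bound = f(x) - (eta - L * eta ** 2 / 2) * g ** 2
    assert f(x_next) <= bound + 1e-12, "descent inequality violated"
    x = x_next

print("Descent inequality held for 10 steps; final f(x) =", f(x))
```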
A learning rate schedule is necessary because a fixed learning rate can cause gradient descent to oscillate or even diverge if set too high, or make convergence painfully slow if set too low. As you approach the minimum, smaller steps help you settle into the optimum without overshooting. Decaying the learning rate over time allows you to combine fast initial progress with fine-tuned convergence, adapting to the changing landscape of the loss surface. The example below compares a constant learning rate with an exponentially decaying one on the same quadratic.
```python
import numpy as np
import matplotlib.pyplot as plt

# Define a simple convex quadratic function
def f(x):
    return (x - 3) ** 2 + 2

def grad_f(x):
    return 2 * (x - 3)

x0 = 10.0  # Initial point
num_steps = 40

# Constant learning rate
eta_const = 0.3
x_const = [x0]
for _ in range(num_steps):
    x_new = x_const[-1] - eta_const * grad_f(x_const[-1])
    x_const.append(x_new)

# Decaying learning rate
eta0 = 0.5
decay = 0.9
x_decay = [x0]
for k in range(num_steps):
    eta = eta0 * (decay ** k)
    x_new = x_decay[-1] - eta * grad_f(x_decay[-1])
    x_decay.append(x_new)

steps = np.arange(num_steps + 1)
plt.plot(steps, [f(x) for x in x_const], label="Constant LR")
plt.plot(steps, [f(x) for x in x_decay], label="Decaying LR")
plt.xlabel("Step")
plt.ylabel("f(x)")
plt.title("Convergence: Constant vs. Decaying Learning Rate")
plt.legend()
plt.show()
```
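Exponential decay is just one schedule. A few other standard forms (step decay, inverse-time decay, and cosine annealing) are sketched below; the hyperparameter values are placeholders for illustration rather than recommendations:

```python
import numpy as np

eta0 = 0.5        # initial learning rate (placeholder)
total_steps = 40  # horizon of the schedule

def step_decay(k, drop=0.5, every=10):
    # Multiply the learning rate by `drop` every `every` steps
    return eta0 * (drop ** (k // every))

def inverse_time_decay(k, lam=0.1):
    # eta_k = eta0 / (1 + lam * k)
    return eta0 / (1 + lam * k)

def cosine_annealing(k, eta_min=0.0):
    # Anneal from eta0 down to eta_min along half a cosine wave
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + np.cos(np.pi * k / total_steps))

for name, sched in [("step", step_decay), ("1/t", inverse_time_decay), ("cosine", cosine_annealing)]:
    print(f"{name:>6}:", [round(sched(k), 3) for k in range(0, total_steps + 1, 10)])
```

Any of these can be dropped into the decaying-learning-rate loop above by replacing the line that computes eta at step k.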