Convergence Proofs and Learning Rate Schedules
To rigorously understand why and when gradient descent converges, you need to examine the formal proof under convexity assumptions. Suppose you are minimizing a differentiable, convex function $f : \mathbb{R}^d \to \mathbb{R}$ with Lipschitz-continuous gradients. This means there exists a constant $L > 0$ such that for all $x, y$:

$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|$$

Assume you update parameters using the rule:

$$x_{k+1} = x_k - \eta\,\nabla f(x_k)$$

where $\eta$ is the learning rate. If you choose $\eta$ such that $0 < \eta < 2/L$, you can guarantee that the sequence $x_k$ converges to a global minimum.
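The step-size condition $0 < \eta < 2/L$ can be checked numerically. The sketch below (not from the lesson) uses $f(x) = x^2$, whose gradient $2x$ is Lipschitz with constant $L = 2$, so the bound requires $\eta < 1$. Each update multiplies $x$ by $(1 - 2\eta)$, so a step just below the threshold contracts toward the minimum while a step just above it diverges:

```python
# Sketch (illustrative, not from the lesson): gradient descent on
# f(x) = x**2, whose gradient 2*x has Lipschitz constant L = 2,
# so convergence requires 0 < eta < 2/L = 1.
def grad(x):
    return 2.0 * x

def run(eta, x0=5.0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

x_good = run(eta=0.9)  # each step scales x by (1 - 2*0.9) = -0.8, so |x| shrinks
x_bad = run(eta=1.1)   # each step scales x by (1 - 2*1.1) = -1.2, so |x| grows

print(abs(x_good))  # near zero
print(abs(x_bad))   # very large
```

The same contraction argument explains why the bound is $2/L$ and not $1/L$: the iterates may overshoot and oscillate around the minimum for $1/L < \eta < 2/L$, yet the distance to it still shrinks.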
Here is a sketch of the proof. By convexity and the Lipschitz condition, you can show:
$$f(x_{k+1}) \le f(x_k) - \left(\eta - \frac{L\eta^2}{2}\right)\|\nabla f(x_k)\|^2$$

If $\eta$ is chosen as above, the right side is strictly less than $f(x_k)$ whenever $\nabla f(x_k) \neq 0$. Thus, the function values decrease monotonically, and the gradients must approach zero as $k \to \infty$. This ensures that the iterates approach a minimizer of $f$. The rate of convergence is sublinear, typically $O(1/k)$ for general convex functions, and can be linear for strongly convex objectives.
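The descent inequality above can be verified step by step. This sketch (my own illustration, under the assumption $f(x) = \|x\|^2$ with $L = 2$ and $\eta = 0.4 < 2/L$) checks that every iterate satisfies the bound:

```python
import numpy as np

# Sketch (illustrative): verify the descent inequality
#   f(x_{k+1}) <= f(x_k) - (eta - L*eta**2/2) * ||grad f(x_k)||^2
# for f(x) = ||x||^2, where the gradient 2*x has Lipschitz constant L = 2.
def f(x):
    return float(np.dot(x, x))

def grad_f(x):
    return 2.0 * x

L = 2.0
eta = 0.4           # satisfies 0 < eta < 2/L = 1
x = np.array([3.0, -1.5])
f_start = f(x)
for _ in range(10):
    g = grad_f(x)
    x_next = x - eta * g
    bound = f(x) - (eta - L * eta**2 / 2) * float(np.dot(g, g))
    assert f(x_next) <= bound + 1e-12  # descent inequality holds at every step
    x = x_next
print(f(x) < f_start)  # function values decreased monotonically
```

For this quadratic the inequality is in fact tight, which makes it a convenient sanity check for the constant $\eta - L\eta^2/2$ in the lemma.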
A learning rate schedule is necessary because a fixed learning rate can cause gradient descent to oscillate or even diverge if set too high, or make convergence painfully slow if set too low. As you approach the minimum, smaller steps help you settle into the optimum without overshooting. Decaying the learning rate over time allows you to combine fast initial progress with fine-tuned convergence, adapting to the changing landscape of the loss surface.
```python
import numpy as np
import matplotlib.pyplot as plt

# Define a simple convex quadratic function
def f(x):
    return (x - 3) ** 2 + 2

def grad_f(x):
    return 2 * (x - 3)

x0 = 10.0  # Initial point
num_steps = 40

# Constant learning rate
eta_const = 0.3
x_const = [x0]
for _ in range(num_steps):
    x_new = x_const[-1] - eta_const * grad_f(x_const[-1])
    x_const.append(x_new)

# Decaying learning rate
eta0 = 0.5
decay = 0.9
x_decay = [x0]
for k in range(num_steps):
    eta = eta0 * (decay ** k)
    x_new = x_decay[-1] - eta * grad_f(x_decay[-1])
    x_decay.append(x_new)

steps = np.arange(num_steps + 1)
plt.plot(steps, [f(x) for x in x_const], label="Constant LR")
plt.plot(steps, [f(x) for x in x_decay], label="Decaying LR")
plt.xlabel("Step")
plt.ylabel("f(x)")
plt.title("Convergence: Constant vs. Decaying Learning Rate")
plt.legend()
plt.show()
```