Convergence Proofs and Learning Rate Schedules
Mathematics of Optimization in ML: Convergence and Theoretical Guarantees

To rigorously understand why and when gradient descent converges, you need to examine the formal proof under convexity assumptions. Suppose you are minimizing a differentiable, convex function f: R^d → R with Lipschitz-continuous gradients. This means there exists a constant L > 0 such that for all x, y:

||∇f(x) − ∇f(y)|| ≤ L ||x − y||
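For a concrete sanity check, this inequality can be verified numerically. The sketch below uses the one-dimensional quadratic f(x) = (x − 3)² + 2 that appears in the code example later in this chapter; its gradient 2(x − 3) has Lipschitz constant L = 2:

```python
import numpy as np

# Gradient of f(x) = (x - 3)**2 + 2; its Lipschitz constant is L = 2,
# since |f'(x) - f'(y)| = 2 * |x - y| for all x, y.
def grad_f(x):
    return 2 * (x - 3)

L = 2.0
rng = np.random.default_rng(0)
xs = rng.uniform(-10, 10, 1000)
ys = xs + rng.uniform(0.1, 5.0, 1000)  # offset so that x != y
ratios = np.abs(grad_f(xs) - grad_f(ys)) / np.abs(xs - ys)
print(ratios.max())  # stays at (approximately) L = 2
```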

Assume you update parameters using the rule:

x_{k+1} = x_k − η ∇f(x_k)

where η is the learning rate. If you choose η such that 0 < η < 2/L, you can guarantee that the sequence {x_k} converges to a global minimum of f, provided a minimizer exists (for convex f, any local minimizer is global).
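To see the 2/L threshold in action, here is a minimal sketch. It assumes the quadratic f(x) = (x − 3)² + 2, for which L = 2, so the threshold is η < 1:

```python
# For f(x) = (x - 3)**2 + 2 the gradient is 2*(x - 3) and L = 2,
# so gradient descent is stable exactly when eta < 2/L = 1.
def grad_f(x):
    return 2 * (x - 3)

def run_gd(eta, x0=10.0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

print(run_gd(0.9))   # eta below 2/L: ends close to the minimizer x* = 3
print(run_gd(1.1))   # eta above 2/L: the iterates blow up
```

For this quadratic the update is x ↦ (1 − 2η)x + 6η, so the iterates contract toward 3 exactly when |1 − 2η| < 1, which recovers 0 < η < 1.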

Here is a sketch of the proof. By the Lipschitz condition on the gradient (via the descent lemma), you can show:

f(x_{k+1}) ≤ f(x_k) − (η − Lη²/2) ||∇f(x_k)||²

If η is chosen as above, the right side is strictly less than f(x_k) whenever ∇f(x_k) ≠ 0. Thus, the function values decrease monotonically, and the gradients must approach zero as k → ∞. This ensures that the iterates approach a minimizer of f. The rate of convergence is sublinear, typically O(1/k) for general convex functions, and it becomes linear for strongly convex objectives.
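The key inequality can also be checked numerically. The sketch below reuses the quadratic f(x) = (x − 3)² + 2 with L = 2; for a quadratic, the second-order Taylor expansion is exact, so the bound in fact holds with equality at every step:

```python
# Verify f(x_{k+1}) <= f(x_k) - (eta - L*eta**2/2) * ||grad f(x_k)||**2
# along the gradient-descent trajectory of f(x) = (x - 3)**2 + 2.
def f(x):
    return (x - 3) ** 2 + 2

def grad_f(x):
    return 2 * (x - 3)

L, eta, x = 2.0, 0.3, 10.0
for _ in range(20):
    x_next = x - eta * grad_f(x)
    bound = f(x) - (eta - L * eta ** 2 / 2) * grad_f(x) ** 2
    # The descent inequality holds (with equality for this quadratic).
    assert f(x_next) <= bound + 1e-9
    x = x_next
```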

Note

A learning rate schedule is necessary because a fixed learning rate can cause gradient descent to oscillate or even diverge if set too high, or make convergence painfully slow if set too low. As you approach the minimum, smaller steps help you settle into the optimum without overshooting. Decaying the learning rate over time allows you to combine fast initial progress with fine-tuned convergence, adapting to the changing landscape of the loss surface.

import numpy as np
import matplotlib.pyplot as plt

# Define a simple convex quadratic function
def f(x):
    return (x - 3) ** 2 + 2

def grad_f(x):
    return 2 * (x - 3)

x0 = 10.0  # Initial point
num_steps = 40

# Constant learning rate
eta_const = 0.3
x_const = [x0]
for _ in range(num_steps):
    x_new = x_const[-1] - eta_const * grad_f(x_const[-1])
    x_const.append(x_new)

# Decaying learning rate
eta0 = 0.5
decay = 0.9
x_decay = [x0]
for k in range(num_steps):
    eta = eta0 * (decay ** k)
    x_new = x_decay[-1] - eta * grad_f(x_decay[-1])
    x_decay.append(x_new)

steps = np.arange(num_steps + 1)
plt.plot(steps, [f(x) for x in x_const], label="Constant LR")
plt.plot(steps, [f(x) for x in x_decay], label="Decaying LR")
plt.xlabel("Step")
plt.ylabel("f(x)")
plt.title("Convergence: Constant vs. Decaying Learning Rate")
plt.legend()
plt.show()
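The exponential decay used above is only one option. A few other widely used schedules can be sketched as follows; the formulas are standard, but the function names and hyperparameter values here are illustrative, not tuned:

```python
import math

def step_decay(eta0, drop, every, k):
    # Multiply the rate by `drop` every `every` steps (piecewise constant).
    return eta0 * (drop ** (k // every))

def inverse_time_decay(eta0, k):
    # eta_k = eta0 / (1 + k); satisfies the classical Robbins-Monro
    # conditions sum(eta_k) = inf and sum(eta_k**2) < inf.
    return eta0 / (1 + k)

def cosine_decay(eta0, k, total):
    # Smoothly anneal from eta0 at k = 0 down to 0 at k = total.
    return eta0 * 0.5 * (1 + math.cos(math.pi * k / total))
```

Any of these can replace the `eta = eta0 * (decay ** k)` line in the loop above; the Robbins-Monro-style 1/(1 + k) schedule is the one most often quoted in theoretical convergence guarantees for stochastic gradients.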

Which of the following statements about gradient descent convergence and learning rate schedules are correct?

