Convergence Proofs and Learning Rate Schedules
To rigorously understand why and when gradient descent converges, you need to examine the formal proof under convexity assumptions. Suppose you are minimizing a differentiable, convex function $f:\mathbb{R}^d \to \mathbb{R}$ with Lipschitz-continuous gradients. This means there exists a constant $L > 0$ such that for all $x, y$:
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|.$$

Assume you update parameters using the rule:
$$x_{k+1} = x_k - \eta\,\nabla f(x_k),$$

where $\eta$ is the learning rate. If you choose $\eta$ such that $0 < \eta < 2/L$, you can guarantee that the sequence $x_k$ converges to a global minimizer of $f$ (when one exists).
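To make the bound concrete, take the one-dimensional quadratic $f(x) = (x - 3)^2 + 2$ used in the demo below. Its gradient is $\nabla f(x) = 2(x - 3)$, so $\|\nabla f(x) - \nabla f(y)\| = 2\,|x - y|$ and $L = 2$, which makes the condition $\eta < 2/L = 1$. The following sketch checks this numerically; the step sizes 0.9 and 1.1 are arbitrary illustrative choices on either side of the threshold:

```python
# Minimal sketch: gradient descent on f(x) = (x - 3)**2 + 2, where L = 2,
# comparing a step size below the 2/L threshold with one above it.

def grad_f(x):
    return 2 * (x - 3)

def run_gd(eta, x0=10.0, num_steps=50):
    x = x0
    for _ in range(num_steps):
        x = x - eta * grad_f(x)
    return x

print(run_gd(eta=0.9))  # eta < 2/L = 1: iterates approach the minimizer x = 3
print(run_gd(eta=1.1))  # eta > 2/L = 1: iterates oscillate with growing amplitude
```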
Here is a sketch of the proof. The Lipschitz condition alone gives the so-called descent lemma:
$$f(x_{k+1}) \le f(x_k) - \left(\eta - \frac{L\eta^2}{2}\right)\|\nabla f(x_k)\|^2.$$

If $\eta$ is chosen as above, the coefficient $\eta - L\eta^2/2$ is positive, so the right side is strictly less than $f(x_k)$ whenever $\nabla f(x_k) \neq 0$. Thus, the function values decrease monotonically, and the gradients must approach zero as $k \to \infty$. By convexity, a vanishing gradient forces the iterates toward a minimizer of $f$. The rate of convergence is sublinear, typically $O(1/k)$ for general convex functions, and can be linear for strongly convex objectives.
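You can sanity-check the inequality numerically. The snippet below is a minimal sketch reusing the quadratic $f(x) = (x - 3)^2 + 2$ (so $L = 2$) with an arbitrary valid step size $\eta = 0.3$:

```python
# Numerically verify f(x_{k+1}) <= f(x_k) - (eta - L * eta**2 / 2) * ||grad f(x_k)||**2
# for the quadratic f(x) = (x - 3)**2 + 2, whose gradient is 2-Lipschitz (L = 2).

def f(x):
    return (x - 3) ** 2 + 2

def grad_f(x):
    return 2 * (x - 3)

L = 2.0
eta = 0.3  # satisfies 0 < eta < 2/L = 1
x = 10.0

for k in range(10):
    g = grad_f(x)
    x_next = x - eta * g
    bound = f(x) - (eta - L * eta ** 2 / 2) * g ** 2
    assert f(x_next) <= bound + 1e-12, "descent inequality violated"
    x = x_next

print("Descent inequality held for 10 steps; final f(x) =", f(x))
```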
A learning rate schedule is necessary because a fixed learning rate can cause gradient descent to oscillate or even diverge if set too high, or make convergence painfully slow if set too low. As you approach the minimum, smaller steps help you settle into the optimum without overshooting. Decaying the learning rate over time allows you to combine fast initial progress with fine-tuned convergence, adapting to the changing landscape of the loss surface. The example below compares a constant learning rate with an exponentially decaying one on the same quadratic.
```python
import numpy as np
import matplotlib.pyplot as plt

# Define a simple convex quadratic function
def f(x):
    return (x - 3) ** 2 + 2

def grad_f(x):
    return 2 * (x - 3)

x0 = 10.0  # Initial point
num_steps = 40

# Constant learning rate
eta_const = 0.3
x_const = [x0]
for _ in range(num_steps):
    x_new = x_const[-1] - eta_const * grad_f(x_const[-1])
    x_const.append(x_new)

# Decaying learning rate
eta0 = 0.5
decay = 0.9
x_decay = [x0]
for k in range(num_steps):
    eta = eta0 * (decay ** k)
    x_new = x_decay[-1] - eta * grad_f(x_decay[-1])
    x_decay.append(x_new)

steps = np.arange(num_steps + 1)
plt.plot(steps, [f(x) for x in x_const], label="Constant LR")
plt.plot(steps, [f(x) for x in x_decay], label="Decaying LR")
plt.xlabel("Step")
plt.ylabel("f(x)")
plt.title("Convergence: Constant vs. Decaying Learning Rate")
plt.legend()
plt.show()
```
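Exponential decay is just one schedule. A few other standard forms (step decay, inverse-time decay, and cosine annealing) are sketched below; the hyperparameter values are placeholders for illustration rather than recommendations:

```python
import numpy as np

eta0 = 0.5        # initial learning rate (placeholder)
total_steps = 40  # horizon of the schedule

def step_decay(k, drop=0.5, every=10):
    # Multiply the learning rate by `drop` every `every` steps
    return eta0 * (drop ** (k // every))

def inverse_time_decay(k, lam=0.1):
    # eta_k = eta0 / (1 + lam * k)
    return eta0 / (1 + lam * k)

def cosine_annealing(k, eta_min=0.0):
    # Anneal from eta0 down to eta_min along half a cosine wave
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + np.cos(np.pi * k / total_steps))

for name, sched in [("step", step_decay), ("1/t", inverse_time_decay), ("cosine", cosine_annealing)]:
    print(f"{name:>6}:", [round(sched(k), 3) for k in range(0, total_steps + 1, 10)])
```

Any of these can be dropped into the decaying-learning-rate loop above by replacing the line that computes eta at step k.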