Learning Rate and Convergence | Gradient Descent Mechanics
Mathematics of Optimization in ML

Learning Rate and Convergence

Understanding the role of the learning rate in gradient descent is central to mastering optimization in machine learning. The learning rate, often denoted η (eta), determines the size of each update step as you descend the loss surface toward a minimum. Mathematically, the update rule for a parameter θ can be written as:

θ_{t+1} = θ_t - η ∇L(θ_t)

where ∇L(θ_t) is the gradient of the loss function at the current parameter value. The choice of η directly affects both how quickly you approach a minimum and whether you converge at all.
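
Before looking at stability, here is a minimal sketch of the update rule in plain Python. It assumes the same toy loss L(θ) = (θ - 3)^2 used in the demo further down the page; the starting point and learning rate are arbitrary illustrative choices.

# Minimal sketch of theta_{t+1} = theta_t - eta * grad L(theta_t)
# for the toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def grad(theta):
    return 2 * (theta - 3)

eta = 0.1        # learning rate (illustrative choice)
theta = -1.0     # arbitrary starting point
for _ in range(20):
    theta = theta - eta * grad(theta)

print(theta)     # moves toward the minimizer theta = 3 (about 2.95 after 20 steps)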

If the learning rate is too small, updates are tiny and the algorithm converges very slowly. If it is too large, the parameter updates may overshoot the minimum or even diverge entirely, causing the loss to oscillate or increase. For a simple quadratic loss function L(θ) = aθ^2 + bθ + c, the stability of gradient descent can be analyzed by examining the eigenvalues of the Hessian matrix (in the multidimensional case) or the second derivative in one dimension. The update becomes unstable if the learning rate exceeds 2/L, where L is the Lipschitz constant of the gradient (often the largest eigenvalue of the Hessian). For the one-dimensional quadratic above (with a > 0), the gradient's Lipschitz constant is just the second derivative 2a, so gradient descent diverges once η exceeds 2/(2a) = 1/a. Keeping η below this threshold ensures that each update moves you closer to the minimum rather than away from it.
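
To make the 2/L threshold concrete, here is a small sketch (a toy check under the same assumptions, not part of the lesson's demo) using L(θ) = (θ - 3)^2. Its gradient 2(θ - 3) has Lipschitz constant L = 2, so the critical learning rate is 2/L = 1.

# Toy check of the stability threshold eta < 2/L.
# For L(theta) = (theta - 3)^2 the gradient is 2*(theta - 3), so L = 2 and 2/L = 1.
def grad(theta):
    return 2 * (theta - 3)

for eta in [0.9, 1.1]:              # just below vs. just above the threshold
    theta = -1.0
    for _ in range(25):
        theta = theta - eta * grad(theta)
    print(f"eta = {eta}: theta after 25 steps = {theta:.3e}")

# Expected behavior: eta = 0.9 oscillates but contracts toward theta = 3,
# while eta = 1.1 oscillates with growing amplitude and diverges.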

In summary, the learning rate must be smaller than a critical value determined by the curvature of the loss surface. This balance is crucial: too low and convergence is slow; too high and convergence may not occur at all.

Note

Practical intuition: when tuning learning rates in machine learning, start with a small value (such as 0.01 or 0.001) and observe the learning curve. If convergence is slow, gradually increase the learning rate. If the loss spikes or oscillates, decrease it. Adaptive optimizers can help, but understanding the basic effect of the learning rate helps you diagnose and fix optimization issues more effectively.
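
As one possible way to turn that intuition into code, the sketch below uses a crude heuristic (an illustration only, not a specific optimizer's API): start with a deliberately aggressive learning rate on the toy quadratic loss and halve it whenever the loss goes up between steps.

# Crude "decrease on spike" heuristic on the toy loss L(theta) = (theta - 3)^2.
def loss(theta):
    return (theta - 3) ** 2

def grad(theta):
    return 2 * (theta - 3)

eta = 1.2                       # deliberately above the stability threshold of 1.0
theta = -1.0
prev_loss = loss(theta)
for _ in range(30):
    theta = theta - eta * grad(theta)
    current = loss(theta)
    if current > prev_loss:     # loss spiked: shrink the step size
        eta *= 0.5
    prev_loss = current

print(f"final eta = {eta:.2f}, theta = {theta:.4f}")   # eta drops to 0.6; theta ends near 3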

import numpy as np
import matplotlib.pyplot as plt

# Quadratic loss: L(theta) = (theta - 3)^2, minimized at theta = 3
def loss(theta):
    return (theta - 3) ** 2

def grad(theta):
    return 2 * (theta - 3)

# Loss surface for plotting
thetas = np.linspace(-1, 7, 200)
losses = loss(thetas)

learning_rates = [0.05, 0.3, 0.8]
colors = ["blue", "orange", "red"]
labels = ["Small (0.05)", "Moderate (0.3)", "Large (0.8)"]

plt.figure(figsize=(10, 6))
plt.plot(thetas, losses, "k--", label="Loss surface")

# Run 15 gradient descent steps from the same starting point for each learning rate
for lr, color, label in zip(learning_rates, colors, labels):
    theta = -1
    path = [theta]
    for _ in range(15):
        theta = theta - lr * grad(theta)
        path.append(theta)
    path = np.array(path)
    plt.plot(path, loss(path), "o-", color=color, label=f"LR {label}")

plt.xlabel("Theta")
plt.ylabel("Loss")
plt.title("Gradient Descent Paths with Different Learning Rates")
plt.legend()
plt.grid(True)
plt.show()

Which of the following statements best describes the effect of increasing the learning rate in gradient descent?

Select the correct answer
