Mathematics of Optimization in ML

Taylor Expansion and Gradient Descent

To understand the mechanics of gradient descent, you need to see how Taylor expansion connects the gradient of a function to the steps you take during optimization. When you have a differentiable function $f(x)$ that you want to minimize, you can use the Taylor expansion to approximate the function near a point $x$. The first-order Taylor expansion of $f$ at $x$ for a small change $\Delta x$ is:

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x$$

Here, $\nabla f(x)$ is the gradient of $f$ at $x$, and $\Delta x$ is a small vector representing how you move from $x$. The term $\nabla f(x)^\top \Delta x$ tells you how much $f$ will increase or decrease if you move in the direction of $\Delta x$.
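To see how accurate this linear approximation is, here is a minimal sketch (the quadratic below and the step $\Delta x$ are illustrative choices, not part of the original lesson) comparing the true change in $f$ with the first-order Taylor prediction $\nabla f(x)^\top \Delta x$:

```python
import numpy as np

# Illustrative quadratic: f(x) = x1^2 + 3*x2^2 (an assumed example function)
def f(x):
    return x[0]**2 + 3 * x[1]**2

# Its gradient: [2*x1, 6*x2]
def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 2.0])      # current point
dx = np.array([0.01, -0.02])  # a small step

true_change = f(x + dx) - f(x)  # exact change in f
taylor_change = grad_f(x) @ dx  # first-order prediction: grad f(x)^T dx

print(f"True change:          {true_change:.6f}")
print(f"First-order estimate: {taylor_change:.6f}")
```

For a small $\Delta x$ the two numbers nearly agree; the gap is the second-order term that the approximation drops.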

To decrease the value of $f(x)$, you want to choose $\Delta x$ so that $f(x + \Delta x)$ is less than $f(x)$. The Taylor expansion shows that the change in $f$ is roughly $\nabla f(x)^\top \Delta x$, so you should pick a $\Delta x$ that makes this term negative: any direction whose dot product with the gradient is negative will do. Among steps of a fixed small length, the steepest decrease happens when $\Delta x$ points exactly opposite the gradient, so you set:

$$\Delta x = -\alpha \, \nabla f(x)$$

where $\alpha$ is a small positive number called the learning rate. Plugging this back into the Taylor expansion gives:

$$f(x - \alpha \nabla f(x)) \approx f(x) - \alpha \|\nabla f(x)\|^2$$

This shows that a small step against the gradient reduces the function value, and, to first order, the decrease is proportional to the squared norm of the gradient.
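As a quick numerical check (a sketch reusing the illustrative quadratic from above, with an assumed learning rate $\alpha = 0.01$), the first-order prediction $\alpha \|\nabla f(x)\|^2$ closely matches the actual drop in $f$:

```python
import numpy as np

# Same illustrative quadratic as above: f(x) = x1^2 + 3*x2^2
def f(x):
    return x[0]**2 + 3 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 2.0])
alpha = 0.01  # assumed small learning rate

g = grad_f(x)
x_new = x - alpha * g  # one step against the gradient

actual_decrease = f(x) - f(x_new)
predicted_decrease = alpha * np.dot(g, g)  # alpha * ||grad f(x)||^2

print(f"Actual decrease:    {actual_decrease:.6f}")
print(f"Predicted decrease: {predicted_decrease:.6f}")
```

Shrinking $\alpha$ makes the two values agree even more closely, since the neglected second-order term scales with $\alpha^2$.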

The update rule for gradient descent is then:

$$x \leftarrow x - \alpha \nabla f(x)$$

You repeat this step, updating $x$ each time, until the gradient is close to zero or you reach a stopping criterion.
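Wrapping the update rule in a loop gives the full algorithm. The sketch below runs it on the one-dimensional quadratic $f(x) = (x - 3)^2$ from the lesson's code example further down; the tolerance and iteration cap are assumed values chosen for illustration:

```python
def f(x):
    return (x - 3)**2

def grad_f(x):
    return 2 * (x - 3)  # derivative of (x - 3)**2

x = 0.0
alpha = 0.1       # learning rate
tol = 1e-6        # stop once |gradient| falls below this (assumed threshold)
max_iters = 1000  # safety cap on the number of iterations (assumed)

for i in range(max_iters):
    g = grad_f(x)
    if abs(g) < tol:
        break
    x = x - alpha * g  # the gradient descent update

print(f"Stopped after {i} iterations: x = {x:.6f}, f(x) = {f(x):.2e}")
```

With this function and $\alpha = 0.1$, each iteration multiplies the distance to the minimizer $x = 3$ by the constant factor $1 - 2\alpha = 0.8$, so the iterates converge geometrically.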

Note

Intuitive summary: the gradient points in the direction of steepest increase of the objective. By moving in the exact opposite direction, you take the fastest path downhill, ensuring that the objective decreases most rapidly for a small step. This is why gradient descent works: each update moves you toward a local minimum by following the negative gradient.
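To make the "steepest direction" claim concrete, here is a small experiment (illustrative, not from the original lesson) that takes steps of identical length along several unit directions and compares the resulting change in $f$; the normalized negative gradient wins:

```python
import numpy as np

# Illustrative quadratic: f(x) = x1^2 + 3*x2^2
def f(x):
    return x[0]**2 + 3 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 2.0])
step = 0.01  # fixed small step length for every direction

g = grad_f(x)
steepest = -g / np.linalg.norm(g)  # unit vector opposite the gradient

# A few unit directions to compare against the normalized negative gradient
directions = {
    "negative gradient": steepest,
    "[-1, 0]": np.array([-1.0, 0.0]),
    "[0, -1]": np.array([0.0, -1.0]),
    "[-1, -1]/sqrt(2)": np.array([-1.0, -1.0]) / np.sqrt(2),
}

for name, d in directions.items():
    change = f(x + step * d) - f(x)
    print(f"{name:>20}: change in f = {change:+.6f}")
```

The ranking follows directly from the Taylor expansion: for unit vectors $d$, the first-order change is $\nabla f(x)^\top d$, which is minimized at $d = -\nabla f(x) / \|\nabla f(x)\|$.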

```python
# Define a simple quadratic function: f(x) = (x - 3)**2
def f(x):
    return (x - 3)**2

# Its gradient: f'(x) = 2*(x - 3)
def grad_f(x):
    return 2 * (x - 3)

# Start at x = 0
x = 0.0
learning_rate = 0.1

# Compute gradient at x
gradient = grad_f(x)

# Take one gradient descent step
x_new = x - learning_rate * gradient

print(f"Initial x: {x}")
print(f"Gradient at x: {gradient}")
print(f"Updated x after one step: {x_new}")
print(f"Function value before: {f(x)}")
print(f"Function value after: {f(x_new)}")
```
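Running this snippet: the gradient at $x = 0$ is $2(0 - 3) = -6$, so the update moves $x$ to $0 - 0.1 \cdot (-6) = 0.6$, and the function value falls from $9$ to $5.76$, exactly as the negative-gradient argument predicts.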

Which statement best describes the gradient descent update rule as derived from the Taylor expansion?

Select the correct answer

