Taylor Expansion and Gradient Descent
To understand the mechanics of gradient descent, you need to see how Taylor expansion connects the gradient of a function to the steps you take during optimization. When you have a differentiable function f(x) that you want to minimize, you can use the Taylor expansion to approximate the function near a point x. The first-order Taylor expansion of f at x for a small change Δx is:
f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x

Here, ∇f(x) is the gradient of f at x, and Δx is a small vector representing how you move from x. The term ∇f(x)ᵀΔx tells you how much f will increase or decrease if you move in the direction of Δx.
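You can check this approximation numerically. Below is a minimal sketch, assuming an illustrative two-variable quadratic (the function, point, and step size are choices for demonstration):

import numpy as np

# Illustrative function (an assumption for this sketch):
# f(x) = x1^2 + 3*x2^2, with gradient [2*x1, 6*x2]
def f(x):
    return x[0]**2 + 3 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 2.0])
dx = np.array([0.01, -0.02])       # a small step

exact = f(x + dx)
approx = f(x) + grad_f(x) @ dx     # first-order Taylor estimate

print(f"exact:  {exact:.6f}")
print(f"approx: {approx:.6f}")

For a step this small, the two printed values agree to several decimal places; as you shrink dx further, the approximation gets tighter.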
To decrease the value of f(x), you want to choose Δx so that f(x+Δx) is less than f(x). The Taylor expansion shows that the change in f is roughly ∇f(x)ᵀΔx, so you should pick Δx to make this term negative: move in a direction whose dot product with the gradient is negative. The steepest decrease happens when Δx points exactly opposite to the gradient; by the Cauchy–Schwarz inequality, among all steps of a fixed length, ∇f(x)ᵀΔx is most negative when Δx is a negative multiple of ∇f(x). So you set:
\Delta x = -\alpha \nabla f(x)

where α is a small positive number called the learning rate. Plugging this back into the Taylor expansion gives:
f(x - \alpha \nabla f(x)) \approx f(x) - \alpha \|\nabla f(x)\|^2

This shows that moving a small step against the gradient reduces the function value, and the amount of decrease is proportional to the squared norm of the gradient.
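You can verify this relationship numerically as well. A minimal sketch, again using an illustrative quadratic and an assumed small learning rate:

import numpy as np

# Same illustrative function as before (an assumption for this sketch)
def f(x):
    return x[0]**2 + 3 * x[1]**2

def grad_f(x):
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, 2.0])
alpha = 0.001                                # small assumed learning rate

g = grad_f(x)
actual_decrease = f(x) - f(x - alpha * g)
predicted_decrease = alpha * np.dot(g, g)    # alpha * ||grad f(x)||^2

print(f"actual decrease:    {actual_decrease:.6f}")
print(f"predicted decrease: {predicted_decrease:.6f}")

For a small α the two values nearly match, which is exactly what the first-order expansion predicts.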
The update rule for gradient descent is then:
x \leftarrow x - \alpha \nabla f(x)

You repeat this step, updating x each time, until the gradient is close to zero or you reach a stopping criterion.
Intuitive summary: the gradient points in the direction of steepest increase of the objective. By moving in the exact opposite direction, you take the fastest path downhill: for a sufficiently small step, no other direction of the same length decreases the objective faster. This is why gradient descent works: each update follows the negative gradient, and repeated updates drive x toward a local minimum.
import numpy as np

# Define a simple quadratic function: f(x) = (x - 3)**2
def f(x):
    return (x - 3)**2

# Its gradient: f'(x) = 2*(x - 3)
def grad_f(x):
    return 2 * (x - 3)

# Start at x = 0
x = 0.0
learning_rate = 0.1

# Compute gradient at x
gradient = grad_f(x)

# Take one gradient descent step
x_new = x - learning_rate * gradient

print(f"Initial x: {x}")
print(f"Gradient at x: {gradient}")
print(f"Updated x after one step: {x_new}")
print(f"Function value before: {f(x)}")
print(f"Function value after: {f(x_new)}")
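Building on the single-step example above, here is a minimal sketch of the full loop, with an assumed gradient tolerance and iteration cap as the stopping criterion:

def f(x):
    return (x - 3)**2

def grad_f(x):
    return 2 * (x - 3)

x = 0.0
learning_rate = 0.1

# Repeat x <- x - alpha * grad f(x) until the gradient is close to
# zero, or stop at an (assumed) iteration cap of 1000.
for step in range(1000):
    gradient = grad_f(x)
    if abs(gradient) < 1e-8:     # stopping criterion: gradient near zero
        break
    x = x - learning_rate * gradient

print(f"Stopped at x = {x:.6f} after {step} steps, f(x) = {f(x):.2e}")

For this quadratic, each update shrinks the distance to the minimizer x = 3 by a constant factor of 1 - 2·(learning rate) = 0.8, so the loop stops well before the cap.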