Learn Gradient Descent | Mathematical Analysis
Mathematics for Data Science

Gradient Descent

Note
Definition

Gradient Descent is an optimization algorithm that minimizes a function by iteratively adjusting its parameters in the direction of the steepest decrease. It is fundamental in machine learning for enabling models to learn efficiently from data.

Understanding Gradients

The gradient of a function represents the direction and steepness of the function at a given point. It tells us which way to move to minimize the function.

For a simple function:

J(\theta) = \theta^2

The derivative (gradient) is:

\nabla J(\theta) = \frac{d}{d\theta}\left(\theta^2\right) = 2\theta

This means that for any value of \theta, the gradient tells us how to adjust \theta to descend toward the minimum.
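As a minimal sketch of this idea, the function and its analytical gradient can be written as two small Python functions (the names `J` and `grad_J` are illustrative, not part of any library):

```python
def J(theta):
    """Objective function J(theta) = theta**2."""
    return theta ** 2

def grad_J(theta):
    """Analytical gradient dJ/dtheta = 2 * theta."""
    return 2 * theta

# The sign of the gradient points uphill, so we move against it:
print(grad_J(3.0))   # 6.0  (positive, so decrease theta)
print(grad_J(-3.0))  # -6.0 (negative, so increase theta)
```

Note that the gradient is positive to the right of the minimum and negative to the left, which is exactly why subtracting it always moves us downhill.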

Gradient Descent Formula

The weight update rule is:

ΞΈβ†ΞΈβˆ’Ξ±βˆ‡J(ΞΈ)\theta \larr \theta - \alpha \nabla J(\theta)

Where:

  • \theta - model parameter;
  • \alpha - learning rate (step size);
  • \nabla J(\theta) - gradient of the function we're aiming to minimize.

For our function:

\theta_{\text{new}} = \theta_{\text{old}} - \alpha\left(2\theta_{\text{old}}\right)

This means we update \theta iteratively by subtracting the scaled gradient.
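The update rule above can be sketched as a one-step helper (the function name `gradient_step` is an assumption for illustration):

```python
def gradient_step(theta, alpha):
    """One update for J(theta) = theta**2:
    theta <- theta - alpha * grad J(theta), with grad J(theta) = 2 * theta."""
    grad = 2 * theta
    return theta - alpha * grad

print(gradient_step(3.0, 0.3))  # one step from theta = 3 lands near 1.2
```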

Stepwise Movement – A Visual

Example with starting values \theta = 3, \alpha = 0.3:

  1. \theta_1 = 3 - 0.3(2 \times 3) = 3 - 1.8 = 1.2;
  2. \theta_2 = 1.2 - 0.3(2 \times 1.2) = 1.2 - 0.72 = 0.48;
  3. \theta_3 = 0.48 - 0.3(2 \times 0.48) = 0.48 - 0.288 = 0.192;
  4. \theta_4 = 0.192 - 0.3(2 \times 0.192) = 0.192 - 0.115 = 0.077.

After a few iterations, we move toward \theta = 0, the minimum.
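The four steps above can be reproduced with a short loop (a minimal sketch, assuming the same start values \theta = 3 and \alpha = 0.3):

```python
theta, alpha = 3.0, 0.3

# Each pass applies theta <- theta - alpha * (2 * theta),
# matching the worked iterations 1 through 4 above.
for step in range(1, 5):
    theta = theta - alpha * (2 * theta)
    print(f"theta_{step} = {theta:.3f}")
```

Because each step multiplies \theta by the constant factor (1 - 2\alpha) = 0.4, the iterates shrink geometrically toward zero.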

Learning Rate – Choosing Ξ± Wisely

  • Too large \alpha - overshoots, never converges;
  • Too small \alpha - converges too slowly;
  • Optimal \alpha - balances speed and accuracy.
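These three regimes can be compared numerically (a sketch; the helper name `run` and the sample values 0.01, 0.3, and 1.1 are illustrative choices):

```python
def run(alpha, steps=10, theta=3.0):
    """Apply `steps` gradient updates for J(theta) = theta**2."""
    for _ in range(steps):
        theta -= alpha * (2 * theta)
    return theta

print(run(0.01))  # too small: after 10 steps, still far from 0
print(run(0.3))   # well chosen: very close to 0
print(run(1.1))   # too large: |1 - 2*alpha| > 1, iterates diverge
```

For this quadratic, each step scales \theta by (1 - 2\alpha), so convergence requires |1 - 2\alpha| < 1, i.e. 0 < \alpha < 1.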

When Does Gradient Descent Stop?

Gradient descent stops when:

\nabla J(\theta) \approx 0

This means that further updates are insignificant and we've found a minimum.
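Putting the pieces together, a full descent loop with this stopping condition might look as follows (a minimal sketch; the function name `gradient_descent` and the tolerance `tol` are assumptions for illustration):

```python
def gradient_descent(theta=3.0, alpha=0.3, tol=1e-6, max_iter=1000):
    """Minimize J(theta) = theta**2, stopping when the gradient is near zero."""
    for i in range(max_iter):
        grad = 2 * theta
        if abs(grad) < tol:      # gradient ~ 0: further updates are insignificant
            break
        theta -= alpha * grad    # theta <- theta - alpha * grad J(theta)
    return theta, i

theta_min, iterations = gradient_descent()
print(theta_min, iterations)  # theta_min is very close to 0
```

In practice the loop also caps the number of iterations (`max_iter` here), since the gradient may never become exactly zero in floating-point arithmetic.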


SectionΒ 3. ChapterΒ 9

