Understanding Gradient Descent

The Basics of Gradient Descent

by Ruslan Shudra

Data Scientist

Jul, 2024
9 min read


What is an optimization problem

In mathematics, an optimization problem is a type of mathematical problem where the objective is to find the best solution among all possible feasible solutions. The goal is to optimize (maximize or minimize) a certain objective function, subject to a set of constraints.

The solution to an optimization problem is a set of values for the decision variables that satisfies the constraints and optimizes the objective function. Depending on the problem and its constraints, the optimal solution may be unique, or there may be several.
The simplest optimization problem can be defined as follows:

F(x) → min,   x ∈ R^n

What is gradient descent

Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, i.e., the direction of the negative gradient.
Here, the gradient is the vector of all partial derivatives of the function:

grad F(x) = (∂F/∂x_1, ∂F/∂x_2, ..., ∂F/∂x_n)

The gradient indicates the direction and rate of the steepest increase of the function, so the negative gradient points in the direction of the steepest decrease.
In gradient descent, parameters are adjusted in small steps, determined by the learning rate, to reduce the objective function's value. This process continues until the algorithm converges to a local or global minimum.
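As a small illustration (the function here is a hypothetical example chosen for this sketch), consider F(x, y) = x² + y²: its gradient points away from the minimum at the origin, so one small step along the negative gradient reduces the function's value.

```python
import numpy as np

# Hypothetical example function: F(x, y) = x^2 + y^2
def F(p):
    return p[0] ** 2 + p[1] ** 2

def grad_F(p):
    # Partial derivatives: dF/dx = 2x, dF/dy = 2y
    return np.array([2 * p[0], 2 * p[1]])

p = np.array([3.0, 4.0])
g = grad_F(p)            # points in the direction of steepest increase
p_new = p - 0.1 * g      # small step along the negative gradient (learning rate 0.1)
# F(p) = 25.0, F(p_new) = 16.0 -- the function value decreased
```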


Algorithm description

The problem of minimizing a function F(x_1, ..., x_n) can be solved by constructing the following sequence of approximations:

x_{k+1} = x_k − η · grad F(x_k),   k = 0, 1, 2, ...

We choose an initial point x_0 and a value η representing the learning rate (step size) of gradient descent, and then start the iterative process according to the formula above.

Stop criteria of the algorithm

The criteria for stopping iterations can be as follows:

  1. Stop the algorithm after a certain number of iterations;
  2. Iterate until the following condition is met:
||x_{k+1} − x_k|| < eps

Note that values such as eps = 10^(-6) or eps = 10^(-9) are commonly used as the stop criterion of the iteration process.
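Putting the update formula and both stop criteria together, a minimal sketch in Python might look like this (the function names and the test function are illustrative assumptions, not from the original):

```python
import numpy as np

def gradient_descent(grad_F, x0, lr=0.1, eps=1e-6, max_iter=10_000):
    """Iterate x_{k+1} = x_k - lr * grad_F(x_k) until a stop criterion fires."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):                 # criterion 1: iteration cap
        x_new = x - lr * grad_F(x)
        if np.linalg.norm(x_new - x) < eps:   # criterion 2: steps became tiny
            return x_new
        x = x_new
    return x

# Example: F(x) = (x - 3)^2 has grad F(x) = 2(x - 3) and a minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
# x_min is approximately [3.0]
```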

We have to pay attention to two important features of the gradient descent method:

  1. This method can only find a minimum point of the function F(x). If you want to find a maximum point, apply gradient descent to the function -F(x);
  2. If we compare gradient descent with the algorithm discussed earlier, we can see that it performs a task similar to that algorithm's first stage: finding a critical point, which is only a potential minimum. As a result, the point found by gradient descent may be a local minimum within some subset of the domain, or not a minimum point at all.

Gradient descent variants

Gradient descent has several variants that address different challenges and improve its efficiency and convergence. Here are some key gradient descent variants:

Momentum gradient descent

Momentum gradient descent incorporates a fraction of the previous update into the current update, which helps smooth out oscillations and speed up convergence.

v_t = β · v_{t-1} + grad F(x_{t-1})
x_t = x_{t-1} − η · v_t

where:

  • v_t is the velocity vector at iteration t;
  • β is the momentum parameter, typically in the range 0 < β < 1, controlling the contribution of the previous velocity v_{t-1} relative to the current gradient grad F(x_{t-1});
  • x_t and x_{t-1} represent the parameter vectors at iterations t and t-1, respectively;
  • η denotes the learning rate, determining the step size in the direction of v_t;
  • grad F(x_{t-1}) represents the gradient of the objective function F with respect to the parameters x at iteration t-1.

It facilitates faster convergence and reduces oscillations in parameter updates, leading to more stable training dynamics.
On the other hand, implementation requires tuning an additional hyperparameter, the momentum term, which can affect performance if not optimized correctly.
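A minimal sketch of the momentum update in Python (applied to a simple one-dimensional test function of our own choosing):

```python
import numpy as np

def momentum_descent(grad_F, x0, lr=0.01, beta=0.9, n_iter=500):
    # v_t = beta * v_{t-1} + grad_F(x_{t-1});  x_t = x_{t-1} - lr * v_t
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iter):
        v = beta * v + grad_F(x)   # accumulate a fraction of previous updates
        x = x - lr * v             # step in the direction of the velocity
    return x

# Minimize F(x) = (x - 3)^2 with grad F(x) = 2(x - 3)
x_min = momentum_descent(lambda x: 2 * (x - 3), x0=[0.0])
```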

RMSprop

RMSprop (Root Mean Square Propagation) is an optimization algorithm commonly used in training neural networks. It adapts the learning rate for each parameter based on the average of recent squared gradients, helping to mitigate the vanishing or exploding gradient problem. By smoothing out updates, RMSprop can improve training stability and convergence speed in deep learning models.

v_t = β · v_{t-1} + (1 − β) · (grad F(x_{t-1}))²
x_t = x_{t-1} − η · grad F(x_{t-1}) / (sqrt(v_t) + ε)

where:

  • v_t is an exponentially decaying average of squared gradients at iteration t;
  • β is a decay rate parameter (typically close to 1, e.g., β = 0.9);
  • x_t and x_{t-1} are the parameter vectors at iterations t and t-1;
  • η is the learning rate;
  • grad F(x_{t-1}) is the gradient of the function F with respect to x at iteration t-1;
  • ε is a small constant (e.g., 10^(-8)) added to the denominator to prevent division by zero.

It maintains a more balanced and efficient learning rate, but requires tuning of the decay rate hyperparameter.
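A minimal RMSprop sketch in Python (the test function and hyperparameter values are illustrative assumptions):

```python
import numpy as np

def rmsprop(grad_F, x0, lr=0.01, beta=0.9, eps=1e-8, n_iter=2000):
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iter):
        g = grad_F(x)
        v = beta * v + (1 - beta) * g ** 2    # decaying average of squared gradients
        x = x - lr * g / (np.sqrt(v) + eps)   # per-parameter adaptive step
    return x

# Minimize F(x) = (x - 3)^2 with grad F(x) = 2(x - 3)
x_min = rmsprop(lambda x: 2 * (x - 3), x0=[0.0])
```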

Adam

Adam (Adaptive Moment Estimation) is an optimization algorithm that combines the advantages of momentum and adaptive learning rates. It maintains exponentially decaying averages of past gradients (first moment) and squared gradients (second moment). These estimates are used to adaptively adjust the learning rate for each parameter, facilitating faster convergence and stable training dynamics in deep learning models.

m_t = β1 · m_{t-1} + (1 − β1) · grad F(x_{t-1})
v_t = β2 · v_{t-1} + (1 − β2) · (grad F(x_{t-1}))²
m_hat_t = m_t / (1 − β1^t),   v_hat_t = v_t / (1 − β2^t)
x_t = x_{t-1} − η · m_hat_t / (sqrt(v_hat_t) + ε)

where:

  • m_t and v_t are estimates of the first moment (mean) and second moment (uncentered variance) of the gradients, respectively;
  • β1 and β2 are the decay rates for the first moment and second moment estimates (typically close to 1, e.g., β1 = 0.9, β2 = 0.999);
  • m_hat_t and v_hat_t are bias-corrected estimates of m_t and v_t;
  • η is the learning rate;
  • ε is a small constant (e.g., 10^-8) to prevent division by zero.
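Combining both ideas, a minimal Adam sketch in Python (the test function and hyperparameter values are illustrative assumptions):

```python
import numpy as np

def adam(grad_F, x0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=3000):
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # first moment: decaying mean of gradients
    v = np.zeros_like(x)   # second moment: decaying mean of squared gradients
    for t in range(1, n_iter + 1):
        g = grad_F(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Minimize F(x) = (x - 3)^2 with grad F(x) = 2(x - 3)
x_min = adam(lambda x: 2 * (x - 3), x0=[0.0])
```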

Disadvantages of gradient descent methods

Gradient descent methods, while widely used, have several limitations:

  • Local minima and saddle points: gradient descent can get stuck in local minima or saddle points, where the gradient is zero or small in multiple dimensions, slowing down convergence;
  • Sensitivity to learning rate: the choice of learning rate is crucial; if it's too high, gradient descent can overshoot the minimum, and if it's too low, convergence can be slow;
  • High-dimensional spaces: in such spaces, gradients may vary widely, making convergence challenging and requiring careful tuning of hyperparameters;
  • Non-convex optimization: for non-convex functions, gradient descent may not find the global minimum but can get trapped in local minima or plateaus.
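The learning-rate sensitivity is easy to demonstrate on a toy function of our own choosing, F(x) = x²: the update becomes x ← (1 − 2η)x, which shrinks toward the minimum only when |1 − 2η| < 1, i.e., η < 1.

```python
def run_gd(lr, x0=1.0, n_iter=50):
    # F(x) = x^2, grad F(x) = 2x, so each step multiplies x by (1 - 2*lr)
    x = x0
    for _ in range(n_iter):
        x = x - lr * 2 * x
    return x

good = run_gd(lr=0.4)   # |1 - 0.8| = 0.2 < 1: converges toward the minimum at 0
bad = run_gd(lr=1.1)    # |1 - 2.2| = 1.2 > 1: each step overshoots and diverges
```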


FAQs

Q: What is gradient descent?

A: Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of the steepest descent of the function.

Q: How does gradient descent work?

A: It computes the gradient (derivative) of the loss function with respect to the model's parameters and updates the parameters in the opposite direction of the gradient to minimize the loss.

Q: What is the role of the learning rate in gradient descent?

A: The learning rate determines the size of steps taken during each iteration of gradient descent. A higher learning rate can lead to faster convergence, but it may also cause instability or overshooting of the optimal solution.

Q: What are some strategies to improve gradient descent performance?

A: Strategies include using adaptive learning rates (e.g., Adam, RMSprop), momentum to accelerate convergence, batch normalization to stabilize learning, and careful initialization of model parameters.
