Gradient Descent: The Core Optimizer
At the heart of neural network optimization lies gradient descent, a method designed to minimize a loss function by iteratively updating the network's parameters. The mathematical foundation of gradient descent is rooted in calculus: you compute the gradient (the vector of partial derivatives) of the loss with respect to each parameter, which points in the direction of steepest ascent. By moving in the opposite direction of this gradient, you descend toward a local minimum. The update rule for each parameter $w$ at iteration $t$ is:
$$w_{t+1} = w_t - \eta \cdot \frac{\partial L}{\partial w_t}$$

Here, $w_t$ is the current parameter value, $\eta$ is the learning rate, and $\frac{\partial L}{\partial w_t}$ is the gradient of the loss $L$ with respect to $w_t$. The learning rate controls the step size of each update, determining how far you move in the direction opposite the gradient. This process is repeated until the loss function converges to a minimum, ideally resulting in a well-trained neural network.
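To make the update rule concrete, here is a minimal sketch of a single gradient descent step on the one-dimensional loss $L(w) = (w - 3)^2$, whose gradient is $\frac{\partial L}{\partial w} = 2(w - 3)$. The loss function, starting point, and learning rate here are illustrative choices, not part of the lesson's example:

```python
# Minimal sketch: one gradient descent step on L(w) = (w - 3)^2.
# The loss function, starting point, and learning rate are illustrative choices.
def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)            # dL/dw

w = 0.0                           # initial parameter value
eta = 0.1                         # learning rate

w_next = w - eta * grad(w)        # w_{t+1} = w_t - eta * dL/dw_t
print(w, loss(w))                 # 0.0 9.0
print(w_next, loss(w_next))       # 0.6 5.76
```

Repeating this step moves $w$ toward the minimizer at $w = 3$, and the steps shrink automatically because the gradient itself shrinks near the minimum. The full example below applies the same update rule to fit a line $y = wx + b$ to noisy data and compares several learning rates: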
```python
import numpy as np
import matplotlib.pyplot as plt

# Generate toy data: y = 2x + 1 with noise
np.random.seed(42)
X = np.linspace(0, 1, 100)
y = 2 * X + 1 + np.random.normal(0, 0.1, size=X.shape)

learning_rates = [0.01, 0.1, 0.5]
epochs = 100

loss_histories = []
labels = []

for lr in learning_rates:
    # Initialize weights and bias for each run
    w = np.random.randn()
    b = np.random.randn()
    loss_history = []
    for epoch in range(epochs):
        y_pred = w * X + b
        loss = np.mean((y - y_pred) ** 2)
        loss_history.append(loss)
        dw = -2 * np.mean((y - y_pred) * X)
        db = -2 * np.mean(y - y_pred)
        w = w - lr * dw
        b = b - lr * db
    loss_histories.append(loss_history)
    labels.append(f"learning rate = {lr}")

# Plot training loss for each learning rate
plt.figure(figsize=(8, 4))
for loss_history, label in zip(loss_histories, labels):
    plt.plot(loss_history, label=label)
plt.title('Training Loss for Different Learning Rates')
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error Loss')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
```
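For reference, the dw and db expressions inside the loop are the analytic gradients of the mean squared error loss used in this example:

$$
L(w, b) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (w x_i + b)\bigr)^2,\qquad
\frac{\partial L}{\partial w} = -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - (w x_i + b)\bigr)\,x_i,\qquad
\frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - (w x_i + b)\bigr)
$$

which is exactly what dw and db compute with np.mean over the residuals y - y_pred.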
The learning rate is a critical hyperparameter in gradient descent. If you set it too high, you risk overshooting the minimum, causing the loss to diverge or oscillate without settling. If it is too low, convergence becomes painfully slow, and the optimizer may stall on plateaus or in shallow local minima. By experimenting with the code sample above, you can observe how changing the values in the learning_rates list affects the speed and stability of optimization: a moderate learning rate typically gives smooth, efficient convergence, while extreme values lead to unstable or needlessly slow training.
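To see the high-learning-rate failure mode directly, the following sketch reuses the same toy data with a deliberately oversized step size (the value 2.0 is an arbitrary illustrative choice, not taken from the lesson). With this setting, each update overshoots the minimum and the loss grows from epoch to epoch:

```python
import numpy as np

# Sketch: same toy data as above, but with an intentionally oversized learning rate.
# lr = 2.0 is an illustrative value; any sufficiently large step size diverges here.
np.random.seed(42)
X = np.linspace(0, 1, 100)
y = 2 * X + 1 + np.random.normal(0, 0.1, size=X.shape)

w, b = 0.0, 0.0
lr = 2.0

for epoch in range(5):
    y_pred = w * X + b
    loss = np.mean((y - y_pred) ** 2)
    dw = -2 * np.mean((y - y_pred) * X)
    db = -2 * np.mean(y - y_pred)
    w -= lr * dw
    b -= lr * db
    print(f"epoch {epoch}: loss = {loss:.2f}")

# The printed loss increases every epoch: each step lands farther from the
# minimum than the previous one, which is the divergence described above.
```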