Loss Functions and Gradient Behavior
When you train a machine learning model, the loss function you select determines not only the direction in which the model's parameters are updated, but also the magnitude of those updates. This is because the gradient of the loss function with respect to the model's parameters guides the optimization process. To understand this, consider three common loss functions: mean squared error (MSE), mean absolute error (MAE), and cross-entropy.
The mean squared error loss is defined as the average of the squared differences between predictions and targets. Its gradient with respect to the prediction is proportional to the error itself—specifically, it grows linearly as the error increases. This means that the further a prediction is from the target, the larger the gradient, and thus the larger the update during training. This property can help the optimizer quickly correct large errors, but also makes the model sensitive to outliers.
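To make this concrete, here is a minimal NumPy sketch (the values are purely illustrative) computing the per-prediction MSE gradient; notice that it scales linearly with the error:

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error over a batch of predictions."""
    return np.mean((y_pred - y_true) ** 2)

def mse_grad(y_pred, y_true):
    """Gradient of MSE with respect to each prediction: 2 * error / n."""
    return 2 * (y_pred - y_true) / y_pred.size

y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 2.0, 10.0])   # small, medium, and large errors

print(mse_grad(y_pred, y_true))       # gradients grow with the error: [0.33, 1.33, 6.67]
```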
The mean absolute error loss, by contrast, has a gradient of constant magnitude (it is undefined at exactly zero error), regardless of how large the error is. This means that even for large errors, the update magnitude remains the same. As a result, MAE is less sensitive to outliers, but it can lead to slower convergence because it does not "push" as hard to correct large mistakes.
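A matching sketch for MAE (same illustrative values as above) shows that the gradient magnitude stays fixed no matter how far off the prediction is:

```python
import numpy as np

def mae_grad(y_pred, y_true):
    """Gradient of MAE with respect to each prediction: sign(error) / n."""
    return np.sign(y_pred - y_true) / y_pred.size

y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 2.0, 10.0])   # same small, medium, and large errors

print(mae_grad(y_pred, y_true))       # constant magnitude regardless of error: [0.33, 0.33, 0.33]
```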
Cross-entropy loss, especially in classification tasks, produces gradients that depend on the predicted probability. When the model is very confident but wrong, the gradient can be quite large, leading to strong corrections. When the model is already close to the correct answer, the gradient becomes small, allowing for fine-tuning. This dynamic is crucial for fast and stable learning in classification problems.
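The sketch below assumes a softmax output layer with cross-entropy loss, in which case the gradient with respect to the logits reduces to the predicted probabilities minus the one-hot target: a confident wrong prediction yields a gradient near one on the true class, while a confident correct prediction yields gradients near zero.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_grad(logits, target_index):
    """Gradient of softmax cross-entropy w.r.t. the logits: softmax(logits) - one_hot(target)."""
    grad = softmax(logits)
    grad[target_index] -= 1.0
    return grad

# Confident but wrong: gradient on the true class is close to -1.0
print(cross_entropy_grad(np.array([8.0, 0.0, 0.0]), target_index=1))

# Confident and correct: gradients are close to zero everywhere
print(cross_entropy_grad(np.array([0.0, 8.0, 0.0]), target_index=1))
```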
The shape and behavior of these gradients have significant implications for learning dynamics. Loss functions with gradients that grow quickly can speed up convergence for large errors but risk instability or exploding gradients. Those with small or constant gradients are more stable but may converge slowly, especially if the model starts far from the optimum.
The choice of loss function directly impacts the magnitude and direction of gradients during training. This, in turn, affects how quickly and reliably your model converges. Selecting an appropriate loss function is essential for balancing speed, stability, and robustness in optimization.
Problems such as vanishing or exploding gradients often arise from the mathematical properties of the chosen loss function. With MSE, if the model's predictions are very far from the targets, the gradients can become very large, causing the optimizer to take overly aggressive steps—this is known as the exploding gradient problem. Conversely, with loss functions whose gradients shrink rapidly as the prediction improves (such as cross-entropy with softmax for confident, correct predictions), the updates can become extremely small, leading to vanishing gradients and slow learning. MAE's constant gradient avoids both extremes, but its lack of sensitivity to error magnitude can make optimization less efficient, especially in early training when large errors are common.
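A hand-constructed toy example (a single weight, plain gradient descent, and an illustrative learning rate, none of which come from the text above) of how these two extremes can show up:

```python
import numpy as np

# Toy setup: one weight w, one input x, target y, prediction y_hat = w * x.
x, y = 10.0, 1.0
lr = 0.1

# MSE: gradient is 2 * (w*x - y) * x, so a large error combined with a large
# input produces a huge step and the weight overshoots (exploding behavior).
w = 5.0
grad_mse = 2 * (w * x - y) * x        # = 2 * 49 * 10 = 980
print(w - lr * grad_mse)              # w jumps from 5.0 to -93.0

# Cross-entropy with a sigmoid output: gradient is (sigmoid(w*x) - y) * x.
# For a confident, correct prediction the gradient is nearly zero (vanishing behavior).
w = 2.0
p = 1 / (1 + np.exp(-w * x))          # sigmoid(20) is essentially 1.0
grad_ce = (p - y) * x                 # roughly -2e-8
print(w - lr * grad_ce)               # w barely moves
```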
Understanding these behaviors is critical for diagnosing training problems and selecting loss functions that match your data and task. For instance, if your model is not learning because gradients are too small, you might consider a loss function with steeper gradients for large errors. If your model's weights are diverging, a loss function with more moderate gradients may help stabilize training.