Modern Loss Variations: Focal Loss and Label Smoothing
In modern machine learning, especially in classification tasks, you often encounter challenges such as class imbalance and model overconfidence. To address these, practitioners turn to loss variations such as focal loss and label smoothing. Focal loss is particularly effective for datasets where some classes are much less frequent than others, while label smoothing is a regularization technique that helps models generalize better by preventing them from becoming too confident in their predictions.
The focal loss modifies the standard cross-entropy loss to reduce the relative loss for well-classified examples and focus more on hard, misclassified ones. Mathematically, for a binary classification problem, the focal loss is defined as:
FL(p_t) = −α_t (1 − p_t)^γ log(p_t)

where:
- p_t is the predicted probability for the true class;
- α_t is a weighting factor to balance positive and negative classes;
- γ (gamma) is the focusing parameter that adjusts the rate at which easy examples are down-weighted.
The term (1 − p_t)^γ acts as a modulating factor. When an example is misclassified and p_t is small, this factor is near 1 and the loss is nearly unaffected. As p_t increases (the example is correctly classified with high confidence), the factor approaches zero, shrinking the loss contribution from these easy examples. This mechanism encourages the model to focus on learning from hard examples that are currently being misclassified.
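As a concrete illustration, here is a minimal PyTorch sketch of binary focal loss following the formula above; the function name and the default values α = 0.25 and γ = 2 are illustrative choices, not fixed by the definition.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  raw model outputs, shape (N,)
    targets: ground-truth labels in {0, 1}, shape (N,), float
    """
    # Standard binary cross-entropy already equals -log(p_t) per example
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t: predicted probability assigned to the true class
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    # alpha_t: class-balancing weight (alpha for positives, 1 - alpha for negatives)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, well-classified examples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example usage with random data
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(binary_focal_loss(logits, targets))
```

Setting γ = 0 removes the modulating factor and reduces this to (α-weighted) binary cross-entropy, which is a quick sanity check for the implementation.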
Focal loss places greater emphasis on hard, misclassified examples by reducing the loss contribution from easy, well-classified ones. This is especially useful in imbalanced datasets where the model might otherwise be overwhelmed by the majority class. Label smoothing, on the other hand, prevents the model from becoming overconfident by softening the target labels. Instead of training the model to assign all probability to a single class, label smoothing encourages the model to spread some probability mass to other classes, which can lead to better generalization and improved calibration.
Label smoothing is a simple yet powerful regularization technique for classification. Normally, the target label for class k is represented as a one-hot vector: 1 for the correct class and 0 for all others. With label smoothing, you modify the target so that the correct class is assigned a value slightly less than 1, and the remaining probability is distributed among the other classes. For example, with a smoothing parameter ε, the new target for the correct class becomes 1 − ε, and each incorrect class receives ε/(K − 1), where K is the number of classes.
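As a sketch of this construction, the hypothetical helper `smooth_targets` below (assuming a PyTorch setting) builds the smoothed distribution with 1 − ε on the true class and ε/(K − 1) elsewhere, then evaluates cross-entropy against it. PyTorch's built-in `nn.CrossEntropyLoss(label_smoothing=...)` offers similar functionality, though its convention spreads ε uniformly over all K classes rather than only the K − 1 incorrect ones.

```python
import torch
import torch.nn.functional as F

def smooth_targets(labels, num_classes, epsilon=0.1):
    """Soft targets: 1 - epsilon for the true class, epsilon/(K - 1) for the rest."""
    off_value = epsilon / (num_classes - 1)
    targets = torch.full((labels.size(0), num_classes), off_value)
    # Place 1 - epsilon at each example's true-class position
    targets.scatter_(1, labels.unsqueeze(1), 1.0 - epsilon)
    return targets

def smoothed_cross_entropy(logits, labels, epsilon=0.1):
    # Cross-entropy against soft targets: -sum_k q_k * log(p_k), averaged over the batch
    log_probs = F.log_softmax(logits, dim=1)
    targets = smooth_targets(labels, logits.size(1), epsilon)
    return -(targets * log_probs).sum(dim=1).mean()

# Example usage: 4 samples, 5 classes
logits = torch.randn(4, 5)
labels = torch.randint(0, 5, (4,))
print(smoothed_cross_entropy(logits, labels, epsilon=0.1))
```

Setting ε = 0 recovers the ordinary one-hot targets and standard cross-entropy.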
This approach discourages the model from becoming overly confident in its predictions, which can improve generalization and reduce susceptibility to overfitting. By making the targets less certain, label smoothing helps the model learn more robust representations and can lead to better performance on unseen data.