
Loss Function

In training a neural network, it is necessary to measure how accurately the model predicts the correct results. This is done using a loss function, which calculates the difference between the model’s predictions and the actual target values. The objective of training is to minimize this loss, making the predictions as close as possible to the true outputs.

For binary classification tasks, one of the most widely used loss functions is the cross-entropy loss, which is particularly effective for models that output probabilities.

Derivation of Cross-Entropy Loss

To understand the cross-entropy loss, consider the maximum likelihood principle. In a binary classification problem, the goal is to train a model that estimates the probability \hat{y} that a given input belongs to class 1. The true label y can take one of two values: 0 or 1.

An effective model should assign high probabilities to correct predictions. This idea is formalized through the likelihood function, which represents the probability of observing the actual data given the model’s predictions.

For a single training example, assuming independence, the likelihood can be expressed as:

P(y|x) = \hat{y}^{y} (1 - \hat{y})^{1 - y}

This expression means the following:

  • If y = 1, then P(y|x) = \hat{y}, so the model should assign a high probability to class 1;
  • If y = 0, then P(y|x) = 1 - \hat{y}, so the model should assign a high probability to class 0.

In both cases, the objective is to maximize the probability that the model assigns to the correct class.

Note

P(y|x) denotes the probability of observing the actual class label y given the input x.
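As a quick check of the expression above, it can be evaluated for both possible labels. The variable names below (y_hat, likelihood_if_y_is_1, likelihood_if_y_is_0) are illustrative and not part of the lesson's code:

y_hat = 0.9  # the model's predicted probability of class 1

# The single expression y_hat**y * (1 - y_hat)**(1 - y) covers both labels:
likelihood_if_y_is_1 = y_hat**1 * (1 - y_hat)**(1 - 1)  # equals y_hat, i.e. 0.9
likelihood_if_y_is_0 = y_hat**0 * (1 - y_hat)**(1 - 0)  # equals 1 - y_hat, i.e. 0.1

print(likelihood_if_y_is_1, likelihood_if_y_is_0)  # 0.9 and approximately 0.1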

To simplify optimization, the log-likelihood is used instead of the likelihood function because taking the logarithm converts products into sums, making differentiation more straightforward:

\log P(y|x) = y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})
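Spelling out the intermediate step, using the standard identities \log(ab) = \log a + \log b and \log(a^b) = b \log a:

\log P(y|x) = \log\big(\hat{y}^{y} (1 - \hat{y})^{1 - y}\big) = y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})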

Since training aims to maximize the log-likelihood, the loss function is defined as its negative value so that the optimization process becomes a minimization problem:

L = -\big(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\big)

This is the binary cross-entropy loss function, commonly used for classification problems.

Given that the output variable represents \hat{y} for a particular training example, and the target variable represents y for the same example, this loss function can be implemented as follows:

import numpy as np

# output is the predicted probability y-hat; target is the true label y (0 or 1)
loss = -(target * np.log(output) + (1 - target) * np.log(1 - output))
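In practice the predicted probability can reach exactly 0 or 1, where the logarithm is undefined. One common safeguard, shown here as a sketch rather than as part of the lesson's implementation, is to clip the output before taking the logarithm; the values of output, target, and eps below are illustrative assumptions:

import numpy as np

# Illustrative values (not from the lesson): a confident but wrong prediction
output = 1.0   # predicted probability y-hat
target = 0     # true label y

eps = 1e-12  # small constant keeping the probability strictly inside (0, 1)
clipped = np.clip(output, eps, 1 - eps)
loss = -(target * np.log(clipped) + (1 - target) * np.log(1 - clipped))
print(loss)  # a large but finite loss instead of infinity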

Why This Formula?

Cross-entropy loss has a clear intuitive interpretation:

  • If y = 1, the loss simplifies to -\log(\hat{y}), meaning the loss is low when \hat{y} is close to 1 and very high when \hat{y} is close to 0;
  • If y = 0, the loss simplifies to -\log(1 - \hat{y}), meaning the loss is low when \hat{y} is close to 0 and very high when it is close to 1.

Because the logarithm tends to negative infinity as its argument approaches zero, confident but incorrect predictions are penalized heavily, encouraging the model to make confident, correct predictions.
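A few concrete values, chosen purely for illustration, show how quickly the penalty grows when the true label is 1 and the prediction drifts toward 0:

import numpy as np

# Loss for the case y = 1, i.e. -log(y_hat), at several predicted probabilities
for y_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"y_hat = {y_hat:.2f} -> loss = {-np.log(y_hat):.3f}")

# Prints losses of roughly 0.010, 0.105, 0.693, 2.303 and 4.605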

If multiple examples are passed during forward propagation, the total loss is computed as the average loss across all examples:

L = -\frac{1}{N} \sum_{i=1}^{N} \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)

where N is the number of training samples.
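As a sketch of how this average could be computed with NumPy (the arrays below are illustrative placeholders, not data from the lesson):

import numpy as np

targets = np.array([1, 0, 1, 1])          # true labels y_i
outputs = np.array([0.9, 0.2, 0.7, 0.4])  # predicted probabilities y-hat_i

# Elementwise binary cross-entropy, then the mean over the N examples
losses = -(targets * np.log(outputs) + (1 - targets) * np.log(1 - outputs))
loss = np.mean(losses)
print(loss)  # roughly 0.40 for these placeholder values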
