
Loss Function

In training a neural network, it is necessary to measure how accurately the model predicts the correct results. This is done using a loss function, which calculates the difference between the model’s predictions and the actual target values. The objective of training is to minimize this loss, making the predictions as close as possible to the true outputs.

For binary classification tasks, one of the most widely used loss functions is the cross-entropy loss, which is particularly effective for models that output probabilities.

Derivation of Cross-Entropy Loss

To understand the cross-entropy loss, consider the maximum likelihood principle. In a binary classification problem, the goal is to train a model that estimates the probability \hat{y} that a given input belongs to class 1. The true label y can take one of two values: 0 or 1.

An effective model should assign high probabilities to correct predictions. This idea is formalized through the likelihood function, which represents the probability of observing the actual data given the model’s predictions.

For a single training example, assuming independence, the likelihood can be expressed as:

P(y|x) = \hat{y}^{y} (1 - \hat{y})^{1 - y}

This expression means the following:

  • If y = 1, then P(y|x) = \hat{y}, so the model should assign a high probability to class 1;
  • If y = 0, then P(y|x) = 1 - \hat{y}, so the model should assign a high probability to class 0.

In both cases, the objective is to maximize the probability that the model assigns to the correct class.

Note

P(y|x) denotes the probability of observing the actual class label y given the input x.
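As a quick check of the expression above, it can be evaluated for both possible labels. The variable names below (y_hat, likelihood_if_y_is_1, likelihood_if_y_is_0) are illustrative and not part of the lesson's code:

y_hat = 0.9  # the model's predicted probability of class 1

# The single expression y_hat**y * (1 - y_hat)**(1 - y) covers both labels:
likelihood_if_y_is_1 = y_hat**1 * (1 - y_hat)**(1 - 1)  # equals y_hat, i.e. 0.9
likelihood_if_y_is_0 = y_hat**0 * (1 - y_hat)**(1 - 0)  # equals 1 - y_hat, i.e. 0.1

print(likelihood_if_y_is_1, likelihood_if_y_is_0)  # 0.9 and approximately 0.1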

To simplify optimization, the log-likelihood is used instead of the likelihood function because taking the logarithm converts products into sums, making differentiation more straightforward:

\log P(y|x) = y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})
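Spelling out the intermediate step, using the standard identities \log(ab) = \log a + \log b and \log(a^b) = b \log a:

\log P(y|x) = \log\big(\hat{y}^{y} (1 - \hat{y})^{1 - y}\big) = y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})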

Since training aims to maximize the log-likelihood, the loss function is defined as its negative value so that the optimization process becomes a minimization problem:

L = -\big(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\big)

This is the binary cross-entropy loss function, commonly used for classification problems.

Given that the output variable represents \hat{y} for a particular training example, and the target variable represents y for the same example, this loss function can be implemented as follows:

import numpy as np

# output is the predicted probability y-hat; target is the true label y (0 or 1)
loss = -(target * np.log(output) + (1 - target) * np.log(1 - output))
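In practice the predicted probability can reach exactly 0 or 1, where the logarithm is undefined. One common safeguard, shown here as a sketch rather than as part of the lesson's implementation, is to clip the output before taking the logarithm; the values of output, target, and eps below are illustrative assumptions:

import numpy as np

# Illustrative values (not from the lesson): a confident but wrong prediction
output = 1.0   # predicted probability y-hat
target = 0     # true label y

eps = 1e-12  # small constant keeping the probability strictly inside (0, 1)
clipped = np.clip(output, eps, 1 - eps)
loss = -(target * np.log(clipped) + (1 - target) * np.log(1 - clipped))
print(loss)  # a large but finite loss instead of infinity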

Why This Formula?

Cross-entropy loss has a clear intuitive interpretation:

  • If y = 1, the loss simplifies to -\log(\hat{y}), meaning the loss is low when \hat{y} is close to 1 and very high when \hat{y} is close to 0;
  • If y = 0, the loss simplifies to -\log(1 - \hat{y}), meaning the loss is low when \hat{y} is close to 0 and very high when it is close to 1.

Because the logarithm tends to negative infinity as its argument approaches zero, confident but incorrect predictions are penalized heavily, encouraging the model to make confident, correct predictions.
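A few concrete values, chosen purely for illustration, show how quickly the penalty grows when the true label is 1 and the prediction drifts toward 0:

import numpy as np

# Loss for the case y = 1, i.e. -log(y_hat), at several predicted probabilities
for y_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"y_hat = {y_hat:.2f} -> loss = {-np.log(y_hat):.3f}")

# Prints losses of roughly 0.010, 0.105, 0.693, 2.303 and 4.605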

If multiple examples are passed during forward propagation, the total loss is computed as the average loss across all examples:

L = -\frac{1}{N} \sum_{i=1}^{N} \Big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \Big)

where N is the number of training samples.
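As a sketch of how this average could be computed with NumPy (the arrays below are illustrative placeholders, not data from the lesson):

import numpy as np

targets = np.array([1, 0, 1, 1])          # true labels y_i
outputs = np.array([0.9, 0.2, 0.7, 0.4])  # predicted probabilities y-hat_i

# Elementwise binary cross-entropy, then the mean over the N examples
losses = -(targets * np.log(outputs) + (1 - targets) * np.log(1 - outputs))
loss = np.mean(losses)
print(loss)  # roughly 0.40 for these placeholder values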
