Learn Backward Propagation | Neural Network from Scratch
Introduction to Neural Networks

Backward Propagation

Backward propagation, or backpropagation, is the process of determining how the loss function changes with respect to each parameter in the neural network. The goal is to adjust these parameters in a way that reduces the overall loss.

This process relies on the gradient descent algorithm, which uses the derivatives of the loss with respect to each layer’s pre-activation values (the raw outputs before applying the activation function) and propagates them backward through the network.

Since every layer contributes to the final prediction, the gradients are computed step by step:

  1. Perform forward propagation to obtain the outputs;
  2. Compute the derivative of the loss with respect to the output pre-activation;
  3. Propagate this derivative backward through the layers using the chain rule;
  4. Compute and use the gradients for weights and biases to update them during training.
Note

Gradients represent the rate of change of a function with respect to its inputs, meaning they are its derivatives. They indicate how much a small change in weights, biases, or activations affects the loss function, guiding the model's learning process through gradient descent.
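As a minimal sketch of what a gradient measures, the derivative of a toy one-parameter loss can be checked numerically with a finite difference (the loss function here is a made-up example, not the network's loss):

```python
# For L(w) = (w - 3)^2 the exact derivative is dL/dw = 2 * (w - 3).
# A central finite difference approximates the same rate of change.
def loss(w):
    return (w - 3.0) ** 2

w = 5.0
eps = 1e-6
numeric_grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
exact_grad = 2 * (w - 3.0)  # = 4.0
```

The two values agree closely, which is exactly what gradient descent relies on: the gradient tells us which direction (and roughly how fast) the loss changes as a parameter moves.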

Notation

To make the explanation clearer, let's use the following notation:

  • W^l is the weight matrix of layer l;
  • b^l is the vector of biases of layer l;
  • z^l is the vector of pre-activations of layer l;
  • a^l is the vector of activations of layer l.

Therefore, setting a^0 to x (the inputs), forward propagation in a perceptron with n layers can be described as the following sequence of operations:

\begin{aligned} a^0 &= x, & &\dots & &\dots\\ z^1 &= W^1 a^0 + b^1, & z^l &= W^l a^{l-1} + b^l, & z^n &= W^n a^{n-1} + b^n,\\ a^1 &= f^1(z^1), & a^l &= f^l(z^l), & a^n &= f^n(z^n),\\ &\dots & &\dots & \hat y &= a^n. \end{aligned}
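This forward pass can be sketched in NumPy for a toy two-layer network; the sigmoid activation and all shapes and values below are illustrative choices, not fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Activations are column vectors (n_neurons x 1), as in the notation above.
x = rng.normal(size=(3, 1))                    # a^0 = x, 3 input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # layer 1: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # layer 2: 1 output

z1 = W1 @ x + b1        # pre-activations of layer 1
a1 = sigmoid(z1)        # activations of layer 1
z2 = W2 @ a1 + b2       # pre-activations of the output layer
y_hat = sigmoid(z2)     # a^2 = y_hat
```

Each layer repeats the same two steps: an affine map `W @ a + b`, then an element-wise activation.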

To describe backpropagation mathematically, let's introduce the following notation:

  • da^l: derivative of the loss with respect to the activations at layer l;
  • dz^l: derivative of the loss with respect to the pre-activations at layer l (before applying the activation function);
  • dW^l: derivative of the loss with respect to the weights at layer l;
  • db^l: derivative of the loss with respect to the biases at layer l.

Computing Gradients for the Output Layer

At the final layer n, the first step is to compute the gradient of the loss with respect to the activations of the output layer, denoted as da^n.

Then, using the chain rule, the gradient of the loss with respect to the pre-activations of the output layer is calculated as:

dz^n = da^n \odot f'^n(z^n)

Here, f'^n(z^n) represents the derivative of the activation function at layer n, and the symbol \odot denotes element-wise multiplication.

Note

The symbol \odot denotes element-wise multiplication, which means each element of one vector is multiplied by the corresponding element of another vector. In contrast, the symbol \cdot represents the dot product, used for standard matrix or vector multiplication. The term f'^n refers to the derivative of the activation function at the output layer.

This value indicates how sensitive the loss function is to changes in the pre-activation values of the output layer.
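As a concrete sketch, assuming a sigmoid output activation (so f'(z) = sigmoid(z)(1 - sigmoid(z))) and an MSE-style loss — both illustrative choices, not fixed by the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_n = np.array([[0.5], [-1.2]])   # pre-activations of the output layer
a_n = sigmoid(z_n)                # output activations
y = np.array([[1.0], [0.0]])      # targets

da_n = 2 * (a_n - y)              # gradient of an MSE-style loss w.r.t. a^n
# dz^n = da^n ⊙ f'^n(z^n), element-wise
dz_n = da_n * sigmoid(z_n) * (1 - sigmoid(z_n))
```

Note that `*` in NumPy is element-wise multiplication, matching \odot, while `@` is the matrix product, matching \cdot.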

After computing dzndz^n, the next step is to calculate the gradients for the weights and biases:

\begin{aligned} dW^n &= dz^n \cdot (a^{n-1})^T,\\ db^n &= dz^n \end{aligned}

These gradients describe how much each weight and bias in the output layer should be adjusted to reduce the loss.

Here, (a^{n-1})^T is the transposed activation vector from the previous layer. If the original vector has the shape n_{\text{neurons}} \times 1, its transpose has the shape 1 \times n_{\text{neurons}}.
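The outer-product shape logic can be checked with made-up values (2 output neurons, 4 neurons in the previous layer; all numbers are illustrative):

```python
import numpy as np

dz_n = np.array([[0.3], [-0.1]])                  # (2, 1)
a_prev = np.array([[0.9], [0.2], [0.5], [0.7]])   # a^{n-1}, shape (4, 1)

# dW^n = dz^n · (a^{n-1})^T: (2, 1) @ (1, 4) -> (2, 4), same shape as W^n
dW_n = dz_n @ a_prev.T
db_n = dz_n                                       # db^n has the shape of dz^n
```

Each entry dW_n[i, j] is simply dz_n[i] * a_prev[j]: the error at output neuron i scaled by the input that weight W[i, j] received.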

To continue the backward propagation, the derivative of the loss with respect to the activations of the previous layer is computed as:

da^{n-1} = (W^n)^T \cdot dz^n

This expression allows the error signal to be passed backward through the network, enabling the adjustment of earlier layers during training.
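A small sketch of this step, with illustrative shapes (2 output neurons, 3 neurons in the previous layer):

```python
import numpy as np

W_n = np.array([[0.2, -0.5, 0.1],
                [0.4,  0.3, -0.2]])   # W^n, shape (2, 3)
dz_n = np.array([[0.6], [-0.3]])      # shape (2, 1)

# da^{n-1} = (W^n)^T · dz^n: (3, 2) @ (2, 1) -> (3, 1),
# matching the shape of a^{n-1}
da_prev = W_n.T @ dz_n
```

The transpose maps the error signal back through the same connections used in the forward pass, so each previous-layer neuron accumulates error in proportion to the weights it feeds forward through.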

Propagating Gradients to the Hidden Layers

For each hidden layer l, the procedure is the same. Given da^l:

  1. Compute the derivative of the loss with respect to the pre-activations;
  2. Compute the gradients for the weights and biases;
  3. Compute da^{l-1} to propagate the derivative backward.
\begin{aligned} dz^l &= da^l \odot f'^l(z^l),\\ dW^l &= dz^l \cdot (a^{l-1})^T,\\ db^l &= dz^l,\\ da^{l-1} &= (W^l)^T \cdot dz^l. \end{aligned}

This process is repeated for each preceding layer, step by step, until the input layer is reached.
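The full backward sweep can be sketched as a single loop over layers; sigmoid activations, an MSE-style loss, and the layer sizes are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
sizes = [3, 4, 4, 1]    # input, two hidden layers, output
W = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
b = [np.zeros((sizes[i + 1], 1)) for i in range(3)]

# Forward pass, storing z^l and a^l for every layer
a = [rng.normal(size=(3, 1))]   # a[0] = x
z = []
for l in range(3):
    z.append(W[l] @ a[-1] + b[l])
    a.append(sigmoid(z[-1]))

# Backward pass: repeat the four equations for each layer, last to first
y = np.array([[1.0]])
da = 2 * (a[-1] - y)            # da^n for an MSE-style loss
dW, db = [None] * 3, [None] * 3
for l in reversed(range(3)):
    dz = da * sigmoid(z[l]) * (1 - sigmoid(z[l]))   # dz^l = da^l ⊙ f'(z^l)
    dW[l] = dz @ a[l].T         # a[l] here plays the role of a^{l-1}
    db[l] = dz
    da = W[l].T @ dz            # becomes da^{l-1} for the next iteration
```

Each gradient matrix ends up with exactly the shape of the parameter it corresponds to, which is what makes the update step possible.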

Updating Weights and Biases

After computing the gradients for all layers, the weights and biases are updated using the gradient descent algorithm:

\begin{aligned} W^l &= W^l - \alpha \cdot dW^l,\\ b^l &= b^l - \alpha \cdot db^l. \end{aligned}

Here, \alpha is the learning rate, a hyperparameter that controls how much the weights and biases are adjusted during each update step.
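The update rule itself is a one-liner per parameter; the values below are made up purely to show the arithmetic:

```python
import numpy as np

alpha = 0.1                         # learning rate
W = np.array([[1.0, -2.0]])
b = np.array([[0.5]])
dW = np.array([[0.2, 0.4]])
db = np.array([[-0.1]])

W = W - alpha * dW                  # -> [[0.98, -2.04]]
b = b - alpha * db                  # -> [[0.51]]
```

Each parameter moves a small step against its gradient: a positive gradient means increasing the parameter would increase the loss, so the parameter is decreased, and vice versa.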


SectionΒ 2. ChapterΒ 7
