Learn Backward Propagation | Neural Network from Scratch
Introduction to Neural Networks

Backward Propagation

Backward propagation, or backpropagation, is the process of determining how the loss function changes with respect to each parameter in the neural network. The goal is to adjust these parameters in a way that reduces the overall loss.

This process relies on the gradient descent algorithm, which uses the derivatives of the loss with respect to each layer’s pre-activation values (the raw outputs before applying the activation function) and propagates them backward through the network.

Since every layer contributes to the final prediction, the gradients are computed step by step:

  1. Perform forward propagation to obtain the outputs;
  2. Compute the derivative of the loss with respect to the output pre-activation;
  3. Propagate this derivative backward through the layers using the chain rule;
  4. Compute and use the gradients for weights and biases to update them during training.
Note

Gradients represent the rate of change of a function with respect to its inputs, meaning they are its derivatives. They indicate how much a small change in weights, biases, or activations affects the loss function, guiding the model's learning process through gradient descent.
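As a minimal sketch of what a gradient measures, the derivative of a toy one-parameter loss can be checked numerically with a finite difference (the loss function here is a made-up example, not the network's loss):

```python
# For L(w) = (w - 3)^2 the exact derivative is dL/dw = 2 * (w - 3).
# A central finite difference approximates the same rate of change.
def loss(w):
    return (w - 3.0) ** 2

w = 5.0
eps = 1e-6
numeric_grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
exact_grad = 2 * (w - 3.0)  # = 4.0
```

The two values agree closely, which is exactly what gradient descent relies on: the gradient tells us which direction (and roughly how fast) the loss changes as a parameter moves.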

Notation

To make the explanation clearer, let's use the following notation:

  • W^l is the weight matrix of layer l;
  • b^l is the vector of biases of layer l;
  • z^l is the vector of pre-activations of layer l;
  • a^l is the vector of activations of layer l.

Therefore, setting a^0 to x (the inputs), forward propagation in a perceptron with n layers can be described as the following sequence of operations:

\begin{aligned} a^0 &= x, & &\dots & &\dots\\ z^1 &= W^1 a^0 + b^1, & z^l &= W^l a^{l-1} + b^l, & z^n &= W^n a^{n-1} + b^n,\\ a^1 &= f^1(z^1), & a^l &= f^l(z^l), & a^n &= f^n(z^n),\\ &\dots & &\dots & \hat y &= a^n. \end{aligned}
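This forward pass can be sketched in NumPy for a toy two-layer network; the sigmoid activation and all shapes and values below are illustrative choices, not fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Activations are column vectors (n_neurons x 1), as in the notation above.
x = rng.normal(size=(3, 1))                    # a^0 = x, 3 input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # layer 1: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # layer 2: 1 output

z1 = W1 @ x + b1        # pre-activations of layer 1
a1 = sigmoid(z1)        # activations of layer 1
z2 = W2 @ a1 + b2       # pre-activations of the output layer
y_hat = sigmoid(z2)     # a^2 = y_hat
```

Each layer repeats the same two steps: an affine map `W @ a + b`, then an element-wise activation.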

To describe backpropagation mathematically, let's introduce the following notation:

  • da^l: derivative of the loss with respect to the activations at layer l;
  • dz^l: derivative of the loss with respect to the pre-activations at layer l (before applying the activation function);
  • dW^l: derivative of the loss with respect to the weights at layer l;
  • db^l: derivative of the loss with respect to the biases at layer l.

Computing Gradients for the Output Layer

At the final layer n, the first step is to compute the gradient of the loss with respect to the activations of the output layer, denoted as da^n.

Then, using the chain rule, the gradient of the loss with respect to the pre-activations of the output layer is calculated as:

dz^n = da^n \odot f'^n(z^n)

Here, f'^n(z^n) represents the derivative of the activation function at layer n, and the symbol \odot denotes element-wise multiplication.

Note

The symbol \odot denotes element-wise multiplication, which means each element of one vector is multiplied by the corresponding element of another vector. In contrast, the symbol \cdot represents the dot product, used for standard matrix or vector multiplication. The term f'^n refers to the derivative of the activation function at the output layer.

This value indicates how sensitive the loss function is to changes in the pre-activation values of the output layer.
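As a concrete sketch, assuming a sigmoid output activation (so f'(z) = sigmoid(z)(1 - sigmoid(z))) and an MSE-style loss — both illustrative choices, not fixed by the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_n = np.array([[0.5], [-1.2]])   # pre-activations of the output layer
a_n = sigmoid(z_n)                # output activations
y = np.array([[1.0], [0.0]])      # targets

da_n = 2 * (a_n - y)              # gradient of an MSE-style loss w.r.t. a^n
# dz^n = da^n ⊙ f'^n(z^n), element-wise
dz_n = da_n * sigmoid(z_n) * (1 - sigmoid(z_n))
```

Note that `*` in NumPy is element-wise multiplication, matching \odot, while `@` is the matrix product, matching \cdot.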

After computing dzndz^n, the next step is to calculate the gradients for the weights and biases:

\begin{aligned} dW^n &= dz^n \cdot (a^{n-1})^T,\\ db^n &= dz^n \end{aligned}

These gradients describe how much each weight and bias in the output layer should be adjusted to reduce the loss.

Here, (a^{n-1})^T is the transposed activation vector from the previous layer. If the original vector has the shape n_{\text{neurons}} \times 1, its transpose has the shape 1 \times n_{\text{neurons}}.
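The outer-product shape logic can be checked with made-up values (2 output neurons, 4 neurons in the previous layer; all numbers are illustrative):

```python
import numpy as np

dz_n = np.array([[0.3], [-0.1]])                  # (2, 1)
a_prev = np.array([[0.9], [0.2], [0.5], [0.7]])   # a^{n-1}, shape (4, 1)

# dW^n = dz^n · (a^{n-1})^T: (2, 1) @ (1, 4) -> (2, 4), same shape as W^n
dW_n = dz_n @ a_prev.T
db_n = dz_n                                       # db^n has the shape of dz^n
```

Each entry dW_n[i, j] is simply dz_n[i] * a_prev[j]: the error at output neuron i scaled by the input that weight W[i, j] received.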

To continue the backward propagation, the derivative of the loss with respect to the activations of the previous layer is computed as:

da^{n-1} = (W^n)^T \cdot dz^n

This expression allows the error signal to be passed backward through the network, enabling the adjustment of earlier layers during training.
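A small sketch of this step, with illustrative shapes (2 output neurons, 3 neurons in the previous layer):

```python
import numpy as np

W_n = np.array([[0.2, -0.5, 0.1],
                [0.4,  0.3, -0.2]])   # W^n, shape (2, 3)
dz_n = np.array([[0.6], [-0.3]])      # shape (2, 1)

# da^{n-1} = (W^n)^T · dz^n: (3, 2) @ (2, 1) -> (3, 1),
# matching the shape of a^{n-1}
da_prev = W_n.T @ dz_n
```

The transpose maps the error signal back through the same connections used in the forward pass, so each previous-layer neuron accumulates error in proportion to the weights it feeds forward through.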

Propagating Gradients to the Hidden Layers

For each hidden layer l, the procedure is the same. Given da^l:

  1. Compute the derivative of the loss with respect to the pre-activations;
  2. Compute the gradients for the weights and biases;
  3. Compute da^{l-1} to propagate the derivative backward.
\begin{aligned} dz^l &= da^l \odot f'^l(z^l),\\ dW^l &= dz^l \cdot (a^{l-1})^T,\\ db^l &= dz^l,\\ da^{l-1} &= (W^l)^T \cdot dz^l. \end{aligned}

This process is repeated for each preceding layer, step by step, until the input layer is reached.
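The full backward sweep can be sketched as a single loop over layers; sigmoid activations, an MSE-style loss, and the layer sizes are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
sizes = [3, 4, 4, 1]    # input, two hidden layers, output
W = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
b = [np.zeros((sizes[i + 1], 1)) for i in range(3)]

# Forward pass, storing z^l and a^l for every layer
a = [rng.normal(size=(3, 1))]   # a[0] = x
z = []
for l in range(3):
    z.append(W[l] @ a[-1] + b[l])
    a.append(sigmoid(z[-1]))

# Backward pass: repeat the four equations for each layer, last to first
y = np.array([[1.0]])
da = 2 * (a[-1] - y)            # da^n for an MSE-style loss
dW, db = [None] * 3, [None] * 3
for l in reversed(range(3)):
    dz = da * sigmoid(z[l]) * (1 - sigmoid(z[l]))   # dz^l = da^l ⊙ f'(z^l)
    dW[l] = dz @ a[l].T         # a[l] here plays the role of a^{l-1}
    db[l] = dz
    da = W[l].T @ dz            # becomes da^{l-1} for the next iteration
```

Each gradient matrix ends up with exactly the shape of the parameter it corresponds to, which is what makes the update step possible.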

Updating Weights and Biases

After computing the gradients for all layers, the weights and biases are updated using the gradient descent algorithm:

\begin{aligned} W^l &= W^l - \alpha \cdot dW^l,\\ b^l &= b^l - \alpha \cdot db^l. \end{aligned}

Here, \alpha is the learning rate, a hyperparameter that controls how much the weights and biases are adjusted during each update step.
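The update rule itself is a one-liner per parameter; the values below are made up purely to show the arithmetic:

```python
import numpy as np

alpha = 0.1                         # learning rate
W = np.array([[1.0, -2.0]])
b = np.array([[0.5]])
dW = np.array([[0.2, 0.4]])
db = np.array([[-0.1]])

W = W - alpha * dW                  # -> [[0.98, -2.04]]
b = b - alpha * db                  # -> [[0.51]]
```

Each parameter moves a small step against its gradient: a positive gradient means increasing the parameter would increase the loss, so the parameter is decreased, and vice versa.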


SectionΒ 2. ChapterΒ 7
