Backward Propagation
Backward propagation, or backpropagation, is the process of determining how the loss function changes with respect to each parameter in the neural network. The goal is to adjust these parameters in a way that reduces the overall loss.
This process works together with the gradient descent algorithm: backpropagation computes the derivatives of the loss with respect to each layer's pre-activation values (the raw outputs before applying the activation function) and propagates them backward through the network, and gradient descent then uses these derivatives to update the parameters.
Since every layer contributes to the final prediction, the gradients are computed step by step:
- Perform forward propagation to obtain the outputs;
- Compute the derivative of the loss with respect to the output pre-activation;
- Propagate this derivative backward through the layers using the chain rule;
- Compute and use the gradients for weights and biases to update them during training.
Gradients represent the rate of change of a function with respect to its inputs, meaning they are its derivatives. They indicate how much a small change in weights, biases, or activations affects the loss function, guiding the model's learning process through gradient descent.
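As a minimal illustration of this idea, the sketch below uses a toy one-parameter loss (the values of `x`, `y`, and `w` are arbitrary and not part of the network described later) and compares the analytical derivative with a finite-difference estimate:

```python
# Toy loss L(w) = (w * x - y)^2 with a single parameter w.
x, y = 2.0, 3.0
w = 0.5

loss = (w * x - y) ** 2            # current loss value
grad = 2 * (w * x - y) * x         # analytical derivative dL/dw

# Numerical check: finite-difference approximation of the same derivative
eps = 1e-6
grad_numeric = ((w + eps) * x - y) ** 2 - ((w - eps) * x - y) ** 2
grad_numeric /= 2 * eps

print(grad, grad_numeric)          # both are approximately -8.0
```

The negative gradient tells gradient descent to increase `w` in this example, since doing so reduces the loss.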
Notation
To make the explanation clearer, let's use the following notation:
- $W^l$ is the weight matrix of layer $l$;
- $b^l$ is the vector of biases of layer $l$;
- $z^l$ is the vector of pre-activations of layer $l$;
- $a^l$ is the vector of activations of layer $l$.
Therefore, setting $a^0$ to $x$ (the inputs), forward propagation in a perceptron with $n$ layers can be described as the following sequence of operations:
$$
\begin{aligned}
a^0 &= x, \\
z^1 &= W^1 a^0 + b^1, & a^1 &= f^1(z^1), \\
&\ \ \vdots \\
z^l &= W^l a^{l-1} + b^l, & a^l &= f^l(z^l), \\
&\ \ \vdots \\
z^n &= W^n a^{n-1} + b^n, & a^n &= f^n(z^n), \\
\hat{y} &= a^n.
\end{aligned}
$$

To describe backpropagation mathematically, let's introduce the following notation:
- $da^l$: derivative of the loss with respect to the activations at layer $l$;
- $dz^l$: derivative of the loss with respect to the pre-activations at layer $l$ (before applying the activation function);
- $dW^l$: derivative of the loss with respect to the weights at layer $l$;
- $db^l$: derivative of the loss with respect to the biases at layer $l$.
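Before moving on to backpropagation, the forward pass defined above can be sketched in NumPy. The layer sizes, the sigmoid activation, and the random initialization below are assumptions made only for this example:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative two-layer perceptron: 3 inputs -> 4 hidden neurons -> 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

x = rng.normal(size=(3, 1))   # a^0 = x, a column vector of shape (n_inputs, 1)

# Forward propagation, mirroring z^l = W^l a^(l-1) + b^l and a^l = f^l(z^l)
a0 = x
z1 = W1 @ a0 + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)              # y_hat = a^n
```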
Computing Gradients for the Output Layer
At the final layer $n$, the first step is to compute the gradient of the loss with respect to the activations of the output layer, denoted as $da^n$.
Then, using the chain rule, the gradient of the loss with respect to the pre-activations of the output layer is calculated as:
$$
dz^n = da^n \odot f'^n(z^n)
$$

Here, $f'^n(z^n)$ represents the derivative of the activation function at layer $n$, and the symbol $\odot$ denotes element-wise multiplication.
The symbol $\odot$ denotes element-wise multiplication, which means each element of one vector is multiplied by the corresponding element of another vector. In contrast, the symbol $\cdot$ represents the dot product, used for standard matrix or vector multiplication. The term $f'^n$ refers to the derivative of the activation function at the output layer.
This value indicates how sensitive the loss function is to changes in the pre-activation values of the output layer.
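To make the difference between the two operations concrete, here is a small NumPy illustration (the vectors and the matrix are arbitrary examples):

```python
import numpy as np

v = np.array([[1.0], [2.0], [3.0]])   # column vector, shape (3, 1)
u = np.array([[4.0], [5.0], [6.0]])   # column vector, shape (3, 1)
M = np.ones((2, 3))                   # matrix, shape (2, 3)

elementwise = v * u    # element-wise product, shape (3, 1): [[4.], [10.], [18.]]
dot_product = M @ v    # standard matrix-vector product, shape (2, 1): [[6.], [6.]]
```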
After computing $dz^n$, the next step is to calculate the gradients for the weights and biases:
$$
\begin{aligned}
dW^n &= dz^n \cdot (a^{n-1})^T, \\
db^n &= dz^n.
\end{aligned}
$$

These gradients describe how much each weight and bias in the output layer should be adjusted to reduce the loss.
Here, $(a^{n-1})^T$ is the transposed activation vector from the previous layer. If the original vector has the shape $n_{\text{neurons}} \times 1$, its transpose has the shape $1 \times n_{\text{neurons}}$.
To continue the backward propagation, the derivative of the loss with respect to the activations of the previous layer is computed as:
$$
da^{n-1} = (W^n)^T \cdot dz^n
$$

This expression allows the error signal to be passed backward through the network, enabling the adjustment of earlier layers during training.
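Continuing the illustrative NumPy sketch from the forward pass above (and assuming, for the example only, the squared error loss $L = \frac{1}{2}(\hat{y} - y)^2$, so that $da^n = \hat{y} - y$), the output-layer gradients can be computed as follows:

```python
# Assumed loss: L = 0.5 * (a2 - y_true)^2, hence dL/da^n = a2 - y_true
y_true = np.array([[1.0]])
da2 = a2 - y_true

def sigmoid_derivative(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

# dz^n = da^n ⊙ f'^n(z^n)
dz2 = da2 * sigmoid_derivative(z2)

# dW^n = dz^n · (a^(n-1))^T  and  db^n = dz^n
dW2 = dz2 @ a1.T        # shape (1, 4), same as W2
db2 = dz2               # shape (1, 1), same as b2

# da^(n-1) = (W^n)^T · dz^n, the error signal passed back to the hidden layer
da1 = W2.T @ dz2        # shape (4, 1), same as a1
```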
Propagating Gradients to the Hidden Layers
For each hidden layer $l$, the procedure is the same (the corresponding formulas are written out after this list). Given $da^l$:
- Compute the derivative of the loss with respect to the pre-activations;
- Compute the gradients for the weights and biases;
- Compute $da^{l-1}$ to propagate the derivative backward.
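Written out explicitly, these steps mirror the output-layer formulas:

$$
\begin{aligned}
dz^l &= da^l \odot f'^l(z^l), \\
dW^l &= dz^l \cdot (a^{l-1})^T, \\
db^l &= dz^l, \\
da^{l-1} &= (W^l)^T \cdot dz^l.
\end{aligned}
$$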
This process is repeated for each preceding layer, step by step, until the input layer is reached.
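One way to picture this repetition is a loop that walks the layers in reverse order. The helper below is a hypothetical sketch, not code from this lesson, and it assumes every layer uses the same activation function:

```python
def backward_pass(weights, zs, activations, da_last, activation_derivative):
    """Hypothetical helper: given the cached pre-activations zs = [z^1, ..., z^n],
    the activations of the preceding layers activations = [a^0, ..., a^(n-1)],
    and da_last = da^n, return the gradients for every layer."""
    grads_W, grads_b = [], []
    da = da_last
    # Walk the layers from the last one down to the first
    for l in reversed(range(len(weights))):
        dz = da * activation_derivative(zs[l])      # dz^l = da^l ⊙ f'^l(z^l)
        grads_W.insert(0, dz @ activations[l].T)    # dW^l = dz^l · (a^(l-1))^T
        grads_b.insert(0, dz)                       # db^l = dz^l
        da = weights[l].T @ dz                      # da^(l-1) for the next iteration
    return grads_W, grads_b
```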
Updating Weights and Biases
After computing the gradients for all layers, the weights and biases are updated using the gradient descent algorithm:
$$
\begin{aligned}
W^l &= W^l - \alpha \cdot dW^l, \\
b^l &= b^l - \alpha \cdot db^l.
\end{aligned}
$$
Here, $\alpha$ is the learning rate, a hyperparameter that determines the size of the adjustment applied to the weights and biases during each update step.
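To close the illustrative NumPy example, a single update step for the output layer could look like this (the learning rate value is an arbitrary assumption):

```python
alpha = 0.1  # assumed learning rate (hyperparameter)

# Gradient descent update: W^l = W^l - alpha * dW^l and b^l = b^l - alpha * db^l
W2 -= alpha * dW2
b2 -= alpha * db2
# The same update is applied to every other layer once its gradients are known.
```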