Author: CookedSashimi

Deriving backpropagation using diagrams

Introduction

After writing my first blog post on deriving batch backpropagation, it always bothered me that the post seemed to have too few diagrams. So, in this post, my goal is to derive batch backpropagation using as many diagrams as possible.

First, let me introduce you to a diagram that we will be using for the remainder of the post.

Figure 1: One layer feed-forward neural network with squared loss, and a batch size of 2

It might not seem obvious at first, but the diagram above shows a one layer feed-forward neural network, with a batch size of two (denoted by the light and dark red colors), that outputs a single value which is scored by the squared error loss. Pictures probably explain it better than I do, though. Below is a diagram of the first row of the data matrix X multiplied by the first column of the weight matrix W^{1}.

Figure 2: First row of the data matrix, multiplied by the first column of the first layer of weights

I’m going to assume that you already have a good idea of what happens in the forward pass, so the diagram above should (hopefully) make sense (though if not, please comment down below and I will explain further!). Now, before we get into deriving backpropagation, let’s first define some notation.

Notation

Input variables:

X = \begin{bmatrix} x_{1, 1} & x_{1, 2} \\ x_{2, 1} & x_{2, 2} \end{bmatrix}

First layer weights:

W^{1} = \begin{bmatrix} w^{1}_{1, 1} & w^{1}_{1, 2} &w^{1}_{1, 3} \\ w^{1}_{2, 1} & w^{1}_{2, 2} & w^{1}_{2, 3} \end{bmatrix}

First layer bias:

B^{1} = \begin{bmatrix} b^{1}_1 & b^{1}_2 & b^{1}_3 \\b^{1}_1 & b^{1}_2 & b^{1}_3 \end{bmatrix}

First layer linear combination:

L^{1} = XW^{1} + B^{1}= \begin{bmatrix} l^{1}_{1,1} & l^{1}_{1, 2} & l^{1}_{1, 3} \\ l^{1}_{2, 1} & l^{1}_{2, 2} & l^{1}_{2, 3} \end{bmatrix} = \begin{bmatrix} x_{1, 1} w^{1}_{1, 1} + x_{1, 2} w^{1}_{2, 1}+b^{1}_1 & x_{1, 1} w^{1}_{1, 2} + x_{1, 2} w^{1}_{2, 2}+b^{1}_2 & x_{1, 1} w^{1}_{1, 3} + x_{1, 2} w^{1}_{2, 3}+b^{1}_3 \\ x_{2, 1} w^{1}_{1, 1} + x_{2, 2} w^{1}_{2, 1}+b^{1}_1 & x_{2, 1} w^{1}_{1, 2} + x_{2, 2} w^{1}_{2, 2}+b^{1}_2 & x_{2, 1} w^{1}_{1, 3} + x_{2, 2} w^{1}_{2, 3}+b^{1}_3 \end{bmatrix}

First hidden layer:

H^{1} =\begin{bmatrix} \sigma(l^{1}_{1,1}) & \sigma(l^{1}_{1, 2}) &\sigma(l^{1}_{1, 3}) \\ \sigma(l^{1}_{2, 1}) & \sigma(l^{1}_{2, 2}) & \sigma(l^{1}_{2, 3}) \end{bmatrix}

Output layer weights:

W^{2} = \begin{bmatrix} w^{2}_{1, 1} \\ w^{2}_{2,1} \\ w^{2}_{3,1} \end{bmatrix}

Output layer bias:

B^{2} = \begin{bmatrix} b^{2}_1 \\ b^{2}_1\end{bmatrix}

Output layer:

\hat{Y} = H^{1}W^{2} + B^{2} = \begin{bmatrix} \hat{y}_{1,1} \\ \hat{y}_{2,1} \end{bmatrix}

Loss:

L(\hat{y}_{1,1}, y_{1,1}, \hat{y}_{2,1}, y_{2,1}) = \frac{1}{2}\sum_{i=1}^{2}(\hat{y}_{i,1} - y_{i,1})^{2}
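To make the shapes concrete, here is a minimal NumPy sketch of this forward pass. The numerical values are made up and the variable names (X, W1, b1, W2, b2, Y) are my own; only the shapes match the notation above. We will keep extending this sketch as we derive the backward pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X  = np.array([[0.1, 0.2],
               [0.3, 0.4]])                    # data matrix, 2 samples x 2 inputs
W1 = np.array([[ 0.1, -0.2,  0.3],
               [ 0.4,  0.5, -0.6]])            # first layer weights, 2 x 3
b1 = np.array([0.01, 0.02, 0.03])              # first layer bias, broadcast over the batch
W2 = np.array([[0.7], [-0.8], [0.9]])          # output layer weights, 3 x 1
b2 = np.array([0.05])                          # output layer bias
Y  = np.array([[1.0], [0.0]])                  # targets y_{1,1}, y_{2,1}

L1   = X @ W1 + b1                             # first layer linear combination, 2 x 3
H1   = sigmoid(L1)                             # first hidden layer, 2 x 3
Yhat = H1 @ W2 + b2                            # output layer, 2 x 1
loss = 0.5 * np.sum((Yhat - Y) ** 2)           # squared error loss
```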

Deriving backpropagation by following lines

Before we start getting into the details, we should first understand how backpropagation works from a high level. The goal of backpropagation is to calculate how much the final error value changes given a change in some node (be it a weight node or a hidden variable node).

This is done through the accumulation of gradients, starting from the root node (our loss node) and working back to the leaf nodes (and any nodes along the way) that we are interested in, usually our weight and hidden variable nodes. These gradients are represented as arrows in the diagrams shown below.

In fact, all that you need to do to understand backpropagation is follow the arrows, multiplying their values along the way until you reach the end. This is the approach that we will be taking: we will slowly build up arrows that lead from the loss node to our input variable nodes.

During backpropagation there are 5 steps that you will need to follow, assuming that we have a neural network with L layers:

  1. Assuming we are on layer l in the backward pass, use the gradients accumulated from the previous layers, which we will denote using the matrix \mathbf{\delta^{L - l}}, to calculate the accumulated gradients for the weights of the current layer
  2. Using \mathbf{\delta^{L - l}}, calculate the accumulated gradients for the biases of the current layer
  3. Using \mathbf{\delta^{L - l}}, calculate the accumulated gradients for the inputs of the current layer
  4. Using the result from the previous step, calculate the accumulated gradients for the inputs of the current layer before the non-linearity, and then abstract these values into a matrix \mathbf{\delta^{L - (l - 1)}}
  5. Go back to step 1, except now we are on layer l - 1

The loss layer: setting up \mathbf{\delta^{1}}

As mentioned above, each of the arrows represents the change in the root node with respect to the leaf node(s). For example, in figure 3, the root node is our loss function, and the leaf nodes are our predicted values.

Figure 3: The loss layer

As we are using the squared error loss, the derivative of our loss function with respect to our predicted values \hat{Y} is relatively simple. So, by doing some simple math, we should be able to calculate:

\frac{\partial{L(\hat{y}_{1,1}, y_{1,1}, \hat{y}_{2,1}, y_{2,1})}}{\partial{\hat{Y}}} = \begin{bmatrix} \frac{\partial{L(\hat{y}_{1,1}, y_{1,1}, \hat{y}_{2,1}, y_{2,1})}}{\partial{\hat{y_{1,1}}}} \\ \frac{\partial{L(\hat{y}_{1,1}, y_{1,1}, \hat{y}_{2,1}, y_{2,1})}}{\partial{\hat{y_{2,1}}}} \end{bmatrix} = \begin{bmatrix} \hat{y_{1,1}} - y_{1,1} \\ \hat{y_{2,1}} - y_{2,1} \end{bmatrix}

Now we make the abstraction; I will denote:

\mathbf{\delta^1} = \begin{bmatrix}\delta^{1}_{1,1} \\ \delta^{1}_{2, 1}\end{bmatrix} = \begin{bmatrix} \hat{y_{1,1}} - y_{1,1} \\ \hat{y_{2,1}} - y_{2,1} \end{bmatrix}

So let’s recap what \mathbf{\delta^1} means. Each row of \mathbf{\delta^1} represents the derivative of the loss function with respect to one of our predicted values. Another way to think about it is as the accumulated gradients of the loss function up to the nodes, in our case \hat{y}_{1,1} and \hat{y}_{2,1}. In fact, the latter interpretation is fundamental to deriving the backpropagation algorithm, as the \mathbf{\delta^{L-l}} matrix holds all of the accumulated gradients from the previous layers and allows us to abstract away those previous calculations! If you don’t get what I mean, keep reading and hopefully it will become clearer.
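In the NumPy sketch from the Notation section (where Yhat and Y were defined), this abstraction is just one line:

```python
delta1 = Yhat - Y   # 2 x 1; row i holds the gradient accumulated up to yhat_{i,1}
```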

Alright sweet, now let us move onwards! The next step will be to calculate the gradients for the weights.

Step 1: Calculating the accumulated gradients for W^{2}

To start, we won’t be calculating the accumulated gradients for all the output weights at once; rather, we will start by only considering w^{2}_{1,1}. Once we have seen one example, calculating the accumulated gradients for w^{2}_{2,1} and w^{2}_{3,1} is very similar.

Figure 4: Accumulated gradients for w^2_{1, 1}

In figure 4, we see that for w^{2}_{1,1}, there are two arrows leading into it, one from \hat{y_{1,1}} and one from \hat{y_{2,1}}. If you haven’t already, you should try to understand why this is happening by yourself before reading the explanation!

Now for the explanation… Since w^{2}_{1,1} is used to calculate both \hat{y}_{1,1} and \hat{y}_{2,1}, a change in w^{2}_{1,1} would change both \hat{y}_{1,1} and \hat{y}_{2,1}! Therefore, when w^{2}_{1,1} changes it is able to change the final loss in two ways, through \hat{y}_{1,1} and \hat{y}_{2,1}, and as a result we have two arrows (representing gradients) heading into w^{2}_{1,1}.

Now, looking at the diagram, we can follow the arrows from the loss node that lead to the weight node. We can see that the accumulated gradient for w^{2}_{1,1} is simply:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{1,1}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \frac{\partial{\hat{y}_{1, 1}}}{\partial{w^{2}_{1,1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \frac{\partial{\hat{y}_{2, 1}}}{\partial{w^{2}_{1,1}}}

= (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,1})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,1})

Now, if we follow the same steps and trace the arrows that lead from the loss node to w^{2}_{2,1} and w^{2}_{3,1}, we can see that:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{i,1}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \frac{\partial{\hat{y}_{1, 1}}}{\partial{w^{2}_{i,1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \frac{\partial{\hat{y}_{2, 1}}}{\partial{w^{2}_{i,1}}}

= (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,i})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,i})

Now we do this for all the weights, and we end up with the diagram below!

Figure 5: Accumulated gradients for all the output weights

 

Step 2: Calculating the accumulated gradients for B^{2}

To calculate the accumulated gradients for our bias, we simply follow the same procedure that we used for the weights!

Figure 6: Accumulated gradients for b^{2}_{1}

Again, there are two arrows pointing into b^{2}_{1}, as changing b^{2}_{1} would change both \hat{y}_{1, 1} and \hat{y}_{2, 1}. Now, to calculate the accumulated gradients, we simply add up the gradients (like before!).

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{b^{2}_{1}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \frac{\partial{\hat{y}_{1, 1}}}{\partial{b^{2}_{1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \frac{\partial{\hat{y}_{2, 1}}}{\partial{b^{2}_{1}}}

= (\hat{y}_{1, 1} - y_{1, 1})(1)+(\hat{y}_{2, 1} - y_{2, 1})(1)

= (\hat{y}_{1, 1} - y_{1, 1}) +(\hat{y}_{2, 1} - y_{2, 1})

Now let’s move on to calculating the accumulated gradients with respect to the first hidden layer!

Step 3: Calculating the accumulated gradients for H^{1}

To calculate the accumulated gradients for the hidden layer, we again just follow the lines!

Figure 7: Accumulated gradients for H^{1}

Looking at figure 7, we can see that:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{i,j}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{i,1}}} \frac{\partial{\hat{y}_{i,1}}}{\partial{h^{1}_{i,j}}}=(\hat{y_{i,1}}-y_{i,1})(w^{2}_{j,1})

Summary of the accumulated gradients for the output layer

So we have successfully calculated the accumulated gradients for the output layer! Here is how our diagram looks!

Figure 8: The loss and output layer

Now, let’s take a look at the equations for the accumulated gradients that we derived in the previous sections:

The weights:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{i,1}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \frac{\partial{\hat{y}_{1, 1}}}{\partial{w^{2}_{i,1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \frac{\partial{\hat{y}_{2, 1}}}{\partial{w^{2}_{i,1}}}

= (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,i})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,i})

The bias:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{b^{2}_1}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \frac{\partial{\hat{y}_{1, 1}}}{\partial{b^{2}_{1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \frac{\partial{\hat{y}_{2, 1}}}{\partial{b^{2}_{1}}}

= (\hat{y}_{1, 1} - y_{1, 1}) +(\hat{y}_{2, 1} - y_{2, 1})

The hidden variables:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{i,j}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{i,1}}} \frac{\partial{\hat{y}_{i,1}}}{\partial{h^{1}_{i,j}}}

=(\hat{y_{i,1}}-y_{i,1})(w^{2}_{j,1})

Now, we can see that \hat{y}_{1, 1} - y_{1, 1} and \hat{y}_{2, 1} - y_{2, 1} appear in all the equations. So we know that our equations will involve:

\mathbf{\delta^1} = \begin{bmatrix}\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \\\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \end{bmatrix} = \begin{bmatrix} \hat{y_{1, 1}} - y_{1, 1} \\ \hat{y_{2, 1}} - y_{2, 1} \end{bmatrix}

Looking at the accumulated gradients for the weights W^{2} we can see that:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{W^2}} = \begin{bmatrix} \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{1,1}}} \\ \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{2,1}}} \\ \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{3,1}}}\end{bmatrix}

=\begin{bmatrix} (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,1})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,1}) \\ (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,2})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,2}) \\ (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,3})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,3}) \end{bmatrix}

= \begin{bmatrix} h^1_{1, 1} & h^1_{2, 1} \\ h^1_{1, 2} & h^1_{2, 2}\\ h^1_{1, 3} & h^1_{2, 3} \end{bmatrix} \begin{bmatrix} \hat{y_{1, 1}} - y_{1, 1} \\ \hat{y_{2, 1}} - y_{2, 1} \end{bmatrix}

= (H^{1})^{T} \mathbf{\delta^{1}}

For the biases, we simply sum up columns of \mathbf{\delta^{1}}:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{B^2}} = \begin{bmatrix}\sum_{i=1}^{2}\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{i, 1}}}\end{bmatrix}

= \begin{bmatrix} (\hat{y}_{1, 1} - y_{1, 1}) +(\hat{y}_{2, 1} - y_{2, 1}) \end{bmatrix}

= \begin{bmatrix} \sum_{i=1}^{2}\delta^{1}_{i,1}\end{bmatrix}

Finally, for the hidden states H^{1}:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{H^1}} = \begin{bmatrix} \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,1}}} & \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,2}}} & \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,3}}} \\\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,1}}} &\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,2}}} &\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,3}}}\end{bmatrix}

= \begin{bmatrix} (\hat{y_{1, 1}}-y_{1, 1})(w^{2}_{1,1}) &(\hat{y_{1, 1}}-y_{1, 1})(w^{2}_{2,1}) &(\hat{y_{1, 1}}-y_{1, 1})(w^{2}_{3,1}) \\(\hat{y_{2, 1}}-y_{2, 1})(w^{2}_{1,1}) &(\hat{y_{2, 1}}-y_{2, 1})(w^{2}_{2,1}) &(\hat{y_{2, 1}}-y_{2, 1})(w^{2}_{3,1}) \end{bmatrix}

= \begin{bmatrix} \hat{y_{1, 1}} - y_{1, 1} \\ \hat{y_{2, 1}} - y_{2, 1} \end{bmatrix} \begin{bmatrix} w^{2}_{1, 1} & w^{2}_{2,1} & w^{2}_{3,1} \end{bmatrix}

=\mathbf{\delta^{1}}(W^2)^{T}
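Continuing the NumPy sketch (delta1, H1 and W2 as defined earlier), the three output-layer results collapse to three lines; the gradient names dW2, dB2 and dH1 are my own:

```python
dW2 = H1.T @ delta1        # (H^1)^T delta^1, shape 3 x 1
dB2 = delta1.sum(axis=0)   # column sum of delta^1 over the two samples
dH1 = delta1 @ W2.T        # delta^1 (W^2)^T, shape 2 x 3
```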

Step 4: Calculating the accumulated gradients for L^{1}

Alright! So we are nearly there. For the last step we need to find the derivative of the hidden layer with respect to the linear layer, i.e. push the gradients through the non-linearity; in the case of our neural network, we have used a sigmoid function! This is the crucial step, as after it we can simply rinse and repeat.

A sigmoid function is defined as:

\sigma(\cdot) = \frac{1}{1+exp(-(\cdot))}

Its derivative is:

\sigma'(\cdot) = \sigma(\cdot)(1-\sigma(\cdot))

I will leave the derivation of the derivative as an exercise! For now we will continue on with the last step.

Figure 9: Accumulated gradients for L^{1}

Looking at the diagram above, if we follow the arrows from the loss to the nodes in the linear layer L^{1}, we should be able to see that for each node in the linear layer:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{i,j}}}=\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^1_{i,j}}}\frac{\partial{h^1_{i,j}}}{\partial{l^1_{i,j}}}

Since each element of H^1 is just the corresponding element of L^{1} with a sigmoid applied to it:

\frac{\partial{h^1_{i,j}}}{\partial{l^1_{i,j}}} =\frac{\partial{\sigma (l^1_{i,j})}}{\partial{l^1_{i,j}}}  = \sigma(l^1_{i,j})(1-\sigma(l^1_{i,j}))

Or put more simply, the accumulated gradients from the loss layer up to the inputs of the layer before the non-linearity are calculated as:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{L^{1}}}

=\begin{bmatrix} \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,1}}} & \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,2}}} & \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,3}}} \\\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,1}}} &\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,2}}} &\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,3}}}\end{bmatrix} \circ \begin{bmatrix} \frac{\partial{h^{1}_{1,1}}}{\partial{l^{1}_{1,1}}} & \frac{\partial{h^{1}_{1,2}}}{\partial{l^{1}_{1,2}}} & \frac{\partial{h^{1}_{1,3}}}{\partial{l^{1}_{1,3}}} \\ \frac{\partial{h^{1}_{2,1}}}{\partial{l^{1}_{2,1}}} & \frac{\partial{h^{1}_{2,2}}}{\partial{l^{1}_{2,2}}} & \frac{\partial{h^{1}_{2,3}}}{\partial{l^{1}_{2,3}}} \end{bmatrix}

= \mathbf{\delta^{1}}(W^2)^{T} \circ {\sigma'(L^{1})}

There are two things that I should explain here. The circle \circ denotes the Hadamard product, which is an element-wise multiplication of two vectors/matrices with the same dimensions.

Also, I have abused notation a bit by denoting \sigma'(L^{1}) as:

{\sigma'(L^{1})} =\begin{bmatrix} \frac{\partial{\sigma(l^{1}_{1,1})}}{\partial{l^{1}_{1,1}}} & \frac{\partial{\sigma(l^{1}_{1,2})}}{\partial{l^{1}_{1,2}}} & \frac{\partial{\sigma(l^{1}_{1,3})}}{\partial{l^{1}_{1,3}}} \\ \frac{\partial{\sigma(l^{1}_{2,1})}}{\partial{l^{1}_{2,1}}} & \frac{\partial{\sigma(l^{1}_{2,2})}}{\partial{l^{1}_{2,2}}} & \frac{\partial{\sigma(l^{1}_{2,3})}}{\partial{l^{1}_{2,3}}} \end{bmatrix}=\begin{bmatrix} \frac{\partial{h^{1}_{1,1}}}{\partial{l^{1}_{1,1}}} & \frac{\partial{h^{1}_{1,2}}}{\partial{l^{1}_{1,2}}} & \frac{\partial{h^{1}_{1,3}}}{\partial{l^{1}_{1,3}}} \\ \frac{\partial{h^{1}_{2,1}}}{\partial{l^{1}_{2,1}}} & \frac{\partial{h^{1}_{2,2}}}{\partial{l^{1}_{2,2}}} & \frac{\partial{h^{1}_{2,3}}}{\partial{l^{1}_{2,3}}} \end{bmatrix}

Finally, we are able to make the abstraction for \mathbf{\delta^{2}}:

\mathbf{\delta^{2}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{L^{1}}} = \mathbf{\delta^{1}}(W^2)^{T} \circ {\sigma'(L^{1})}

The element in row i and column j of \mathbf{\delta^{2}} is the accumulated gradient from the loss up to l^{1}_{i,j}.
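In the running NumPy sketch this abstraction is, using the dH1 and H1 computed earlier:

```python
sigma_prime_L1 = H1 * (1.0 - H1)   # sigma'(L^1), written using H^1 = sigma(L^1), shape 2 x 3
delta2 = dH1 * sigma_prime_L1      # Hadamard product: delta^1 (W^2)^T o sigma'(L^1)
```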

Step 5: Rinse and repeat

Here is our updated diagram!

Figure 10: Accumulated gradients for L^{1}

I am not going to go into as much detail here, though I do urge you to follow the steps that we just went through to find the accumulated gradients for the weights, biases, and inputs of the input layer.

Figure 11: Accumulated gradients for w^{1}_{1,1}

Looking at figure 11 we can see that:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{1}_{i,j}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{1, j}}} \frac{\partial{l^1_{1, j}}}{\partial{w^{1}_{i,j}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{2, j}}} \frac{\partial{l^1_{2, j}}}{\partial{w^{1}_{i,j}}}

= \delta^{2}_{1,j}x_{1,i}+\delta^{2}_{2,j}x_{2,i}

Or more generally for any batch size B and any layer l:

\frac{\partial{L}}{\partial{w^{l}_{i,j}}} = \sum_{b=1}^{B}\frac{\partial{L}}{\partial{l^l_{b, j}}} \frac{\partial{l^l_{b, j}}}{\partial{w^{l}_{i,j}}}=\sum_{b=1}^{B}\delta^{L-l}_{b,j}h^{l-1}_{b,i}

Which is the (i, j)th element of:

\frac{\partial{L}}{\partial{W^{l}}} = (H^{l-1})^T\mathbf{\delta^{L-l}}

where:

H^0 = X

Note that in our neural network L = 3: we have the input layer, the hidden layer, and the output layer.
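In the running sketch (with delta2 and X from before), this is:

```python
dW1 = X.T @ delta2   # (H^0)^T delta^2 with H^0 = X, shape 2 x 3, matching W1
```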

Figure 12: Accumulated gradients for b^{1}_{1}

Looking at figure 12 we can see that:

\frac{\partial{L}}{\partial{B^{1}}} = \begin{bmatrix} \sum_{b=1}^{2}\delta^{2}_{b,1} & \sum_{b=1}^{2}\delta^{2}_{b,2} & \sum_{b=1}^{2}\delta^{2}_{b,3} \end{bmatrix}

Or more generally, for any batch size B and layer l:

\frac{\partial{L}}{\partial{B^{l}}} = \begin{bmatrix} \sum_{b=1}^{B}\delta^{L-l}_{b,1} & \dots & \sum_{b=1}^{B}\delta^{L-l}_{b,D^{l}} \end{bmatrix}

where:

D^{l} is the number of hidden nodes in layer l
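In the running sketch this is again a column sum over the batch:

```python
dB1 = delta2.sum(axis=0)   # one entry per hidden node, shape (3,)
```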

Figure 13: Accumulated gradients for x_{1,1} or alternatively h^0_{1,1}

Looking at figure 13, we can see that:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{x_{1,1}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{1, 1}}} \frac{\partial{l^1_{1, 1}}}{\partial{x_{1,1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{1, 2}}} \frac{\partial{l^1_{1, 2}}}{\partial{x_{1,1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{1, 3}}} \frac{\partial{l^1_{1, 3}}}{\partial{x_{1,1}}}

=\delta^2_{1,1}w^{1}_{1,1} + \delta^2_{1,2}w^{1}_{1,2} + \delta^2_{1,3}w^{1}_{1,3}

Or more generally, for any batch size B and layer l:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{x_{b,j}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^l_{b, 1}}} \frac{\partial{l^l_{b, 1}}}{\partial{x_{b,j}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^l_{b, 2}}} \frac{\partial{l^l_{b, 2}}}{\partial{x_{b,j}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^l_{b, 3}}} \frac{\partial{l^l_{b, 3}}}{\partial{x_{b,j}}}

=\delta^{L-l}_{b,1}w^{l}_{j,1} + \delta^{L-l}_{b,2}w^{l}_{j,2} + \delta^{L-l}_{b,3}w^{l}_{j,3}

Which is the (b, j)th element of:

\frac{\partial{L}}{\partial{H^{l-1}}} = \mathbf{\delta^{L-l}}(W^{l})^T
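And the last piece of the running sketch, the gradient with respect to the inputs:

```python
dX = delta2 @ W1.T   # delta^2 (W^1)^T, shape 2 x 2, matching X
```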

 

Conclusion

So we have derived the backpropagation equations:

\frac{\partial{L}}{\partial{W^{l}}} = (H^{l-1})^T\mathbf{\delta^{L-l}}

\frac{\partial{L}}{\partial{B^{l}}} = \begin{bmatrix} \sum_{b=1}^{B}\delta^{L-l}_{b,1} & \dots & \sum_{b=1}^{B}\delta^{L-l}_{b,D^{l}} \end{bmatrix}

\frac{\partial{L}}{\partial{H^{l-1}}} = \mathbf{\delta^{L-l}}(W^{l})^T
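Putting the whole derivation together, here is a self-contained NumPy sketch of one forward and backward pass for this network. The function name and the toy values are my own, but the gradient lines follow the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(X, Y, W1, b1, W2, b2):
    """One forward and backward pass of the two layer network in this post."""
    # Forward pass
    L1 = X @ W1 + b1
    H1 = sigmoid(L1)
    Yhat = H1 @ W2 + b2
    loss = 0.5 * np.sum((Yhat - Y) ** 2)

    # Backward pass, following the equations above
    delta1 = Yhat - Y                          # accumulated gradients up to Yhat
    dW2 = H1.T @ delta1                        # (H^1)^T delta^1
    dB2 = delta1.sum(axis=0)                   # column sums of delta^1
    delta2 = (delta1 @ W2.T) * H1 * (1 - H1)   # delta^1 (W^2)^T o sigma'(L^1)
    dW1 = X.T @ delta2                         # (H^0)^T delta^2, with H^0 = X
    dB1 = delta2.sum(axis=0)                   # column sums of delta^2
    return loss, dW1, dB1, dW2, dB2

# Usage with the shapes from the post: batch of 2, 2 inputs, 3 hidden nodes, 1 output
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(2, 2)), rng.normal(size=(2, 1))
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
print(forward_backward(X, Y, W1, b1, W2, b2)[0])
```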

If you are interested in playing around with the diagrams that I have used, here is the link (please copy the diagrams and place them in a new file so that other people can use the template). I would urge you to play around with it and derive the backpropagation equations in detail for the input layer!

Also, if you’re interested in a more non-trivial example please take a look at my other post which is an example of backpropagation in a four layer neural network using cross entropy loss.

I hope that this has made backpropagation a lot clearer! Please let me know in the comments below if I have made any mistakes or if you have any questions or any requests on what I should blog about next!


An example of backpropagation in a four layer neural network using cross entropy loss

Introduction

Update: I have written another post deriving backpropagation which has more diagrams and I recommend reading the aforementioned post first!

The backpropagation algorithm can be argued to be the most important contribution to the field of deep learning. In fact, it is because of this algorithm, and the increasing power of GPUs, that the deep learning era we are experiencing today is even possible!

There are many great articles online that explain how backpropagation works (my favourite is Christopher Olah’s post), but not many examples of backpropagation in a non-trivial setting. The examples I found online only showed backpropagation on simple neural networks (1 input layer, 1 hidden layer, 1 output layer), and they only used 1 sample of data during the backward pass.

The problem with using this simple example is twofold:

  1. It misses out on the main concept of the backpropagation algorithm: reusing the gradients of the previously calculated layers through matrix multiplications
  2. In practice, neural networks aren’t just trained by feeding them one sample at a time, but rather in batches (usually in powers of 2).

As a result, it was a struggle for me to make the mental leap from understanding how backpropagation worked in a trivial neural network to the current state of the art neural networks. These state of the art neural networks consist of many layers and are trained by feeding in batches of examples, not one by one.

So in this post I will attempt to work through the math of the backward pass of a four layer neural network; in the process of which I will explain how the backpropagation algorithm can be generalized to a neural network of arbitrary depth and take an arbitrary number of samples as input.

Diagram of our neural network

We start off by defining our 4 layer network:

Figure: Our four layer neural network

I am going to assume that you are comfortable with the forward pass in a neural network and so I’m going to jump straight into the backward pass! We will first start off with using only 1 sample in the backward pass, then afterwards we will see how to extend it to using more than 1 sample.

The output layer and loss function

The output layer of our neural network is a vector of probabilities from the softmax function, where the input to the softmax function is a vector \mathbf{z}^3:

\mathbf{z}^3 = \begin{bmatrix} z^3_1 & z^3_2 & z^3_3 \end{bmatrix}

The k^{th} element of the output for the softmax function is:

\hat{y}_k = \mathbf{\sigma}(\mathbf{z}^{3})_k = \frac{exp(z^{3}_k)}{\sum_{i=1}^{3}exp(z_i^3)}

As the output of the softmax function (which is also our output layer) is multi-valued, it can be succinctly represented by a vector. We will use the vector \mathbf{\hat{y}}, as these are the predicted probabilities:

\mathbf{\hat{y}} = \begin{bmatrix} \hat{y}_1 & \hat{y}_2 & \hat{y}_3 \end{bmatrix}

The output of the softmax function \mathbf{\hat{y}} is then used as input to our loss function, the cross entropy loss:

H(\mathbf{\hat{y}},\mathbf{y}) := - \sum_{i=1}^3 y_{i} \log (\hat{y}_i)

where \mathbf{y} is a one-hot vector.
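As a concrete reference point, here is a small NumPy sketch of this output layer and loss; the values of \mathbf{z}^3 and \mathbf{y} below are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

z3 = np.array([1.0, 2.0, 0.5])     # made-up pre-softmax values z^3
yhat = softmax(z3)                 # predicted probabilities, sums to 1
y = np.array([1.0, 0.0, 0.0])      # one-hot target
H = -np.sum(y * np.log(yhat))      # cross entropy loss
```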

Now we have all the information that we need to start the first step of the backpropagation algorithm! Our goal is to find how our loss function H changes with respect to \mathbf{z}^3. Since \mathbf{z}^3 has 3 variables, we need to find how H changes with each of them. We can simplify this problem by first examining how H changes with z^3_1.

Figure: The output layer with gradient arrows

Now our goal is to find:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_1}

By looking at the diagram above we see that this is simply:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_1} = \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_1} \frac{\partial\hat{y}_1}{\partial z^3_1} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_2}\frac{\partial\hat{y}_2}{\partial z^3_1} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_3} \frac{\partial\hat{y}_3}{\partial z^3_1}

If this is not obvious to you, please read Christopher Olah’s post on how back propagation works!

If we change z^3_1 to z^3_2 or z^3_3 it should be easy to convince yourself that:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_2} = \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_1} \frac{\partial\hat{y}_1}{\partial z^3_2} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_2}\frac{\partial\hat{y}_2}{\partial z^3_2} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_3} \frac{\partial\hat{y}_3}{\partial z^3_2}

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_3} = \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_1} \frac{\partial\hat{y}_1}{\partial z^3_3} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_2}\frac{\partial\hat{y}_2}{\partial z^3_3} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_3} \frac{\partial\hat{y}_3}{\partial z^3_3}

In fact, after staring at it for a while we can see that:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \begin{bmatrix} \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_1} &\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_2} & \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_3} \end{bmatrix} = \begin{bmatrix} \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_1} &\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_2} & \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_3} \end{bmatrix} \begin{bmatrix} \frac{\partial\hat{y}_1}{\partial z^3_1} & \frac{\partial\hat{y}_1}{\partial z^3_2} & \frac{\partial\hat{y}_1}{\partial z^3_3}\\ \frac{\partial\hat{y}_2}{\partial z^3_1} & \frac{\partial\hat{y}_2}{\partial z^3_2} & \frac{\partial\hat{y}_2}{\partial z^3_3}\\ \frac{\partial\hat{y}_3}{\partial z^3_1} & \frac{\partial\hat{y}_3}{\partial z^3_2} & \frac{\partial\hat{y}_3}{\partial z^3_3} \end{bmatrix}

Where:

\begin{bmatrix} \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_1} &\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_2} & \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_3} \end{bmatrix} = \begin{bmatrix} -\frac{y_1}{\hat{y}_1} & -\frac{y_2}{\hat{y}_2} & -\frac{y_3}{\hat{y}_3} \end{bmatrix} =\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\mathbf{\hat{y}}}

is the partial derivative of our loss function H with respect to its inputs \mathbf{\hat{y}}

 And:

\begin{bmatrix} \frac{\partial\hat{y}_1}{\partial z^3_1} & \frac{\partial\hat{y}_1}{\partial z^3_2} & \frac{\partial\hat{y}_1}{\partial z^3_3}\\ \frac{\partial\hat{y}_2}{\partial z^3_1} & \frac{\partial\hat{y}_2}{\partial z^3_2} & \frac{\partial\hat{y}_2}{\partial z^3_3}\\ \frac{\partial\hat{y}_3}{\partial z^3_1} & \frac{\partial\hat{y}_3}{\partial z^3_2} & \frac{\partial\hat{y}_3}{\partial z^3_3} \end{bmatrix} = \begin{bmatrix} \hat{y}_1(1-\hat{y}_1) & -\hat{y}_1\hat{y}_2 & -\hat{y}_1\hat{y}_3\\ -\hat{y}_2\hat{y}_1 & \hat{y}_2(1-\hat{y}_2) & -\hat{y}_2\hat{y}_3\\ -\hat{y}_3\hat{y}_1 & -\hat{y}_3\hat{y}_2 & \hat{y}_3(1-\hat{y}_3) \end{bmatrix}= \frac{\partial \mathbf{\hat{y}}}{\partial \mathbf{z}^3}

is the Jacobian of the softmax function (this might not be immediately obvious, but take it for granted for now; I might do a post on deriving the Jacobian of the softmax function in the future!).

One interesting observation is that the columns of the Jacobian represent the edges leading into z^3_1, z^3_2, z^3_3. For example, the first column of the Jacobian represents the edges leading into z^3_1 from \hat{y}_1,\hat{y}_2,\hat{y}_3.

In fact, this is something that we will see often during the derivation of backpropagation: the columns of the Jacobian between layer l and layer l-1 represent the edges leading from layer l into a node in layer l-1.

So we are nearly there! We know that:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\mathbf{\hat{y}}}\frac{\partial \mathbf{\hat{y}}}{\partial \mathbf{z}^3} =\begin{bmatrix} -\frac{y_1}{\hat{y}_1} & -\frac{y_2}{\hat{y}_2} & -\frac{y_3}{\hat{y}_3} \end{bmatrix} \begin{bmatrix} \hat{y}_1(1-\hat{y}_1) & -\hat{y}_1\hat{y}_2 & -\hat{y}_1\hat{y}_3\\ -\hat{y}_2\hat{y}_1 & \hat{y}_2(1-\hat{y}_2) & -\hat{y}_2\hat{y}_3\\ -\hat{y}_3\hat{y}_1 & -\hat{y}_3\hat{y}_2 & \hat{y}_3(1-\hat{y}_3) \end{bmatrix}

So after the matrix multiplications we get:

\begin{bmatrix} (\sum_{i \neq 1}^3 y_i \hat{y}_1)-y_1(1- \hat{y_1}) & (\sum_{i \neq 2}^3y_i \hat{y}_2)-y_2(1- \hat{y_2}) & (\sum_{i \neq 3}^3y_i \hat{y}_3)-y_3(1- \hat{y_3}) \end{bmatrix}

Since \mathbf{y} is a one-hot vector, only one element of \mathbf{y} will be 1 and the rest will be 0. Assuming that the class represented by the first element of \mathbf{y} is the true class, then:

\mathbf{y} = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}

The result of the matrix multiplications becomes:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \begin{bmatrix}-(1- \hat{y_1}) & \hat{y_2} & \hat{y_3} \end{bmatrix} =\begin{bmatrix} \hat{y_1} - 1 & \hat{y_2} & \hat{y_3} \end{bmatrix}

Again, it should be easy to convince yourself that if the class being represented by the second element of \mathbf{y} is true, then:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \begin{bmatrix} \hat{y_1} & \hat{y_2} - 1 & \hat{y_3} \end{bmatrix}

So after all that hard work we see that \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} can simply be represented by:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \mathbf{\hat{y}} - \mathbf{y} =\delta^1 = \begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}
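If you want to convince yourself of this simplification numerically, here is a quick NumPy check; the value of \hat{y} below is made up (any valid softmax output works).

```python
import numpy as np

yhat = np.array([0.5, 0.3, 0.2])           # any valid softmax output (positive, sums to 1)
y = np.array([1.0, 0.0, 0.0])              # one-hot target
J = np.diag(yhat) - np.outer(yhat, yhat)   # softmax Jacobian: J[i, j] = d yhat_i / d z^3_j
dH_dyhat = -y / yhat                       # dH / d yhat for the cross entropy loss
delta1 = dH_dyhat @ J                      # row vector times Jacobian
print(np.allclose(delta1, yhat - y))       # True: the whole product collapses to yhat - y
```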

I want to take a bit of time to explain the importance of abstracting away the derivatives of the previous layers using \delta^1. We could avoid using \delta^1 and keep writing \mathbf{\hat{y}} - \mathbf{y}; although that seems fine now, as we go deeper backwards into the network we would end up with a long chain of variables to keep track of. This is a problem because it would be near impossible to implement in code and the derivation would be a lot harder to understand.

Hidden Layer 2

The second hidden layer of our neural network produces the inputs \mathbf{z}^3 to our softmax function in the output layer, whereby \mathbf{z}^3 is defined as:

\mathbf{z}^3 = \mathbf{h}^2W^3+\mathbf{b}^3

where:

\mathbf{h}^2 = \begin{bmatrix} h^2_1 & h^2_2 & h^2_3 & h^2_4 \\\end{bmatrix}

W^3 = \begin{bmatrix} w_{1,1}^3 & w_{1,2}^3 & w_{1,3}^3 \\ w_{2,1}^3 & w_{2,2}^3 & w_{2,3}^3 \\ w_{3,1}^3 & w_{3,2}^3 & w_{3,3}^3 \\ w_{4,1}^3 & w_{4,2}^3 & w_{4,3}^3 \\\end{bmatrix}

\mathbf{b}^3 = \begin{bmatrix} b^3_1 & b^3_2 & b^3_3 \\\end{bmatrix}

So we know how to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} so now our goal is to find:

  • \frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}
  • \frac{\partial \mathbf{z}^3}{\partial W^3}
  • \frac{\partial \mathbf{z}^3}{\partial \mathbf{b}^3}

After we find these gradients we can simply multiply them by \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \delta^1 to get the gradients of our loss function H with respect to \mathbf{h}^2, W^3 and \mathbf{b}^3.

For example:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}= \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2} = \delta^1\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}

Finding \frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}:

We first start with trying to find \frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}; like before, we will make the problem easier by first trying to find \frac{\partial \mathbf{z}^3}{\partial{h}^2_1}.

Figure: Hidden layer 2 with gradient arrows

We can represent \frac{\partial \mathbf{z}^3}{\partial{h}^2_1} as a vector of partial derivatives like so:

\frac{\partial \mathbf{z}^3}{\partial{h}^2_1} = \begin{bmatrix}\frac{\partial z^3_1}{\partial{h}^2_1} \\\frac{\partial z^3_2}{\partial{h}^2_1} \\\frac{\partial z^3_3}{\partial{h}^2_1} \end{bmatrix}

By replacing h^2_1 with any other node in hidden layer 2, we can see that:

\frac{\partial \mathbf{z}^3}{\partial{h}^2_2} = \begin{bmatrix}\frac{\partial z^3_1}{\partial{h}^2_2} \\\frac{\partial z^3_2}{\partial{h}^2_2} \\\frac{\partial z^3_3}{\partial{h}^2_2} \end{bmatrix}

\frac{\partial \mathbf{z}^3}{\partial{h}^2_3} = \begin{bmatrix}\frac{\partial z^3_1}{\partial{h}^2_3} \\\frac{\partial z^3_2}{\partial{h}^2_3} \\\frac{\partial z^3_3}{\partial{h}^2_3} \end{bmatrix}

\frac{\partial \mathbf{z}^3}{\partial{h}^2_4} = \begin{bmatrix}\frac{\partial z^3_1}{\partial{h}^2_4} \\\frac{\partial z^3_2}{\partial{h}^2_4} \\\frac{\partial z^3_3}{\partial{h}^2_4} \end{bmatrix}

By concatenating these vectors we get \frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}:

\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2} =\begin{bmatrix}\frac{\partial \mathbf{z}^3}{\partial{h}^2_1} &\frac{\partial \mathbf{z}^3}{\partial{h}^2_2} &\frac{\partial \mathbf{z}^3}{\partial{h}^2_3} &\frac{\partial \mathbf{z}^3}{\partial{h}^2_4} \end{bmatrix}\ = \begin{bmatrix} \frac{\partial z^3_1}{\partial{h}^2_1} &\frac{\partial z^3_1}{\partial{h}^2_2}&\frac{\partial z^3_1}{\partial{h}^2_3}&\frac{\partial z^3_1}{\partial{h}^2_4}\\ \frac{\partial z^3_2}{\partial{h}^2_1} &\frac{\partial z^3_2}{\partial{h}^2_2} &\frac{\partial z^3_2}{\partial{h}^2_3}&\frac{\partial z^3_2}{\partial{h}^2_4} \\ \frac{\partial z^3_3}{\partial{h}^2_1}&\frac{\partial z^3_3}{\partial{h}^2_2}&\frac{\partial z^3_3}{\partial{h}^2_3}&\frac{\partial z^3_3}{\partial{h}^2_4} \end{bmatrix}

This is the Jacobian between layers \mathbf{z}^3 and \mathbf{h}^2. Again, we can see that the columns of the Jacobian represent the edges from all the nodes in layer \mathbf{z}^3 to a node in layer \mathbf{h}^2.

All that is left now is to find the derivatives in the matrix. We first start by finding \frac{\partial z^3_1}{\partial{h}^2_1}:

\frac{\partial z^3_1}{\partial{h}^2_1} =\frac{\partial}{\partial{h}^2_1}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1)=w_{1,1}^3

Let’s find a few more to see if we can find a pattern:

\frac{\partial z^3_1}{\partial{h}^2_2} =\frac{\partial}{\partial{h}^2_2}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1)=w_{2,1}^3

\frac{\partial z^3_1}{\partial{h}^2_3} =\frac{\partial}{\partial{h}^2_3}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1)=w_{3,1}^3

\frac{\partial z^3_1}{\partial{h}^2_4} =\frac{\partial}{\partial{h}^2_4}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1)=w_{4,1}^3

\frac{\partial z^3_2}{\partial{h}^2_1} =\frac{\partial}{\partial{h}^2_1}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2)=w_{1,2}^3

Hopefully, it shouldn’t be hard to convince yourself now that:

\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2} = \begin{bmatrix} w_{1,1}^3 & w_{2,1}^3 & w_{3,1}^3 & w_{4,1}^3 \\ w_{1,2}^3 &w_{2,2}^3 &w_{3,2}^3 &w_{4,2}^3 \\ w_{1,3}^3 &w_{2,3}^3 &w_{3,3}^3 &w_{4,3}^3 \end{bmatrix} = (W^3)^T

Again this is a pattern that will start to emerge as we continue on. The Jacobian between layers l and l-1 is simply the transpose of the weight matrix connecting them.

Now finally putting it all together:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}= \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2} = \delta^1 (W^3)^T

Note how we are able to express \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2} as a function of our previously calculated gradients \delta^1; there was no need for us to recalculate anything from scratch.

Finding \frac{\partial \mathbf{z}^3}{\partial W^3}:

At first this might seem very daunting, but again, if we just find \frac{\partial \mathbf{z}^3}{\partial w^3_{1,1}} first and work from there, we will see that it isn’t as hard as it looks!

\frac{\partial \mathbf{z}^3}{\partial w^3_{1,1}} = \begin{bmatrix}\frac{\partial z^3_1}{\partial w^3_{1,1}} \\\frac{\partial z^3_2}{\partial w^3_{1,1}} \\\frac{\partial z^3_3}{\partial w^3_{1,1}} \end{bmatrix} = \begin{bmatrix}\frac{\partial }{\partial w^3_{1,1}}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1) \\\frac{\partial }{\partial w^3_{1,1}}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2) \\ \frac{\partial }{\partial w^3_{1,1}}(h^2_1w^3_{1,3}+h^2_2w^3_{2,3}+h^2_3w^3_{3,3}+h^2_4w^3_{4,3}+b^3_3) \end{bmatrix} = \begin{bmatrix}h^2_1 \\0\\0 \end{bmatrix}

Let’s take 2 more elements of the weight matrix and find the derivative of \mathbf{z}^3 with respect to them:

\frac{\partial \mathbf{z}^3}{\partial w^3_{3,2}} = \begin{bmatrix}\frac{\partial z^3_1}{\partial w^3_{3,2}} \\\frac{\partial z^3_2}{\partial w^3_{3,2}} \\\frac{\partial z^3_3}{\partial w^3_{3,2}} \end{bmatrix} = \begin{bmatrix}\frac{\partial }{\partial w^3_{3,2}}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1) \\\frac{\partial }{\partial w^3_{3,2}}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2) \\ \frac{\partial }{\partial w^3_{3,2}}(h^2_1w^3_{1,3}+h^2_2w^3_{2,3}+h^2_3w^3_{3,3}+h^2_4w^3_{4,3}+b^3_3) \end{bmatrix} = \begin{bmatrix}0 \\h^2_3\\0 \end{bmatrix}

\frac{\partial \mathbf{z}^3}{\partial w^3_{2,3}} = \begin{bmatrix}\frac{\partial z^3_1}{\partial w^3_{2,3}} \\\frac{\partial z^3_2}{\partial w^3_{2,3}} \\\frac{\partial z^3_3}{\partial w^3_{2,3}} \end{bmatrix} = \begin{bmatrix}\frac{\partial }{\partial w^3_{2,3}}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1) \\\frac{\partial }{\partial w^3_{2,3}}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2) \\ \frac{\partial }{\partial w^3_{2,3}}(h^2_1w^3_{1,3}+h^2_2w^3_{2,3}+h^2_3w^3_{3,3}+h^2_4w^3_{4,3}+b^3_3) \end{bmatrix} = \begin{bmatrix}0 \\0\\h^2_2 \end{bmatrix}

So we can see that for any w^3_{i,j}:

\frac{\partial \mathbf{z}^3}{\partial w^3_{i,j}} = \begin{bmatrix}\frac{\partial z^3_1}{\partial w^3_{i,j}} \\\frac{\partial z^3_2}{\partial w^3_{i,j}} \\\frac{\partial z^3_3}{\partial w^3_{i,j}} \end{bmatrix} = \begin{bmatrix}\frac{\partial }{\partial w^3_{i,j}}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1) \\\frac{\partial }{\partial w^3_{i,j}}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2) \\ \frac{\partial }{\partial w^3_{i,j}}(h^2_1w^3_{1,3}+h^2_2w^3_{2,3}+h^2_3w^3_{3,3}+h^2_4w^3_{4,3}+b^3_3) \end{bmatrix} = \begin{bmatrix}\mathbf{1}_{(j=1)}h ^2_i\\\mathbf{1}_{(j=2)}h^2_i\\\mathbf{1}_{(j=3)}h^2_i\end{bmatrix}

where \mathbf{1}_{(\cdot)} is the indicator function: if the condition within the brackets is true it equals 1, otherwise it equals 0.

Now it might still seem a bit complicated due to the indicator functions, but regardless, let’s push on; later we will see that this simplifies to something very elegant! Let us now try to find \frac{\partial \mathbf{H(y, \hat{y}})}{\partial w^3_{i,j}}!

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial w^3_{i,j}} =\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial w^3_{i,j}}=\delta^1 \frac{\partial \mathbf{z}^3}{\partial w^3_{i,j}} =\begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}\begin{bmatrix}\mathbf{1}_{(j=1)}h ^2_i\\\mathbf{1}_{(j=2)}h^2_i\\\mathbf{1}_{(j=3)}h^2_i\end{bmatrix}

Now let’s try to find \frac{\partial \mathbf{H(y, \hat{y}})}{\partial w^3_{1,1}} and see if we can spot any patterns:

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial w^3_{1,1}} = \begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}\begin{bmatrix} h^2_1 \\ 0 \\ 0 \end{bmatrix} = \delta^1_{1}h^2_1

Let’s try a few more elements of W^3:

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial w^3_{3,2}} = \begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}\begin{bmatrix} 0 \\ h^2_3 \\ 0 \end{bmatrix} = \delta^1_{2}h^2_3

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial w^3_{2,3}} = \begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}\begin{bmatrix} 0  \\ 0 \\h^2_2 \end{bmatrix} = \delta^1_{3}h^2_2

Now if we stare at these examples a bit we can see that:

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial w^3_{i,j}} = \delta^1_{j}h^2_i

Now we can write out \frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^3}:

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^3} = \begin{bmatrix}\delta^1_{1}h^2_1 & \delta^1_{2}h^2_1 & \delta^1_{3}h^2_1 \\ \delta^1_{1}h^2_2 & \delta^1_{2}h^2_2 & \delta^1_{3}h^2_2 \\ \delta^1_{1}h^2_3 & \delta^1_{2}h^2_3 & \delta^1_{3}h^2_3\\ \delta^1_{1}h^2_4 & \delta^1_{2}h^2_4 & \delta^1_{3}h^2_4\end{bmatrix}

Now again if we stare at it for long enough we can see that this is simply (well maybe not that simple):

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^3} = \begin{bmatrix}\delta^1_{1}h^2_1 & \delta^1_{2}h^2_1 & \delta^1_{3}h^2_1 \\ \delta^1_{1}h^2_2 & \delta^1_{2}h^2_2 & \delta^1_{3}h^2_2 \\ \delta^1_{1}h^2_3 & \delta^1_{2}h^2_3 & \delta^1_{3}h^2_3\\ \delta^1_{1}h^2_4 & \delta^1_{2}h^2_4 & \delta^1_{3}h^2_4\end{bmatrix} = \begin{bmatrix}h^2_1\\h^2_2 \\h^2_3\\ h^2_4\end{bmatrix}\begin{bmatrix}\delta^1_{1}&\delta^1_{2}& \delta^1_{3}\end{bmatrix}=(\mathbf{h}^2)^T \delta^1

Finding \frac{\partial \mathbf{z}^3}{\partial \mathbf{b}^3}:

Again, since we are finding the change of a vector with respect to another vector we will have a Jacobian, although this one is substantially easier to calculate! As always we start by trying to find \frac{ \partial  \mathbf{z}^3}{ \partial b^3_{1}} then extend it from there:

\frac{\partial \mathbf{z}^3}{\partial b^3_{1}} = \begin{bmatrix}\frac{\partial z^3_1}{\partial b^3_{1}} \\\frac{\partial z^3_2}{\partial b^3_{1}} \\\frac{\partial z^3_3}{\partial b^3_{1}} \end{bmatrix} = \begin{bmatrix}\frac{\partial }{\partial b^3_{1}}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1) \\\frac{\partial }{\partial b^3_{1}}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2) \\ \frac{\partial }{\partial b^3_{1}}(h^2_1w^3_{1,3}+h^2_2w^3_{2,3}+h^2_3w^3_{3,3}+h^2_4w^3_{4,3}+b^3_3) \end{bmatrix} = \begin{bmatrix}1 \\0\\0 \end{bmatrix}

It shouldn’t be too hard to convince yourself that:

\frac{\partial \mathbf{z}^3}{\partial b^3_{2}} =  \begin{bmatrix} 0 \\1\\0 \end{bmatrix}

and:

\frac{\partial \mathbf{z}^3}{\partial b^3_{3}} =  \begin{bmatrix} 0 \\0 \\1 \end{bmatrix}

Then we concatenate the vectors together to form \frac{\partial \mathbf{z}^3}{\partial \mathbf{b}^3}:

\frac{\partial \mathbf{z}^3}{\partial \mathbf{b}^3} = \mathbf{I}

Therefore:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3}= \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{b}^3} = \delta^1 \mathbf{I}  = \delta^1

Let’s reiterate what we have derived:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \mathbf{\hat{y}} - \mathbf{y} =\delta^1 = \begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}= \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2} = \delta^1 (W^3)^T

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^3} =(\mathbf{h}^2)^T \delta^1

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3}=\delta^1

Hidden layer 1

The first hidden layer produces the inputs \mathbf{h}^2 to the second hidden layer where \mathbf{h}^2 is defined as:

\mathbf{h}^2 = \begin{bmatrix} \sigma(z^2_1) & \sigma(z^2_2)& \sigma(z^2_3)& \sigma(z^2_4) \end{bmatrix}

where: \sigma(\cdot) = \frac{1}{1+exp(-(\cdot))} and \sigma'(\cdot) = \sigma(\cdot)(1-\sigma(\cdot))

The first step is to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2} and express it using previously calculated values, like we did for \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}. To do this we first need to find \frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2}; after we have found it we can simply do:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2}= \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}\frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2}= \delta^1 (W^3)^T\frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2}

to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2}.

To find \frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2} we will use the same trick we used before and find \frac{\partial \mathbf{h}^2}{\partial z^2_1} first and work from there:

\frac{\partial \mathbf{h}^2}{\partial z^2_1}= \begin {bmatrix} \frac{\partial}{\partial z^2_1}\sigma(z^2_1) \\ \frac{\partial}{\partial z^2_1}\sigma(z^2_2) \\ \frac{\partial}{\partial z^2_1}\sigma(z^2_3) \\ \frac{\partial}{\partial z^2_1}\sigma(z^2_4) \end{bmatrix} = \begin {bmatrix} \sigma(z^2_1) (1 - \sigma(z^2_1)) \\ 0 \\0 \\0 \end{bmatrix}

If we did the same thing for z^2_2, z^2_3, z^2_4 and concatenated their vectors we would get:

\frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2} = \begin {bmatrix} \sigma(z^2_1)(1 - \sigma(z^2_1)) & 0 & 0 & 0  \\ 0 &  \sigma(z^2_2)(1 - \sigma(z^2_2)) & 0 & 0\\0 & 0 & \sigma(z^2_3)(1 - \sigma(z^2_3)) & 0 \\0 & 0 & 0 & \sigma(z^2_4)(1 - \sigma(z^2_4)) \end{bmatrix}

Then:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2}=  \delta^1 (W^3)^T\frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2} = \delta^1 (W^3)^T \circ \sigma'(\mathbf{z}^2) = \delta^2

where:

\sigma'(\mathbf{z}^2) = \begin{bmatrix} \sigma(z^2_1)(1 - \sigma(z^2_1)) & \sigma(z^2_2)(1 - \sigma(z^2_2))& \sigma(z^2_3)(1 - \sigma(z^2_3))& \sigma(z^2_4)(1 - \sigma(z^2_4)) \end{bmatrix}

Note here that \circ denotes the Hadamard product, which is an element-wise multiplication of two vectors/matrices with the same dimensions.
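As a sanity check, here is the same computation in NumPy with made-up values for \delta^1, W^3 and \mathbf{h}^2 (note that \sigma'(\mathbf{z}^2) can be computed directly from \mathbf{h}^2):

```python
import numpy as np

delta1 = np.array([-0.5, 0.3, 0.2])        # made-up delta^1 from the output layer
W3 = np.arange(12.0).reshape(4, 3) * 0.1   # made-up 4 x 3 weights between h^2 and z^3
h2 = np.array([0.5, 0.1, 0.9, 0.3])        # made-up sigmoid activations of hidden layer 2
delta2 = (delta1 @ W3.T) * h2 * (1 - h2)   # delta^1 (W^3)^T o sigma'(z^2), with sigma'(z^2) = h^2(1 - h^2)
```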

From here on out it’s all rinse and repeat; we just follow the same steps that we went through above! First, let’s define some variables!

\mathbf{z}^2 = \mathbf{h}^1W^2+\mathbf{b}^2

where:

\mathbf{h}^1 = \begin{bmatrix} h^1_1 & h^1_2 \end{bmatrix}

W^2 = \begin{bmatrix} w_{1,1}^2 & w_{1,2}^2 & w_{1, 3}^2 & w_{1, 4}^2 \\ w_{2,1}^2 & w_{2,2}^2 & w_{2,3}^2 & w_{2,4}^2 \end{bmatrix}

\mathbf{b}^2 = \begin{bmatrix} b^2_1 & b^2_2 & b^2_3 & b^2_4 \\\end{bmatrix}

So we know how to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2} so now our goal is to find:

  • \frac{\partial \mathbf{z}^2}{\partial \mathbf{h}^1}
  • \frac{\partial \mathbf{z}^2}{\partial W^2}
  • \frac{\partial \mathbf{z}^2}{\partial \mathbf{b}^2}

If we go through the same steps as we did above then we would get:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^1} = \delta^2 (W^2)^T

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^2} =(\mathbf{h}^1)^T \delta^2

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^2}=\delta^2

I will leave this for you to verify for yourself (as working through it yourself will provide you with much more value than me working through the example for you, though if you get stuck I’m more than happy to help!).

Input layer

The input layer produces the inputs \mathbf{h}^1 to the first hidden layer where \mathbf{h}^1 is defined as:

\mathbf{h}^1 = \begin{bmatrix} \sigma(z^1_1) & \sigma(z^1_2) \end{bmatrix}

Like last time we have to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^1} and express it using previously calculated values.

Then you should end up with something like this:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^1}=\delta^2 (W^2)^T \circ \sigma'(\mathbf{z}^1) = \delta^3

Where:

\sigma'(\mathbf{z}^1) = \begin{bmatrix} \sigma(z^1_1)(1 - \sigma(z^1_1)) & \sigma(z^1_2)(1 - \sigma(z^1_2)) \end{bmatrix}

You are so close to finally finishing one iteration of the backpropagation algorithm! For this last part we will again need to define some variables!

Let:

\mathbf{z}^1 = \mathbf{x}W^1+\mathbf{b}^1

Where:

\mathbf{x} = \begin{bmatrix} x_1 & x_2 \end{bmatrix}

W^1 = \begin{bmatrix} w_{1,1}^1 & w_{1,2}^1 \\ w_{2,1}^1 & w_{2,2}^1 \end{bmatrix}

\mathbf{b}^1 = \begin{bmatrix} b^1_1 & b^1_2 \end{bmatrix}

As we already know \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^1} our goal is to find:

  • \frac{\partial \mathbf{z}^1}{\partial \mathbf{x}}
  • \frac{\partial \mathbf{z}^1}{\partial W^1}
  • \frac{\partial \mathbf{z}^1}{\partial \mathbf{b}^1}

This is what you should be getting:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{x}} = \delta^3 (W^1)^T

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^1} =(\mathbf{x})^T \delta^3

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^1}=\delta^3

Summary of gradients

Output layer:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \mathbf{\hat{y}} - \mathbf{y} =\delta^1

Hidden layer 2:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}=\delta^1 (W^3)^T

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^3} =(\mathbf{h}^2)^T \delta^1

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3}=\delta^1

Hidden layer 1:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2}=\delta^1 (W^3)^T \circ \sigma'(\mathbf{z}^2) = \delta^2

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^1} = \delta^2 (W^2)^T

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^2} =(\mathbf{h}^1)^T \delta^2

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^2}=\delta^2

Input layer:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^1}=\delta^2 (W^2)^T \circ \sigma'(\mathbf{z}^1) = \delta^3

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{x}} = \delta^3 (W^1)^T

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^1} =(\mathbf{x})^T \delta^3

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^1}=\delta^3

There are a few interesting observations that can be made, assuming that we have a neural network with L layers where layer L is the output layer and layer 1 is the input layer \mathbf{x} (so to clarify \mathbf{input}^1 = \mathbf{x} and \mathbf{input}^2 = \mathbf{h}^1 and so on) then for all layers l:

  1. \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{input}^l}=\delta^{L-l} (W^{l})^T
  2. \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^l}= (\mathbf{input}^l)^T \delta^{L-l}
  3. \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^l}= \delta^{L-l}
  4. The abstraction step is always made for the gradient of the cost function with respect to the output of a layer. For example, \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^1} = \delta^3; note how \mathbf{z}^1 is the output of the input layer. This is because the gradient of the cost function with respect to the output of a layer is used in the expressions for the gradients of the cost function with respect to the weights, biases and inputs of that layer.
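These observations are enough to write the whole single-sample backward pass as one loop. The sketch below is my own (the function name, helper code and toy shapes are assumptions, not something from the post), but each line maps onto one of the formulas above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Single-sample backward pass for a stack of sigmoid layers with a
    softmax + cross entropy output. weights/biases are ordered from the
    input layer to the output layer."""
    # Forward pass, remembering the input of every layer.
    layer_inputs, a = [], x
    for W, b in zip(weights[:-1], biases[:-1]):
        layer_inputs.append(a)
        a = sigmoid(a @ W + b)
    layer_inputs.append(a)                          # input to the output layer
    z = a @ weights[-1] + biases[-1]
    yhat = np.exp(z - z.max()); yhat /= yhat.sum()  # softmax

    grads, delta = [], yhat - y                     # delta at the output layer
    for W, a_in in zip(reversed(weights), reversed(layer_inputs)):
        dW = np.outer(a_in, delta)                  # (input^l)^T delta
        db = delta.copy()                           # bias gradient equals delta
        # Push delta back one layer; every hidden input is a sigmoid output,
        # so sigma'(z) = a_in * (1 - a_in). The final value (for x itself) is unused.
        delta = (delta @ W.T) * a_in * (1 - a_in)
        grads.append((dW, db))
    return list(reversed(grads))                    # ordered input layer first

# Shapes matching the four layer network: x (2) -> h^1 (2) -> h^2 (4) -> yhat (3)
rng = np.random.default_rng(0)
weights = [rng.normal(size=s) for s in [(2, 2), (2, 4), (4, 3)]]
biases = [np.zeros(W.shape[1]) for W in weights]
grads = backprop(np.array([0.3, -0.7]), np.array([0.0, 1.0, 0.0]), weights, biases)
```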
Extension to using more than 1 sample in the backward pass

Extending the backpropagation algorithm to take more than one sample is relatively straightforward; the beauty of using matrix notation is that we don’t really have to change anything! As an example, let’s run the backward pass using 3 samples instead of 1 on the output layer and hidden layer 2.

For \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \begin{bmatrix} \mathbf{\hat{y}}_1 - \mathbf{y}_1 \\  \mathbf{\hat{y}}_2 - \mathbf{y}_2  \\\mathbf{\hat{y}}_3 - \mathbf{y}_3 \end{bmatrix}= \begin{bmatrix} \delta^1_{1,1} & \delta^1_{1,2} & \delta^1_{1,3} \\ \delta^1_{2,1} &  \delta^1_{2,2} &  \delta^1_{2,3} \\ \delta^1_{3,1} & \delta^1_{3,2} & \delta^1_{3,3} \end{bmatrix} = \delta^1

For \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}=\delta^1 (W^3)^T  =\begin{bmatrix} \delta^1_{1,1} & \delta^1_{1,2} & \delta^1_{1,3} \\ \delta^1_{2,1} &  \delta^1_{2,2} &  \delta^1_{2,3} \\ \delta^1_{3,1} & \delta^1_{3,2} & \delta^1_{3,3} \end{bmatrix}\begin{bmatrix} w_{1,1}^3 & w_{2,1}^3 & w_{3,1}^3 & w_{4,1}^3 \\ w_{1,2}^3 & w_{2,2}^3 & w_{3,2}^3 & w_{4,2}^3 \\ w_{1,3}^3 & w_{2,3}^3 & w_{3,3}^3 & w_{4,3}^3\end{bmatrix}

For \frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^3}:

\frac{\partial \mathbf{H(y, \hat{y}})}{\partial W^3} =(\mathbf{h}^2)^T \delta^1 =\begin{bmatrix} h^2_{1,1} & h^2_{2, 1} & h^2_{3, 1} \\h^2_{1,2} & h^2_{2, 2} & h^2_{3, 2} \\h^2_{1,3} & h^2_{2, 3} & h^2_{3, 3} \\h^2_{1,4} & h^2_{2, 4} & h^2_{3, 4} \\ \end{bmatrix}\begin{bmatrix} \delta^1_{1,1} & \delta^1_{1,2} & \delta^1_{1,3} \\ \delta^1_{2,1} &  \delta^1_{2,2} &  \delta^1_{2,3} \\ \delta^1_{3,1} & \delta^1_{3,2} & \delta^1_{3,3} \end{bmatrix}

Here we see that:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial w^3_{1,1}} = h^2_{1, 1} \delta^1_{1, 1} + h^2_{2, 1} \delta^1_{2, 1} + h^2_{3, 1} \delta^1_{3, 1}

If we think about it, this makes sense: each term in the sum is how much the cost function changes with respect to w^3_{1, 1} for one particular sample. For example, h^2_{1, 1} \delta^1_{1, 1} is how much the cost function changes with respect to w^3_{1, 1} for sample 1. Since we have 3 samples, we sum the per-sample changes to find the total change in the cost function with respect to w^3_{1, 1}.

For \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3}:

This is the only term that changes when we extend from 1 sample to more than one. Before I show you the answer, let's first take a look at \delta^1:

\delta^1=\begin{bmatrix} \delta^1_{1,1} & \delta^1_{1,2} & \delta^1_{1,3} \\ \delta^1_{2,1} &  \delta^1_{2,2} &  \delta^1_{2,3} \\ \delta^1_{3,1} & \delta^1_{3,2} & \delta^1_{3,3} \end{bmatrix}

Now, remember from before that \delta^1_{1, 1} is how much the cost function changes with respect to b^3_1 for sample 1, \delta^1_{2, 1} is how much it changes with respect to b^3_1 for sample 2, and so on. Since the bias is shared across all samples, to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial b^3_1} we simply sum up the first column of \delta^1; this adds up how much the cost function changes with respect to b^3_1 for each sample.

Therefore we end up with:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3} = \begin{bmatrix} \sum_{i=1}^{3}\delta^1_{i, 1} & \sum_{i=1}^{3}\delta^1_{i, 2} & \sum_{i=1}^{3}\delta^1_{i, 3} \end{bmatrix}

From here it should be relatively simple to repeat what I have done for hidden layer 1 and the input layer.
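
As a quick sanity check of the batch case, here is a small NumPy sketch of the output-layer gradients for 3 samples. The sizes (4 units in \mathbf{h}^2, 3 output units) follow the matrices above; the actual numbers are placeholders, and I omit the bias in the forward step for brevity.

import numpy as np

rng = np.random.default_rng(1)
h2 = rng.normal(size=(3, 4))       # 3 samples, 4 units in hidden layer 2
W3 = rng.normal(size=(4, 3))       # weights of the output layer
y = rng.normal(size=(3, 3))        # 3 samples, 3 targets each
y_hat = h2 @ W3                    # placeholder predictions

delta1 = y_hat - y                 # dH/dz^3, one row per sample          (3 x 3)
dH_dh2 = delta1 @ W3.T             # dH/dh^2 = delta^1 (W^3)^T            (3 x 4)
dH_dW3 = h2.T @ delta1             # dH/dW^3 = (h^2)^T delta^1            (4 x 3)
dH_db3 = delta1.sum(axis=0)        # sum each column of delta^1 over the 3 samples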

Conclusion

This post turned out to be a lot longer than I expected, but I hope that you have learned how backpropagation works and that you are able to apply it to a neural network of arbitrary depth! This is my first blog post, so I'm sure there are things I have not explained clearly; please don't hesitate to tell me so that I can become a better writer 🙂

 

What is a neural network?

Introduction

Before we start to talk about what a neural network is or how it works, let's first understand how the brain works. The reason is that the architecture of a neural network is based on how the brain works, and I personally find it a lot easier to understand an abstract concept when I can compare it with something physical! The brain works because of the neurons it contains, which work together and communicate with each other. These neurons are what enable us to perform the seemingly simple, yet actually complex, tasks we execute every day.

What is a neuron?

Screen Shot 2016-12-26 at 11.02.43 pm.png
Figure 1: Diagram of a single neuron

The dendrites, or “tentacles,” of a neuron pick up information in the form of an electrical nerve impulse and conduct it toward the soma, or cell body. The nerve impulse travels through the soma and is then conducted down a threadlike fiber called the axon. At the end of the axon, if the activation energy is reached, the impulse jumps over a gap called a synapse to reach the dendrites of the next cell; otherwise it fails to make the jump and the electrical impulse isn't transmitted.

Although the diagram above only shows one neuron, an average person has roughly 100 billion neurons. Some neurons in the brain have up to 200,000 dendrites and axon terminals. Imagine listening to 200,000 people while simultaneously talking to another 200,000 people; wouldn't that be tiring!

What is a neural network?

A neural network is simply a number of neurons (see figure 2) that are connected together. These neurons are connected in such a way that they form layers, so the outputs from a layer of neurons are passed on as inputs to the neurons in the next layer.

Structure of a neural network

FullSizeRender-2.jpg
Figure 2: Two-layer neural network

Notation:

  • x_1, x_2, x_3 are the inputs to the neural network
  • b^l_i is read as the bias (or fixed value) that node i in layer l has
  • w^l_{i,j} is the weight of the edge connecting node i in layer l - 1 to node j in layer l
  • l^l_i is the linear node i in layer l which aggregates the information from the inputs to the layer
  • s^l_i is the output node i for layer l, or equivalently, the input node i for layer l + 1. It may seem a bit weird to use s for the output node, but there is a reason for this, and it is related to the sigmoid function.

Don't worry too much about the notation; I am introducing it here so that you will be comfortable with it when we start to represent the neural network as matrices.

Structure of a single neuron (in a neural network)

FullSizeRender-4.jpg
Figure 3: A single neuron

Figure 3 shows a neuron color-coded in green. A neuron is made up of two parts: an input layer, which takes the information from the incoming neurons, and a hidden layer that consists of two hidden nodes. The two layers are connected by edges; in the case above, these are the green edges.

The first hidden node is called a linear node (l^1_1), which aggregates the input. The output of the linear node is then passed into the second node, called the activation node (s^1_1), which applies a function to the output of the linear node. The blue and orange edges are the connections to this neuron's outbound neurons, with each color signifying a different neuron.

Layers in a neural network

FullSizeRender-6.jpg
Figure 4: The first layer of the neural network

 

Input layer:

The input layer (which is a set of input nodes) is where the green and pink neurons in the first layer receive their input data. The input layer has 4 nodes: the first three nodes, x_1, x_2, x_3, are the inputs to the neurons in the layer. The last node, b^1, can be thought of as an individual base value that each neuron in the layer has.

In the example above, the first layer was chosen for simplicity's sake, and as a result the input layer also happens to be the input to the neural network. This doesn't have to be the case, though: consider the outputs of the neurons in the first layer, s^1_1, s^1_2, which are inputs to the blue and orange neurons in the second layer.

We can represent inputs to a layer as:

  • \mathbf{x}  = \begin{bmatrix} x_{1} & x_2 & x_3 \end{bmatrix}

With the bias for the each of the nodes in the linear layer as:

  • b^1 = \begin{bmatrix} b^1_{1} & b^1_{2} \end{bmatrix}

Where b^1_1 is the base value that l^1_1 has.

Hidden layer: 

The hidden layer is where all the magic happens; this is what makes a neural network so fundamentally different from other machine learning techniques (I will elaborate on why in a future post).

There are two parts that make up a hidden layer (which is just a group of hidden nodes). The first part is the linear layer, which aggregates the values of the inputs to the hidden layer. The second is an activation layer, which decides whether or not information should be transmitted to the next neuron.

Part 1: The linear layer

The linear layer is made up of a group of linear nodes. Each node computes a linear combination (for example, z = ax + by + c is a linear combination) of the inputs, which are common to all the neurons in the layer, and an individual set of weights that each neuron in the layer possesses. In figure 4, we see that the inputs x_1, x_2, x_3 feed into both the green and pink neurons, while each neuron applies its own set of weights to them.

Since we have two neurons in each of the two hidden layers (green and pink for the first layer, and blue and orange for the second layer) we will have two linear combinations in each of the two layers.

The linear layer for hidden layer 1 is:

  • l^1_1 = x_1w^1_{1,1} + x_2w^1_{2,1} + x_3w^1_{3,1} + b^1_1
  • l^1_2 = x_1w^1_{1,2} + x_2w^1_{2,2} + x_3w^1_{3,2} + b^1_2

The linear layer for hidden layer 2 is:

  • l^2_1 = s^1_1w^2_{1,1} + s^1_2w^2_{2,1} + b^2_1
  • l^2_2 = s^1_1w^2_{1,2} + s^1_2w^2_{2,2} + b^2_2

These can be represented in matrix notation like so:

  • l^1 = \begin{bmatrix} l^1_1 & l^1_2 \end{bmatrix}
  • l^2 = \begin{bmatrix} l^2_1 & l^2_2 \end{bmatrix}

Using the brain analogy, the dendrites in the neuron can be seen as the input layer, and the electrical impulse travelling over the cell body can be seen as inputs being multiplied by the individual set of weights corresponding to each neuron.
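
If it helps to see the linear layer as code, here is a tiny NumPy sketch of l^1 = [l^1_1, l^1_2]. The numbers are made up, and I have arranged W1 so that column 1 holds the green neuron's weights and column 2 the pink neuron's.

import numpy as np

x = np.array([[1.0, 2.0, 3.0]])    # the inputs x_1, x_2, x_3 as a row vector
W1 = np.array([[0.1, 0.4],         # column j holds the weights of neuron j in the layer
               [0.2, 0.5],
               [0.3, 0.6]])
b1 = np.array([[0.05, -0.05]])     # one bias per linear node

l1 = x @ W1 + b1                   # l1 = [l^1_1, l^1_2]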

Part 2: The activation layer

The second part is an activation layer. In this layer we apply some function to the values that are output from the linear layer. The function usually squashes its input, and it is called the activation layer because each node in the layer loosely mimics the axon terminals and the electrical impulse jumping over the gap to the next neuron.

Each node in the activation layer can be thought of as an axon terminal. The only difference is that the electrical impulse travelling through the axon terminal has a binary outcome, which can be thought of as a step function (0 for failing to make the jump, and 1 for making the jump).

Screen Shot 2016-12-30 at 2.07.35 am.png
Figure 5: Unit step function

 

The activation function used in this neural network, the sigmoid function, can take any value between 0 (the electrical impulse doesn't make it to the next neuron) and 1 (the electrical impulse makes it to the next neuron). The main reason for approximating the step function is that when the neural network is trained, it uses a technique called gradient descent to find the optimal biases and weights, and this technique only works on smooth, differentiable functions, which the step function is not.

The sigmoid function:

  • \sigma(z) = \frac{1}{1+e^{-z}}
Screen Shot 2016-12-27 at 1.39.20 am.png
Figure 6: Plot of sigmoid function
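
For reference, here is a short sketch of the sigmoid alongside the step function it approximates; unlike the step function, the sigmoid has a well-defined slope everywhere, which is what gradient descent relies on.

import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)   # binary: the impulse either jumps or it doesn't

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # smooth values between 0 and 1

def sigmoid_prime(z):
    s = sigmoid(z)                      # the sigmoid's slope, finite everywhere
    return s * (1.0 - s)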

The activation layer for hidden layer 1 is:

  • s^1_1 = \sigma(l^1_1)
  • s^1_2 = \sigma(l^1_2)

The activation layer for hidden layer 2 is:

  • s^2_1 = \sigma(l^2_1)
  • s^2_2 = \sigma(l^2_2)

These can be represented in matrix notation like so:

  • s^1 = \begin{bmatrix}\sigma(l^1_1) &\sigma(l^1_2) \end{bmatrix}
  • s^2 = \begin{bmatrix}\sigma(l^2_1) &\sigma(l^2_2) \end{bmatrix}

Output layer: 

FullSizeRender-7.jpg
Figure 7: Output layer of the neural network

The output layer (although it really is just a single node in our case) for the neural network above is the value that is predicted by the neural network.

In the neural network the output layer is defined as:

  • l^3_1 = s^2_1 w^3_{1} + s^2_2 w^3_{2} + b^3_1
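
Putting the pieces together, here is a rough NumPy sketch of the full forward pass for the network in figure 2 (3 inputs, two hidden layers of 2 neurons each, and a single output node); the weights are random placeholders of my own choosing.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
x = np.array([[1.0, 2.0, 3.0]])                     # x_1, x_2, x_3
W1, b1 = rng.normal(size=(3, 2)), np.zeros((1, 2))
W2, b2 = rng.normal(size=(2, 2)), np.zeros((1, 2))
W3, b3 = rng.normal(size=(2, 1)), np.zeros((1, 1))

l1 = x @ W1 + b1      # linear layer of hidden layer 1
s1 = sigmoid(l1)      # activation layer of hidden layer 1
l2 = s1 @ W2 + b2     # linear layer of hidden layer 2
s2 = sigmoid(l2)      # activation layer of hidden layer 2
l3 = s2 @ W3 + b3     # output layer: the network's prediction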

Summary

To reinforce the concept of how information flows through the neural network, I will expand the output term and connect it with the color-coded neural network.

FullSizeRender-8.jpg
Figure 8: Expansion of the output term

When information flows into the neural network, the two neurons (green and pink) in the first layer receive and process it. The outputs of the green and pink neurons are then used as the inputs to the two neurons (blue and orange) in the second layer. Finally, the outputs of the blue and orange neurons are used as the input to the output neuron.

The important concept to understand here is that the input to some layer l is a function of the outputs of the neurons in the previous layer (and, through them, of all l - 1 earlier layers). For example, the input to l^3_1 (the output layer) is a function of s^2_1 and s^2_2, which are the outputs of the blue and orange neurons. In turn, s^2_1 and s^2_2 are themselves functions of s^1_1 and s^1_2, which are the outputs of the green and pink neurons.

Another thing to be aware of is that we are not limited to neural networks of just two layers. To make the neural network three layers deep, we could use s^2_1 and s^2_2 as inputs to a third hidden layer! Moreover, you should be able to convince yourself that we could theoretically have as many layers as we want by using the outputs of each layer as the inputs to the next.

Hopefully, you now have a better understanding of how information flows from the start of the network to the end. Most importantly though, I hope that you are able to intuitively understand how a neural network works by thinking about it in terms of how the brain works!