Deriving backpropagation using diagrams

Introduction

After writing my first blog post on deriving batch backpropagation, it always bothered me that the post seemed to have too few diagrams. So, in this post, my goal is to derive batch backpropagation using as many diagrams as possible.

First, let me introduce you to a diagram that we will be using for the remainder of the post.

Forward pass.png
Figure 1: One layer feed-forward neural network with squared loss, and a batch size of 2

It might not seem obvious at first, but the diagram above shows a one layer feed-forward neural network with a batch size of two (denoted by the light and dark red colors) that outputs a single value, which is scored by the squared error loss. Pictures probably explain it better than I do, though. Below is a diagram of the first row of the data matrix X multiplied by the first column of the weight matrix W^{1}.

Forward pass only 1.png
Figure 2: First row of data matrix, multiplied by the first column of the first layer of weights

I’m going to assume that you already have a good idea of what happens in the forward pass, so the diagram above should (hopefully) make sense (though if not, please comment down below and I will explain further!). Now, before we get into deriving backpropagation, let’s first define some notation.

Notation

Input variables:

X = \begin{bmatrix} x_{1, 1} & x_{1, 2} \\ x_{2, 1} & x_{2, 2} \end{bmatrix}

First layer weights:

W^{1} = \begin{bmatrix} w^{1}_{1, 1} & w^{1}_{1, 2} &w^{1}_{1, 3} \\ w^{1}_{2, 1} & w^{1}_{2, 2} & w^{1}_{2, 3} \end{bmatrix}

First layer bias:

B^{1} = \begin{bmatrix} b^{1}_1 & b^{1}_2 & b^{1}_3 \\b^{1}_1 & b^{1}_2 & b^{1}_3 \end{bmatrix}

First layer linear combination:

L^{1} = XW^{1} + B^{1}= \begin{bmatrix} l^{1}_{1,1} & l^{1}_{1, 2} & l^{1}_{1, 3} \\ l^{1}_{2, 1} & l^{1}_{2, 2} & l^{1}_{2, 3} \end{bmatrix} = \begin{bmatrix} x_{1, 1} w^{1}_{1, 1} + x_{1, 2} w^{1}_{2, 1}+b^{1}_1 & x_{1, 1} w^{1}_{1, 2} + x_{1, 2} w^{1}_{2, 2}+b^{1}_2 & x_{1, 1} w^{1}_{1, 3} + x_{1, 2} w^{1}_{2, 3}+b^{1}_3 \\ x_{2, 1} w^{1}_{1, 1} + x_{2, 2} w^{1}_{2, 1}+b^{1}_1 & x_{2, 1} w^{1}_{1, 2} + x_{2, 2} w^{1}_{2, 2}+b^{1}_2 & x_{2, 1} w^{1}_{1, 3} + x_{2, 2} w^{1}_{2, 3}+b^{1}_3 \end{bmatrix}

First hidden layer:

H^{1} =\begin{bmatrix} \sigma(l^{1}_{1,1}) & \sigma(l^{1}_{1, 2}) &\sigma(l^{1}_{1, 3}) \\ \sigma(l^{1}_{2, 1}) & \sigma(l^{1}_{2, 2}) & \sigma(l^{1}_{2, 3}) \end{bmatrix}

Output layer weights:

W^{2} = \begin{bmatrix} w^{2}_{1, 1} \\ w^{2}_{2,1} \\ w^{2}_{3,1} \end{bmatrix}

Output layer bias:

B^{2} = \begin{bmatrix} b^{2}_1 \\ b^{2}_1\end{bmatrix}

Output layer:

\hat{Y} = H^{1}W^{2} + B^{2} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2  \end{bmatrix}

Loss:

L(\hat{y}_{1,1}, y_{1,1}, \hat{y}_{2,1}, y_{2,1}) = \frac{1}{2}\sum_{i=1}^{2}(\hat{y}_{i,1} - y_{i,1})^{2}
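To make the notation concrete, here is a small NumPy sketch of the forward pass for this exact network. The variable names, the random placeholder numbers, and the use of broadcasting for the repeated bias rows of B^{1} and B^{2} are my own choices, not part of the derivation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

X = rng.normal(size=(2, 2))    # data matrix: 2 examples (the batch), 2 features
W1 = rng.normal(size=(2, 3))   # first layer weights W^1
b1 = rng.normal(size=(1, 3))   # first layer bias; B^1 is just b1 broadcast over the batch
W2 = rng.normal(size=(3, 1))   # output layer weights W^2
b2 = rng.normal(size=(1, 1))   # output layer bias; B^2 is b2 broadcast over the batch
Y = rng.normal(size=(2, 1))    # targets y_{1,1}, y_{2,1}

L1 = X @ W1 + b1               # first layer linear combination L^1, shape (2, 3)
H1 = sigmoid(L1)               # first hidden layer H^1, shape (2, 3)
Y_hat = H1 @ W2 + b2           # output layer, shape (2, 1)

loss = 0.5 * np.sum((Y_hat - Y) ** 2)   # squared error loss over the batch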

Deriving backpropagation by following lines

Before we start getting into the details, we should first understand how backpropagation works from a high level. The goal of backpropagation is to calculate how much the final error value changes given a change in some node (be it a weight node or a hidden variable node).

This is done through the accumulation of gradients starting from the root node, our loss node, and flowing to the leaf nodes (and any nodes along the way) that we are interested in, usually our weight and hidden variable nodes. These gradients are represented as arrows in the diagrams that are shown below.

In fact, all that you need to be able to do to understand backpropagation is to follow the arrows and multiply their values along the way until you reach the end. This is the approach that we will be taking: we will slowly build up arrows that lead from the loss node to our input variable nodes.

During backpropagation there are 5 steps that you will need to follow, assuming that we have a neural network with L layers:

  1. Assuming we are on layer l in the backward pass, use the accumulated gradients from the previous layers, which we will denote by the matrix \mathbf{\delta^{L - l}}, to calculate the accumulated gradients for the weights of the current layer
  2. Using \mathbf{\delta^{L - l}}, calculate the accumulated gradients for the biases of the current layer
  3. Using \mathbf{\delta^{L - l}}, calculate the accumulated gradients of the inputs of the current layer
  4. Using the result from the previous step, calculate the accumulated gradients of the inputs of the current layer before the non-linearity, and then abstract these values into a matrix \mathbf{\delta^{L - (l - 1)}}
  5. Go back to step 1, except now we are on layer l - 1

The loss layer: setting up \mathbf{\delta^{1}}

As mentioned above, each of the arrows represents the change in the root node with respect to the leaf node(s). For example, in figure 3, the root node is our loss function, and the leaf nodes are our predicted values.

Loss layer.png
Figure 3: The Loss layer

As we are using the squared error loss, the derivative of our loss function with respect to our predicted values \hat{Y} is relatively simple. So, by doing some simple math, we should be able to calculate:

\frac{\partial{L(\hat{y}_{1,1}, y_{1,1}, \hat{y}_{2,1}, y_{2,1})}}{\partial{\hat{Y}}} = \begin{bmatrix} \frac{\partial{L(\hat{y}_{1,1}, y_{1,1}, \hat{y}_{2,1}, y_{2,1})}}{\partial{\hat{y_{1,1}}}} \\ \frac{\partial{L(\hat{y}_{1,1}, y_{1,1}, \hat{y}_{2,1}, y_{2,1})}}{\partial{\hat{y_{2,1}}}} \end{bmatrix} = \begin{bmatrix} \hat{y_{1,1}} - y_{1,1} \\ \hat{y_{2,1}} - y_{2,1} \end{bmatrix}

Now we make the abstraction, I will denote:

\mathbf{\delta^1} = \begin{bmatrix}\delta^{1}_{1,1} \\ \delta^{1}_{2, 1}\end{bmatrix} = \begin{bmatrix} \hat{y_{1,1}} - y_{1,1} \\ \hat{y_{2,1}} - y_{2,1} \end{bmatrix}

So let’s recap what \mathbf{\delta^1} means. Each row of \mathbf{\delta^1} represents the derivative of the loss function with respect to one of our predicted values. Another way to think about it is as the accumulated gradients of the loss function up to the nodes; in our case \hat{y_{1,1}} and \hat{y_{2,1}}. In fact, the latter interpretation is fundamental to deriving the backpropagation algorithm, as the \mathbf{\delta^{L-l}} matrix holds all of the accumulated gradients from previous layers and allows us to abstract away those previous calculations! If you don’t get what I mean, keep reading and hopefully it will become clearer.
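Just to make this concrete, here is what \mathbf{\delta^1} looks like in NumPy (the numbers are placeholders I made up):

import numpy as np

# Placeholder predictions and targets for a batch of 2 (made-up numbers).
Y_hat = np.array([[0.9], [0.2]])
Y = np.array([[1.0], [0.0]])

# delta^1: the accumulated gradient of the squared error loss up to each prediction.
delta1 = Y_hat - Y    # shape (2, 1), one row per example in the batch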

Alright sweet, now let us move onwards! The next step will be to calculate the gradients for the weights.

Step 1: Calculating the accumulated gradients for W^{2}

To start, we won’t calculate the accumulated gradients for all the output weights at once; rather, we will start by only considering w^{2}_{1,1}. Once we have worked through one example, calculating the accumulated gradients for w^{2}_{2,1} and w^{2}_{3,1} is very similar.

Layer 2 weight only 1.png
Figure 4: Accumulated gradients for w^2_{1, 1}

In figure 4, we see that for w^{2}_{1,1}, there are two arrows leading into it, one from \hat{y_{1,1}} and one from \hat{y_{2,1}}. If you haven’t already, you should try to understand why this is happening by yourself before reading the explanation!

Now for the explanation… Since w^{2}_{1,1} is used to calculate both \hat{y_{1,1}} and \hat{y_{2,1}}, a change in w^{2}_{1,1} would change both \hat{y_{1,1}} and \hat{y_{2,1}}! Therefore, when w^{2}_{1,1} changes, it is able to change the final loss in two ways, through \hat{y_{1,1}} and \hat{y_{2,1}}, and as a result we have two arrows (representing gradients) heading into w^{2}_{1,1}.

Now, looking at the diagram, we can follow the arrows from the loss node that lead to the weight node. We can see that the accumulated gradient for w^{2}_{1,1} is simply:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{1,1}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \frac{\partial{\hat{y}_{1, 1}}}{\partial{w^{2}_{1,1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \frac{\partial{\hat{y}_{2, 1}}}{\partial{w^{2}_{1,1}}}

= (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,1})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,1})

Now, if we follow the same steps and trace the arrows that lead from the loss node to w^{2}_{2,1} and w^{2}_{3,1}, we can see that:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{i,1}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \frac{\partial{\hat{y}_{1, 1}}}{\partial{w^{2}_{i,1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \frac{\partial{\hat{y}_{2, 1}}}{\partial{w^{2}_{i,1}}}

= (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,i})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,i})

Now, if we do this for all the weights, we end up with the diagram below!

Layer 2 weights.png
Figure 5: Accumulated gradients for all the output weights
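As a quick sanity check, here is a tiny sketch that literally follows the two arrows into each output weight (placeholder numbers and variable names of my own):

import numpy as np

H1 = np.array([[0.5, 0.7, 0.1],     # hidden activations, one row per example
               [0.3, 0.9, 0.6]])
delta1 = np.array([[ 0.4],          # y_hat_1 - y_1
                   [-0.2]])         # y_hat_2 - y_2

# Two arrows lead into w^2_{1,1}: one through y_hat_1 and one through y_hat_2.
grad_w2_11 = delta1[0, 0] * H1[0, 0] + delta1[1, 0] * H1[1, 0]

# The same "follow both arrows" sum for every output weight w^2_{i,1}.
grad_W2 = np.array([[sum(delta1[b, 0] * H1[b, i] for b in range(2))]
                    for i in range(3)])    # shape (3, 1), like W^2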

 

Step 2: Calculating the accumulated gradients for B^{2}

To calculate the accumulated gradients for our bias, we simply follow the same procedure that we used for the weights!

Screen Shot 2017-10-28 at 1.50.39 AM.png
Figure 6: Accumulated gradients for b^{2}_{1}

Again, there are two arrows pointing into b^{2}_{1}, as changing b^{2}_{1} would change both \hat{y_{1, 1}} and \hat{y_{2, 1}}. Now to calculate the accumulated gradients we simply add up the gradients (like before!).

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{b^{2}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \frac{\partial{\hat{y}_{1, 1}}}{\partial{b^{2}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \frac{\partial{\hat{y}_{2, 1}}}{\partial{b^{2}}}

= (\hat{y}_{1, 1} - y_{1, 1})(1)+(\hat{y}_{2, 1} - y_{2, 1})(1)

= (\hat{y}_{1, 1} - y_{1, 1}) +(\hat{y}_{2, 1} - y_{2, 1})
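In code, this is just the sum of the two deltas (placeholder numbers again):

import numpy as np

delta1 = np.array([[ 0.4],    # y_hat_1 - y_1
                   [-0.2]])   # y_hat_2 - y_2

# Both arrows into the output bias carry a local derivative of 1, so we just add the deltas.
grad_b2 = delta1[0, 0] + delta1[1, 0]   # scalar
grad_b2_vec = delta1.sum(axis=0)        # the same value, summed over the batch dimension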

Now let’s move on to calculating the accumulated gradients with respect to the first hidden layer!

Step 3: Calculating the accumulated gradients for H^{1}

To calculate the accumulated gradients for the hidden layer, we again just follow the lines!

Layer 2 hidden.png
Figure 7: Accumulated gradients for H^{1}

Looking at figure 7, we can see that:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{i,j}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{i,1}}} \frac{\partial{\hat{y}_{i,1}}}{\partial{h^{1}_{i,j}}}=(\hat{y_{i,1}}-y_{i,1})(w^{2}_{j,1})
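And in code, since each hidden activation only feeds its own example’s prediction, there is a single arrow (one product) per node (placeholder values of my own):

import numpy as np

W2 = np.array([[0.2], [0.5], [-0.3]])   # output weights w^2_{j,1}, shape (3, 1)
delta1 = np.array([[ 0.4],
                   [-0.2]])             # y_hat_i - y_i per example

# dL/dh^1_{i,j} = (y_hat_i - y_i) * w^2_{j,1}: one arrow, one product per hidden node.
grad_H1 = np.array([[delta1[i, 0] * W2[j, 0] for j in range(3)]
                    for i in range(2)])  # shape (2, 3), like H^1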

Summary of the accumulated gradients for the output layer

So we have successfully calculated the accumulated gradients for the output layer! Here is how our diagram looks!

Screen Shot 2017-10-28 at 2.06.38 AM
Figure 8: The loss and output layer

Now, let’s take a look at the equations for the accumulated gradients that we derived in the previous sections:

The weights:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{i,1}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \frac{\partial{\hat{y}_{1, 1}}}{\partial{w^{2}_{i,1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \frac{\partial{\hat{y}_{2, 1}}}{\partial{w^{2}_{i,1}}}

= (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,i})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,i})

The bias:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{b^{2}_1}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \frac{\partial{\hat{y}_{1, 1}}}{\partial{b^{2}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \frac{\partial{\hat{y}_{2, 1}}}{\partial{b^{2}}}

= (\hat{y}_{1, 1} - y_{1, 1}) +(\hat{y}_{2, 1} - y_{2, 1})

The hidden variables:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{i,j}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{i,1}}} \frac{\partial{\hat{y}_{i,1}}}{\partial{h^{1}_{i,j}}}

=(\hat{y_{i,1}}-y_{i,1})(w^{2}_{j,1})

Now, we can see that \hat{y}_{1, 1} - y_{1, 1} and \hat{y}_{2, 1} - y_{2, 1} appear in all the equations. So we know that our equations will involve:

\mathbf{\delta^1} = \begin{bmatrix}\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{1, 1}}} \\\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{2, 1}}} \end{bmatrix} = \begin{bmatrix} \hat{y_{1, 1}} - y_{1, 1} \\ \hat{y_{2, 1}} - y_{2, 1} \end{bmatrix}

Looking at the accumulated gradients for the weights W^{2} we can see that:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{W^2}} = \begin{bmatrix} \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{1,1}}} \\ \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{2,1}}} \\ \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{2}_{3,1}}}\end{bmatrix}

=\begin{bmatrix} (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,1})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,1}) \\ (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,2})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,2}) \\ (\hat{y}_{1, 1} - y_{1, 1})(h^1_{1,3})+(\hat{y}_{2, 1} - y_{2, 1})(h^1_{2,3}) \end{bmatrix}

= \begin{bmatrix} h^1_{1, 1} & h^1_{2, 1} \\ h^1_{1, 2} & h^1_{2, 2}\\ h^1_{1, 3} & h^1_{2, 3} \end{bmatrix} \begin{bmatrix} \hat{y_{1, 1}} - y_{1, 1} \\ \hat{y_{2, 1}} - y_{2, 1} \end{bmatrix}

= (H^{1})^{T} \mathbf{\delta^{1}}

For the bias, we simply sum down the column of \mathbf{\delta^{1}} (i.e., over the batch dimension):

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{B^2}} = \begin{bmatrix}\sum_{i=1}^{2}\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{\hat{y}_{i, 1}}}\end{bmatrix}

= \begin{bmatrix} (\hat{y}_{1, 1} - y_{1, 1}) +(\hat{y}_{2, 1} - y_{2, 1}) \end{bmatrix}

= \begin{bmatrix} \sum_{i=1}^{2}\delta^{1}_{i,1}\end{bmatrix}

Finally, for the hidden states H^{1}:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{H^1}} = \begin{bmatrix} \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,1}}} & \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,2}}} & \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,3}}} \\\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,1}}} &\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,2}}} &\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,3}}}\end{bmatrix}

= \begin{bmatrix} (\hat{y_{1, 1}}-y_{1, 1})(w^{2}_{1,1}) &(\hat{y_{1, 1}}-y_{1, 1})(w^{2}_{2,1}) &(\hat{y_{1, 1}}-y_{1, 1})(w^{2}_{3,1}) \\(\hat{y_{2, 1}}-y_{2, 1})(w^{2}_{1,1}) &(\hat{y_{2, 1}}-y_{2, 1})(w^{2}_{2,1}) &(\hat{y_{2, 1}}-y_{2, 1})(w^{2}_{3,1}) \end{bmatrix}

= \begin{bmatrix} \hat{y_{1, 1}} - y_{1, 1} \\ \hat{y_{2, 1}} - y_{2, 1} \end{bmatrix} \begin{bmatrix} w^{2}_{1, 1} & w^{2}_{2,1} & w^{2}_{3,1} \end{bmatrix}

=\mathbf{\delta^{1}}(W^2)^{T}
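Putting the three matrix forms side by side in NumPy (placeholder values again), and checking them against the element-wise sums from the previous sections:

import numpy as np

rng = np.random.default_rng(1)
H1 = rng.random((2, 3))        # hidden activations (placeholders)
W2 = rng.random((3, 1))        # output weights
delta1 = rng.random((2, 1))    # y_hat - y for the batch

grad_W2 = H1.T @ delta1        # (H^1)^T delta^1, shape (3, 1)
grad_b2 = delta1.sum(axis=0)   # sum over the batch, shape (1,)
grad_H1 = delta1 @ W2.T        # delta^1 (W^2)^T, shape (2, 3)

# The element-wise sums from the previous sections agree with the matrix forms.
assert np.allclose(grad_W2[0, 0],
                   delta1[0, 0] * H1[0, 0] + delta1[1, 0] * H1[1, 0])
assert np.allclose(grad_H1[1, 2], delta1[1, 0] * W2[2, 0])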

Step 4: Calculating the accumulated gradients for L^{1}

Alright! So we are nearly there. For the last step, we need to push the gradients back through the non-linearity, i.e., find the derivative of the hidden layer with respect to its pre-activation values; in the case of our neural network, the non-linearity is a sigmoid function! This is the crucial step, as after it we can simply rinse and repeat.

A sigmoid function is defined as:

\sigma(x) = \frac{1}{1+\exp(-x)}

Its derivative is defined as:

\sigma'(x) = \sigma(x)(1-\sigma(x))
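If you would rather check this identity numerically than derive it, here is a small finite-difference sketch (my own check, not part of the derivation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.3
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))
assert abs(numeric - analytic) < 1e-6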

I will leave the derivation of the derivative as an exercise! For now we will continue on with the last step.

Layer 2 linear
Figure 9: Accumulated gradients for L^{1}

Looking at the diagram above, if we follow the arrows from the loss to the nodes in the linear layer L^{1}, we should be able to see that for each node in the linear layer:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{i,j}}}=\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^1_{i,j}}}\frac{\partial{h^1_{i,j}}}{\partial{l^1_{i,j}}}

Since each element of H^1 is just the corresponding element of L^{1} with a sigmoid applied to it:

\frac{\partial{h^1_{i,j}}}{\partial{l^1_{i,j}}} =\frac{\partial{\sigma (l^1_{i,j})}}{\partial{l^1_{i,j}}}  = \sigma(l^1_{i,j})(1-\sigma(l^1_{i,j}))

Or, put more simply, the accumulated gradients from the loss layer up to the inputs of the layer before the non-linearity are calculated as:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{L^1}}

=\begin{bmatrix} \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,1}}} & \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,2}}} & \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{1,3}}} \\\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,1}}} &\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,2}}} &\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{h^{1}_{2,3}}}\end{bmatrix} \circ \begin{bmatrix} \frac{\partial{h^{1}_{1,1}}}{\partial{l^{1}_{1,1}}} & \frac{\partial{h^{1}_{1,2}}}{\partial{l^{1}_{1,2}}} & \frac{\partial{h^{1}_{1,3}}}{\partial{l^{1}_{1,3}}} \\ \frac{\partial{h^{1}_{2,1}}}{\partial{l^{1}_{2,1}}} & \frac{\partial{h^{1}_{2,2}}}{\partial{l^{1}_{2,2}}} & \frac{\partial{h^{1}_{2,3}}}{\partial{l^{1}_{2,3}}} \end{bmatrix}

= \mathbf{\delta^{1}}(W^2)^{T} \circ {\sigma'(L^{1})}

There are two things that I should explain here. First, the circle \circ is called the Hadamard product, which is an element-wise multiplication of two vectors/matrices with the same dimensions.

Second, I have abused the notation a bit: I am denoting \sigma'(L^{1}) as:

{\sigma'(L^{1})} =\begin{bmatrix} \frac{\partial{\sigma(l^{1}_{1,1})}}{\partial{l^{1}_{1,1}}} & \frac{\partial{\sigma(l^{1}_{1,2})}}{\partial{l^{1}_{1,2}}} & \frac{\partial{\sigma(l^{1}_{1,3})}}{\partial{l^{1}_{1,3}}} \\ \frac{\partial{\sigma(l^{1}_{2,1})}}{\partial{l^{1}_{2,1}}} & \frac{\partial{\sigma(l^{1}_{2,2})}}{\partial{l^{1}_{2,2}}} & \frac{\partial{\sigma(l^{1}_{2,3})}}{\partial{l^{1}_{2,3}}} \end{bmatrix}=\begin{bmatrix} \frac{\partial{h^{1}_{1,1}}}{\partial{l^{1}_{1,1}}} & \frac{\partial{h^{1}_{1,2}}}{\partial{l^{1}_{1,2}}} & \frac{\partial{h^{1}_{1,3}}}{\partial{l^{1}_{1,3}}} \\ \frac{\partial{h^{1}_{2,1}}}{\partial{l^{1}_{2,1}}} & \frac{\partial{h^{1}_{2,2}}}{\partial{l^{1}_{2,2}}} & \frac{\partial{h^{1}_{2,3}}}{\partial{l^{1}_{2,3}}} \end{bmatrix}

Finally, we are able to make the abstraction for \mathbf{\delta^{2}}:

\mathbf{\delta^{2}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{L^1}} = \mathbf{\delta^{1}}(W^2)^{T} \circ {\sigma'(L^{1})}

Each element in row i and column j of \mathbf{\delta^{2}} is the accumulated gradient from the loss up to l^{1}_{i,j}.
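In NumPy the new delta is a one-liner; sigmoid_prime here is just my own helper for \sigma(x)(1-\sigma(x)) applied element-wise, and the arrays are random placeholders:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

rng = np.random.default_rng(2)
L1 = rng.normal(size=(2, 3))      # pre-activations of the hidden layer (placeholders)
W2 = rng.normal(size=(3, 1))      # output weights
delta1 = rng.normal(size=(2, 1))  # y_hat - y

# delta^2 = (delta^1 (W^2)^T) Hadamard sigma'(L^1), shape (2, 3)
delta2 = (delta1 @ W2.T) * sigmoid_prime(L1)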

Step 5: Rinse and repeat

Here is our updated diagram!

Final.png
Figure 10: Accumulated gradients for L^{1}

Now I won’t go into as much detail, though I do urge you to follow the steps that we just went through to find the accumulated gradients for the weights, biases, and inputs of the input layer.

Input weights
Figure 11: Accumulated gradients for w^{1}_{1,1}

Looking at figure 11 we can see that:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{w^{1}_{i,j}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{1, j}}} \frac{\partial{l^1_{1, j}}}{\partial{w^{1}_{i,j}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{2, j}}} \frac{\partial{l^1_{2, j}}}{\partial{w^{1}_{i,j}}}

= \delta^{2}_{1,j}x_{1,i}+\delta^{2}_{2,j}x_{2,i}

Or more generally for any batch size B and any layer l:

\frac{\partial{L}}{\partial{w^{l}_{i,j}}} = \sum_{b=1}^{B}\frac{\partial{L}}{\partial{l^l_{b, j}}} \frac{\partial{l^l_{b, j}}}{\partial{w^{l}_{i,j}}}=\sum_{b=1}^{B}\delta^{L-l}_{b,j}h^{l-1}_{b,i}

Which is the (i, j)th element of:

\frac{\partial{L}}{\partial{W^{l}}} = (H^{l-1})^T\mathbf{\delta^{L-l}}

where:

H^0 = X

Note that in our neural network L = 3: we have the input layer, the hidden layer, and the output layer.
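As a sketch of the general weight formula with l = 1 and H^0 = X (random placeholder arrays and my own variable names):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(2, 2))        # the inputs, i.e. H^0
delta2 = rng.normal(size=(2, 3))   # accumulated gradients up to L^1 (placeholders)

grad_W1 = X.T @ delta2             # (H^0)^T delta^2, shape (2, 3) like W^1

# Element (i, j) is the "two arrow" sum delta2[0, j] * X[0, i] + delta2[1, j] * X[1, i].
assert np.allclose(grad_W1[0, 1],
                   delta2[0, 1] * X[0, 0] + delta2[1, 1] * X[1, 0])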

Input bias
Figure 12: Accumulated gradients for b^{1}_{1}

Looking at figure 12 we can see that:

\frac{\partial{L}}{\partial{B^{1}}} = \begin{bmatrix} \sum_{b=1}^{2}\delta^{2}_{b,1} & \sum_{b=1}^{2}\delta^{2}_{b,2} & \sum_{b=1}^{2}\delta^{2}_{b,3} \end{bmatrix}

Or more generally, for any batch size B and layer l:

\frac{\partial{L}}{\partial{B^{l}}} = \begin{bmatrix} \sum_{b=1}^{B}\delta^{L-l}_{b,1} & \dots & \sum_{b=1}^{B}\delta^{L-l}_{b,D^{l}} \end{bmatrix}

where:

D^{l} is the number of hidden nodes in layer l
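In code, this bias gradient is just a column sum of the delta matrix (placeholder values of my own):

import numpy as np

delta2 = np.array([[0.1, -0.4,  0.2],
                   [0.3,  0.5, -0.1]])   # accumulated gradients up to L^1 (placeholders)

grad_B1 = delta2.sum(axis=0)             # shape (3,), one entry per hidden node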

Input layer
Figure 13: Accumulated gradients for x_{1,1} or alternatively h^0_{1,1}

Looking at figure 13, we can see that:

\frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{x_{1,1}}} = \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{1, 1}}} \frac{\partial{l^1_{1, 1}}}{\partial{x_{1,1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{1, 2}}} \frac{\partial{l^1_{1, 2}}}{\partial{x_{1,1}}} + \frac{\partial{L(\hat{y}_{1, 1}, y_{1, 1}, \hat{y}_{2, 1}, y_{2, 1})}}{\partial{l^1_{1, 3}}} \frac{\partial{l^1_{1, 3}}}{\partial{x_{1,1}}}

=\delta^2_{1,1}w^{1}_{1,1} + \delta^2_{1,2}w^{1}_{1,2} + \delta^2_{1,3}w^{1}_{1,3}

Or more generally, for any batch size B and layer l:

\frac{\partial{L}}{\partial{h^{l-1}_{b,j}}} = \sum_{k=1}^{D^{l}}\frac{\partial{L}}{\partial{l^l_{b, k}}} \frac{\partial{l^l_{b, k}}}{\partial{h^{l-1}_{b,j}}}

=\sum_{k=1}^{D^{l}}\delta^{L-l}_{b,k}w^{l}_{j,k}

Which is the (b, j)th element of:

\frac{\partial{L}}{\partial{H^{l-1}}} = \mathbf{\delta^{L-l}}(W^{l})^T
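And the corresponding one-liner in NumPy (random placeholders; the shapes are \mathbf{\delta^{2}}: 2 x 3 and W^{1}: 2 x 3, so the result is 2 x 2, like X):

import numpy as np

rng = np.random.default_rng(4)
delta2 = rng.normal(size=(2, 3))   # accumulated gradients up to L^1 (placeholders)
W1 = rng.normal(size=(2, 3))       # first layer weights

grad_X = delta2 @ W1.T             # delta^2 (W^1)^T, shape (2, 2) like X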

 

Conclusion

So we have derived the backpropagation equations:

\frac{\partial{L}}{\partial{W^{l}}} = (H^{l-1})^T\mathbf{\delta^{L-l}}

\frac{\partial{L}}{\partial{B^{l}}} = \begin{bmatrix} \sum_{b=1}^{B}\delta^{L-l}_{b,1} & \dots & \sum_{b=1}^{B}\delta^{L-l}_{b,D^{l}} \end{bmatrix}

\frac{\partial{L}}{\partial{H^{l-1}}} = \mathbf{\delta^{L-l}}(W^{l})^T
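To tie everything together, here is a minimal NumPy sketch, written as my own illustration of these three equations for the network in this post (the function names, variable names, and the finite-difference check are my additions, not a reference implementation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, W1, b1, W2, b2):
    L1 = X @ W1 + b1               # first layer linear combination
    H1 = sigmoid(L1)               # first hidden layer
    Y_hat = H1 @ W2 + b2           # output layer
    return L1, H1, Y_hat

def backward(X, Y, W1, b1, W2, b2):
    L1, H1, Y_hat = forward(X, W1, b1, W2, b2)

    delta1 = Y_hat - Y                       # loss layer: delta^1

    grad_W2 = H1.T @ delta1                  # (H^1)^T delta^1
    grad_b2 = delta1.sum(axis=0)             # sum over the batch
    grad_H1 = delta1 @ W2.T                  # delta^1 (W^2)^T

    delta2 = grad_H1 * sigmoid(L1) * (1 - sigmoid(L1))   # through the sigmoid: delta^2

    grad_W1 = X.T @ delta2                   # (H^0)^T delta^2, with H^0 = X
    grad_b1 = delta2.sum(axis=0)
    grad_X = delta2 @ W1.T                   # delta^2 (W^1)^T

    return grad_W1, grad_b1, grad_W2, grad_b2, grad_X

rng = np.random.default_rng(5)
X = rng.normal(size=(2, 2))
Y = rng.normal(size=(2, 1))
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=(1, 3))
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=(1, 1))

grad_W1, grad_b1, grad_W2, grad_b2, grad_X = backward(X, Y, W1, b1, W2, b2)

# Finite-difference check on one first-layer weight.
def loss_fn(W1_test):
    _, _, Y_hat = forward(X, W1_test, b1, W2, b2)
    return 0.5 * np.sum((Y_hat - Y) ** 2)

eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 1] += eps
W1_minus[0, 1] -= eps
numeric = (loss_fn(W1_plus) - loss_fn(W1_minus)) / (2 * eps)
assert np.isclose(numeric, grad_W1[0, 1])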

If you are interested in playing around with the diagrams that I have used, here is the link (please copy the diagrams and place them in a new file so that other people can use the template). I would urge you to play around with it and derive the backpropagation equations in detail for the input layer!

Also, if you’re interested in a more non-trivial example, please take a look at my other post, which works through backpropagation in a four layer neural network using a cross entropy loss.

I hope that this has made backpropagation a lot clearer! Please let me know in the comments below if I have made any mistakes or if you have any questions or any requests on what I should blog about next!
