An example of backpropagation in a four layer neural network using cross entropy loss


Update: I have written another post deriving backpropagation which has more diagrams, and I recommend reading that post first!

The backpropagation algorithm can be argued to be the most important contribution to the field of deep learning. In fact, it is because of this algorithm, and the increasing power of GPUs, that the deep learning era we are experiencing today is even possible!

There are many great articles online that explain how backpropagation works (my favorite is Christopher Olah’s post), but not many examples of backpropagation in a non-trivial setting. The examples I found online only showed backpropagation on simple neural networks (1 input layer, 1 hidden layer, 1 output layer), and they only used one sample of data during the backward pass.

The problem with using this simple example is two-fold:

  1. It misses out on the main concept of the backpropagation algorithm: reusing the gradients of previously calculated layers through matrix multiplications.
  2. In practice, neural networks aren’t trained by feeding in one sample at a time, but rather in batches (usually in powers of 2).

As a result, it was a struggle for me to make the mental leap from understanding how backpropagation works in a trivial neural network to how it works in current state-of-the-art neural networks, which consist of many layers and are trained by feeding in batches of examples, not one by one.

So in this post, I will attempt to work through the math of the backward pass of a four-layer neural network; in the process of which I will explain how the backpropagation algorithm can be generalized to a neural network of arbitrary depth and take an arbitrary number of samples as input.

Diagram of our neural network

We start off by defining our 4 layer network:

4 layer NN.png

I am going to assume that you are comfortable with the forward pass in a neural network and so I’m going to jump straight into the backward pass! We will first start off with using only 1 sample in the backward pass, then afterward we will see how to extend it to use more than 1 sample.

The output layer and loss function

The output layer of our neural network is a vector of probabilities produced by the softmax function, whose input is the vector \mathbf{z}^3:

\mathbf{z}^3 = \begin{bmatrix} z^3_1 & z^3_2 & z^3_3 \end{bmatrix}

The k^{th} element of the output of the softmax function is:

\hat{y}_k = \mathbf{\sigma}(\mathbf{z}^{3})_k = \frac{exp(z^{3}_k)}{\sum_{i=1}^{3}exp(z_i^3)}

As the output of the softmax function (which is also our output layer) is multi-valued, it can be succinctly represented by a vector. We will denote it by \mathbf{\hat{y}}, as these are the predicted probabilities:

\mathbf{\hat{y}} = \begin{bmatrix} \hat{y}_1 & \hat{y}_2 & \hat{y}_3 \end{bmatrix}

The output of the softmax function \mathbf{\hat{y}} is then used as the input to our loss function, the cross entropy loss:

H(\mathbf{\hat{y}},\mathbf{y}) := - \sum_{i=1}^3 y_{i} \log (\hat{y}_i)

where \mathbf{y} is a one-hot vector.
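To make the definitions above concrete, here is a minimal NumPy sketch of the softmax and cross entropy loss; the example values for \mathbf{z}^3 and \mathbf{y} are hypothetical, and subtracting the max inside the softmax is just a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(z):
    # shift by the max for numerical stability; the output is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_hat, y):
    # H(y_hat, y) = -sum_i y_i * log(y_hat_i)
    return -np.sum(y * np.log(y_hat))

z3 = np.array([1.0, 2.0, 0.5])   # hypothetical pre-softmax values
y = np.array([1.0, 0.0, 0.0])    # one-hot label: class 1 is the true class
y_hat = softmax(z3)
loss = cross_entropy(y_hat, y)
```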

Now we have all the information that we need to start the first step of the backpropagation algorithm! Our goal is to find how our loss function H changes with respect to \mathbf{z}^3. Since \mathbf{z}^3 has 3 variables, we need to find how H changes with each of them. We can simplify this problem by first examining how H changes with z^3_1.

output with gradients.png

Now our goal is to find:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_1}

By looking at the diagram above we see that this is simply:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_1} = \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_1} \frac{\partial\hat{y}_1}{\partial z^3_1} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_2}\frac{\partial\hat{y}_2}{\partial z^3_1} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_3} \frac{\partial\hat{y}_3}{\partial z^3_1}

Intuitively, the reason why the gradients from all three paths are added is that H is a function of \hat{y}_1, \hat{y}_2, and \hat{y}_3, which are all functions of z^3_1. Therefore, to calculate the change of H with respect to z^3_1 we have to include all the inputs to H, as if any of them changes, H would change; this is the reason we add all three paths. Please read Christopher Olah’s post on how backpropagation works, as it gives a much clearer explanation.

If we change z^3_1 to z^3_2 or z^3_3 it should be easy to convince yourself that:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_2} = \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_1} \frac{\partial\hat{y}_1}{\partial z^3_2} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_2}\frac{\partial\hat{y}_2}{\partial z^3_2} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_3} \frac{\partial\hat{y}_3}{\partial z^3_2}

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_3} = \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_1} \frac{\partial\hat{y}_1}{\partial z^3_3} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_2}\frac{\partial\hat{y}_2}{\partial z^3_3} +\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_3} \frac{\partial\hat{y}_3}{\partial z^3_3}

In fact, after staring at it for a while we can see that:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \begin{bmatrix} \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_1} &\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_2} & \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial z^3_3} \end{bmatrix} = \begin{bmatrix} \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_1} &\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_2} & \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_3} \end{bmatrix} \begin{bmatrix} \frac{\partial\hat{y}_1}{\partial z^3_1} & \frac{\partial\hat{y}_1}{\partial z^3_2} & \frac{\partial\hat{y}_1}{\partial z^3_3}\\ \frac{\partial\hat{y}_2}{\partial z^3_1} & \frac{\partial\hat{y}_2}{\partial z^3_2} & \frac{\partial\hat{y}_2}{\partial z^3_3}\\ \frac{\partial\hat{y}_3}{\partial z^3_1} & \frac{\partial\hat{y}_3}{\partial z^3_2} & \frac{\partial\hat{y}_3}{\partial z^3_3} \end{bmatrix}


\begin{bmatrix} \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_1} &\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_2} & \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\hat{y}_3} \end{bmatrix} = \begin{bmatrix} -\frac{y_1}{\hat{y}_1} & -\frac{y_2}{\hat{y}_2} & -\frac{y_3}{\hat{y}_3} \end{bmatrix} =\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\mathbf{\hat{y}}}

is the partial derivative of our loss function H with respect to its inputs \mathbf{\hat{y}}


\begin{bmatrix} \frac{\partial\hat{y}_1}{\partial z^3_1} & \frac{\partial\hat{y}_1}{\partial z^3_2} & \frac{\partial\hat{y}_1}{\partial z^3_3}\\ \frac{\partial\hat{y}_2}{\partial z^3_1} & \frac{\partial\hat{y}_2}{\partial z^3_2} & \frac{\partial\hat{y}_2}{\partial z^3_3}\\ \frac{\partial\hat{y}_3}{\partial z^3_1} & \frac{\partial\hat{y}_3}{\partial z^3_2} & \frac{\partial\hat{y}_3}{\partial z^3_3} \end{bmatrix} = \begin{bmatrix} \hat{y}_1(1-\hat{y}_1) & -\hat{y}_1\hat{y}_2 & -\hat{y}_1\hat{y}_3\\ -\hat{y}_2\hat{y}_1 & \hat{y}_2(1-\hat{y}_2) & -\hat{y}_2\hat{y}_3\\ -\hat{y}_3\hat{y}_1 & -\hat{y}_3\hat{y}_2 & \hat{y}_3(1-\hat{y}_3) \end{bmatrix}= \frac{\partial \mathbf{\hat{y}}}{\partial \mathbf{z}^3}

is the Jacobian of the softmax function (this might not be immediately obvious, but take it for granted for now; I might do a post on deriving the Jacobian of the softmax function in the future!).
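If you would rather check the Jacobian than take it for granted, here is a small NumPy sketch. It builds the matrix as diag(\mathbf{\hat{y}}) - \mathbf{\hat{y}}\mathbf{\hat{y}}^T (an equivalent way of writing the matrix above) and compares it against central finite differences; the input values are arbitrary:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # J[i, j] = d y_hat_i / d z_j = y_hat_i * (1{i=j} - y_hat_j)
    y_hat = softmax(z)
    return np.diag(y_hat) - np.outer(y_hat, y_hat)

z = np.array([0.3, -1.2, 2.0])   # arbitrary example input
J = softmax_jacobian(z)

# central finite differences, one column per z_j
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
```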

One interesting observation is that the columns of the Jacobian represent the edges leading into z^3_1, z^3_2, z^3_3. For example, the first column of the Jacobian represents the edges leading into z^3_1 from \hat{y}_1, \hat{y}_2, \hat{y}_3.

In fact, this is something that we will see often during the derivation of backpropagation: the columns of the Jacobian between layer l and layer l-1 represent the edges leading from layer l into a node in layer l-1.

So we are nearly there! We know that:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial\mathbf{\hat{y}}}\frac{\partial \mathbf{\hat{y}}}{\partial \mathbf{z}^3} =\begin{bmatrix} -\frac{y_1}{\hat{y}_1} & -\frac{y_2}{\hat{y}_2} & -\frac{y_3}{\hat{y}_3} \end{bmatrix} \begin{bmatrix} \hat{y}_1(1-\hat{y}_1) & -\hat{y}_1\hat{y}_2 & -\hat{y}_1\hat{y}_3\\ -\hat{y}_2\hat{y}_1 & \hat{y}_2(1-\hat{y}_2) & -\hat{y}_2\hat{y}_3\\ -\hat{y}_3\hat{y}_1 & -\hat{y}_3\hat{y}_2 & \hat{y}_3(1-\hat{y}_3) \end{bmatrix}

So after the matrix multiplications we get:

\begin{bmatrix} (\sum_{i \neq 1}^3 y_i \hat{y}_1)-y_1(1- \hat{y_1}) & (\sum_{i \neq 2}^3y_i \hat{y}_2)-y_2(1- \hat{y_2}) & (\sum_{i \neq 3}^3y_i \hat{y}_3)-y_3(1- \hat{y_3}) \end{bmatrix}

Since \mathbf{y} is a one-hot vector, only one element of \mathbf{y} will be 1 and the rest will be 0. Assuming that the class represented by the first element of \mathbf{y} is the true class, then:

\mathbf{y} = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}

The result of the matrix multiplications becomes:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \begin{bmatrix}-(1- \hat{y_1}) & \hat{y_2} & \hat{y_3} \end{bmatrix} =\begin{bmatrix} \hat{y_1} - 1 & \hat{y_2} & \hat{y_3} \end{bmatrix}

Again, it should be easy to convince yourself that if the class being represented by the second element of \mathbf{y} is true, then:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \begin{bmatrix} \hat{y_1} & \hat{y_2} - 1 & \hat{y_3} \end{bmatrix}

So after all that hard work we see that \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} can simply be represented by:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \mathbf{\hat{y}} - \mathbf{y} =\delta^1 = \begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}

I want to take a bit of time to explain the importance of abstracting away the derivatives of the previous layers using \delta^1. We could avoid introducing \delta^1 and keep writing \mathbf{\hat{y}} - \mathbf{y}; although that seems fine now, as we go deeper backwards into the network we would have a long chain of variables to keep track of. This is a problem because it would be near impossible to implement in code, and the derivation would be a lot harder to understand.
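We can also sanity-check the result \delta^1 = \mathbf{\hat{y}} - \mathbf{y} numerically. The sketch below (with hypothetical values for \mathbf{z}^3 and \mathbf{y}) compares it against a finite-difference estimate of the gradient of the loss with respect to \mathbf{z}^3:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy_from_z3(z3, y):
    # loss as a function of the pre-softmax vector z^3
    return -np.sum(y * np.log(softmax(z3)))

z3 = np.array([1.0, -0.5, 0.7])  # hypothetical pre-softmax values
y = np.array([0.0, 1.0, 0.0])    # one-hot target: class 2 is true

delta1 = softmax(z3) - y          # the analytic gradient y_hat - y

# central finite differences on each z^3_k
eps = 1e-6
grad_num = np.zeros(3)
for k in range(3):
    dz = np.zeros(3)
    dz[k] = eps
    grad_num[k] = (cross_entropy_from_z3(z3 + dz, y)
                   - cross_entropy_from_z3(z3 - dz, y)) / (2 * eps)
```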

Hidden Layer 2

The second hidden layer of our neural network produces the input \mathbf{z}^3 to our softmax function in the output layer, where \mathbf{z}^3 is defined as:

\mathbf{z}^3 = \mathbf{h}^2W^3+\mathbf{b}^3


\mathbf{h}^2 = \begin{bmatrix} h^2_1 & h^2_2 & h^2_3 & h^2_4 \\\end{bmatrix}

W^3 = \begin{bmatrix} w_{1,1}^3 & w_{1,2}^3 & w_{1,3}^3 \\ w_{2,1}^3 & w_{2,2}^3 & w_{2,3}^3 \\ w_{3,1}^3 & w_{3,2}^3 & w_{3,3}^3 \\ w_{4,1}^3 & w_{4,2}^3 & w_{4,3}^3 \\\end{bmatrix}

\mathbf{b}^3 = \begin{bmatrix} b^3_1 & b^3_2 & b^3_3 \\\end{bmatrix}

We now know how to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}, so our goal is to find:

  • \frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}
  • \frac{\partial \mathbf{z}^3}{\partial W^3}
  • \frac{\partial \mathbf{z}^3}{\partial \mathbf{b}^3}

After we find these gradients, we can simply multiply them by \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \delta^1 to get the gradients of our loss function H with respect to \mathbf{h}^2, W^3, and \mathbf{b}^3.

For example:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}= \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2} = \delta^1\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}

Finding \frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}:

We first start with trying to find \frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}; like before, we will make the problem easier by first trying to find \frac{\partial \mathbf{z}^3}{\partial{h}^2_1}.

hidden layer 2 with gradients.png

We can represent \frac{\partial \mathbf{z}^3}{\partial{h}^2_1} as a vector of partial derivatives like so:

\frac{\partial \mathbf{z}^3}{\partial{h}^2_1} = \begin{bmatrix}\frac{\partial z^3_1}{\partial{h}^2_1} \\\frac{\partial z^3_2}{\partial{h}^2_1} \\\frac{\partial z^3_3}{\partial{h}^2_1} \end{bmatrix}

By replacing h^2_1 with any other node in hidden layer 2, we can see that:

\frac{\partial \mathbf{z}^3}{\partial{h}^2_2} = \begin{bmatrix}\frac{\partial z^3_1}{\partial{h}^2_2} \\\frac{\partial z^3_2}{\partial{h}^2_2} \\\frac{\partial z^3_3}{\partial{h}^2_2} \end{bmatrix}

\frac{\partial \mathbf{z}^3}{\partial{h}^2_3} = \begin{bmatrix}\frac{\partial z^3_1}{\partial{h}^2_3} \\\frac{\partial z^3_2}{\partial{h}^2_3} \\\frac{\partial z^3_3}{\partial{h}^2_3} \end{bmatrix}

\frac{\partial \mathbf{z}^3}{\partial{h}^2_4} = \begin{bmatrix}\frac{\partial z^3_1}{\partial{h}^2_4} \\\frac{\partial z^3_2}{\partial{h}^2_4} \\\frac{\partial z^3_3}{\partial{h}^2_4} \end{bmatrix}

By concatenating these vectors we get \frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}:

\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2} =\begin{bmatrix}\frac{\partial \mathbf{z}^3}{\partial{h}^2_1} &\frac{\partial \mathbf{z}^3}{\partial{h}^2_2} &\frac{\partial \mathbf{z}^3}{\partial{h}^2_3} &\frac{\partial \mathbf{z}^3}{\partial{h}^2_4} \end{bmatrix}\ = \begin{bmatrix} \frac{\partial z^3_1}{\partial{h}^2_1} &\frac{\partial z^3_1}{\partial{h}^2_2}&\frac{\partial z^3_1}{\partial{h}^2_3}&\frac{\partial z^3_1}{\partial{h}^2_4}\\ \frac{\partial z^3_2}{\partial{h}^2_1} &\frac{\partial z^3_2}{\partial{h}^2_2} &\frac{\partial z^3_2}{\partial{h}^2_3}&\frac{\partial z^3_2}{\partial{h}^2_4} \\ \frac{\partial z^3_3}{\partial{h}^2_1}&\frac{\partial z^3_3}{\partial{h}^2_2}&\frac{\partial z^3_3}{\partial{h}^2_3}&\frac{\partial z^3_3}{\partial{h}^2_4} \end{bmatrix}

This is the Jacobian between layers \mathbf{z}^3 and \mathbf{h}^2. Again we can see that the columns of the Jacobian represent the edges from all the nodes in layer \mathbf{z}^3 to a node in layer \mathbf{h}^2.

All that is left now is to find the derivatives in the matrix. We first start by finding \frac{\partial z^3_1}{\partial{h}^2_1}:

\frac{\partial z^3_1}{\partial{h}^2_1} =\frac{\partial}{\partial{h}^2_1}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1)=w_{1,1}^3

Let’s find a few more to see if we can find a pattern:

\frac{\partial z^3_1}{\partial{h}^2_2} =\frac{\partial}{\partial{h}^2_2}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1)=w_{2,1}^3

\frac{\partial z^3_1}{\partial{h}^2_3} =\frac{\partial}{\partial{h}^2_3}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1)=w_{3,1}^3

\frac{\partial z^3_1}{\partial{h}^2_4} =\frac{\partial}{\partial{h}^2_4}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1)=w_{4,1}^3

\frac{\partial z^3_2}{\partial{h}^2_1} =\frac{\partial}{\partial{h}^2_1}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2)=w_{1,2}^3

Hopefully it shouldn’t be hard to convince yourself now that:

\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2} = \begin{bmatrix} w_{1,1}^3 & w_{2,1}^3 & w_{3,1}^3 & w_{4,1}^3 \\ w_{1,2}^3 &w_{2,2}^3 &w_{3,2}^3 &w_{4,2}^3 \\ w_{1,3}^3 &w_{2,3}^3 &w_{3,3}^3 &w_{4,3}^3 \end{bmatrix} = (W^3)^T

Again this is a pattern that will start to emerge as we continue on. The Jacobian between layers l and l-1 is simply the transpose of the weight matrix connecting them.

Now finally putting it all together:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}= \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2} = \delta^1 (W^3)^T

Note how we are able to express \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2} as a function of our previously calculated gradient \delta^1; there was no need to recalculate it from scratch.
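As a quick check of \delta^1 (W^3)^T, here is a NumPy sketch with randomly generated (hypothetical) values for \mathbf{h}^2, W^3, and \mathbf{b}^3 that compares the analytic gradient to finite differences of the loss with respect to \mathbf{h}^2:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_from_h2(h2, W3, b3, y):
    # forward from hidden layer 2 to the cross entropy loss
    return -np.sum(y * np.log(softmax(h2 @ W3 + b3)))

rng = np.random.default_rng(0)
h2 = rng.normal(size=4)           # hypothetical activations of hidden layer 2
W3 = rng.normal(size=(4, 3))      # weights from h^2 (4 units) to z^3 (3 units)
b3 = rng.normal(size=3)
y = np.array([0.0, 0.0, 1.0])     # one-hot target

delta1 = softmax(h2 @ W3 + b3) - y
grad_h2 = delta1 @ W3.T           # dH/dh^2 = delta^1 (W^3)^T

# central finite differences on each component of h^2
eps = 1e-6
num = np.zeros(4)
for i in range(4):
    d = np.zeros(4)
    d[i] = eps
    num[i] = (loss_from_h2(h2 + d, W3, b3, y)
              - loss_from_h2(h2 - d, W3, b3, y)) / (2 * eps)
```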

Finding \frac{\partial \mathbf{z}^3}{\partial W^3}:

This might seem very daunting at first, but again, if we just find \frac{\partial \mathbf{z}^3}{\partial w^3_{1,1}} first and work from there, we will see that it isn’t as hard as it looks!

\frac{\partial \mathbf{z}^3}{\partial w^3_{1,1}} = \begin{bmatrix}\frac{\partial z^3_1}{\partial w^3_{1,1}} \\\frac{\partial z^3_2}{\partial w^3_{1,1}} \\\frac{\partial z^3_3}{\partial w^3_{1,1}} \end{bmatrix} = \begin{bmatrix}\frac{\partial }{\partial w^3_{1,1}}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1) \\\frac{\partial }{\partial w^3_{1,1}}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2) \\ \frac{\partial }{\partial w^3_{1,1}}(h^2_1w^3_{1,3}+h^2_2w^3_{2,3}+h^2_3w^3_{3,3}+h^2_4w^3_{4,3}+b^3_3) \end{bmatrix} = \begin{bmatrix}h^2_1 \\0\\0 \end{bmatrix}

Let’s take two more elements of the weight matrix and find the derivative of \mathbf{z}^3 with respect to them:

\frac{\partial \mathbf{z}^3}{\partial w^3_{3,2}} = \begin{bmatrix}\frac{\partial z^3_1}{\partial w^3_{3,2}} \\\frac{\partial z^3_2}{\partial w^3_{3,2}} \\\frac{\partial z^3_3}{\partial w^3_{3,2}} \end{bmatrix} = \begin{bmatrix}\frac{\partial }{\partial w^3_{3,2}}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1) \\\frac{\partial }{\partial w^3_{3,2}}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2) \\ \frac{\partial }{\partial w^3_{3,2}}(h^2_1w^3_{1,3}+h^2_2w^3_{2,3}+h^2_3w^3_{3,3}+h^2_4w^3_{4,3}+b^3_3) \end{bmatrix} = \begin{bmatrix}0 \\h^2_3\\0 \end{bmatrix}

\frac{\partial \mathbf{z}^3}{\partial w^3_{2,3}} = \begin{bmatrix}\frac{\partial z^3_1}{\partial w^3_{2,3}} \\\frac{\partial z^3_2}{\partial w^3_{2,3}} \\\frac{\partial z^3_3}{\partial w^3_{2,3}} \end{bmatrix} = \begin{bmatrix}\frac{\partial }{\partial w^3_{2,3}}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1) \\\frac{\partial }{\partial w^3_{2,3}}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2) \\ \frac{\partial }{\partial w^3_{2,3}}(h^2_1w^3_{1,3}+h^2_2w^3_{2,3}+h^2_3w^3_{3,3}+h^2_4w^3_{4,3}+b^3_3) \end{bmatrix} = \begin{bmatrix}0 \\0\\h^2_2 \end{bmatrix}

So we can see that for any w^3_{i,j}:

\frac{\partial \mathbf{z}^3}{\partial w^3_{i,j}} = \begin{bmatrix}\frac{\partial z^3_1}{\partial w^3_{i,j}} \\\frac{\partial z^3_2}{\partial w^3_{i,j}} \\\frac{\partial z^3_3}{\partial w^3_{i,j}} \end{bmatrix} = \begin{bmatrix}\frac{\partial }{\partial w^3_{i,j}}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1) \\\frac{\partial }{\partial w^3_{i,j}}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2) \\ \frac{\partial }{\partial w^3_{i,j}}(h^2_1w^3_{1,3}+h^2_2w^3_{2,3}+h^2_3w^3_{3,3}+h^2_4w^3_{4,3}+b^3_3) \end{bmatrix} = \begin{bmatrix}\mathbf{1}_{(j=1)}h ^2_i\\\mathbf{1}_{(j=2)}h^2_i\\\mathbf{1}_{(j=3)}h^2_i\end{bmatrix}

where: \mathbf{1}_{(\cdot)} is the indicator function; if the condition within the brackets is true it equals 1, otherwise it equals 0.

Now it might still seem a bit complicated due to the indicator functions, but regardless let’s push on; later we will see that this simplifies to something very elegant! Let us now try to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial w^3_{i,j}}!

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial w^3_{i,j}} =\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial w^3_{i,j}}=\delta^1 \frac{\partial \mathbf{z}^3}{\partial w^3_{i,j}} =\begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}\begin{bmatrix}\mathbf{1}_{(j=1)}h^2_i\\\mathbf{1}_{(j=2)}h^2_i\\\mathbf{1}_{(j=3)}h^2_i\end{bmatrix}

Now let’s try to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial w^3_{1,1}} and see if we can spot any patterns:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial w^3_{1,1}} = \begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}\begin{bmatrix} h^2_1 \\ 0 \\ 0 \end{bmatrix} = \delta^1_{1}h^2_1

Let’s try a few more elements of W^3:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial w^3_{3,2}} = \begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}\begin{bmatrix} 0 \\ h^2_3 \\ 0 \end{bmatrix} = \delta^1_{2}h^2_3

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial w^3_{2,3}} = \begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ h^2_2 \end{bmatrix} = \delta^1_{3}h^2_2

Now if we stare at these examples a bit we can see that:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial w^3_{i,j}} = \delta^1_{j}h^2_i

Now we can write out \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^3}:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^3} = \begin{bmatrix}\delta^1_{1}h^2_1 & \delta^1_{2}h^2_1 & \delta^1_{3}h^2_1 \\ \delta^1_{1}h^2_2 & \delta^1_{2}h^2_2 & \delta^1_{3}h^2_2 \\ \delta^1_{1}h^2_3 & \delta^1_{2}h^2_3 & \delta^1_{3}h^2_3\\ \delta^1_{1}h^2_4 & \delta^1_{2}h^2_4 & \delta^1_{3}h^2_4\end{bmatrix}

Now again if we stare at it for long enough we can see that this is simply (well, maybe not that simple):

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^3} = \begin{bmatrix}\delta^1_{1}h^2_1 & \delta^1_{2}h^2_1 & \delta^1_{3}h^2_1 \\ \delta^1_{1}h^2_2 & \delta^1_{2}h^2_2 & \delta^1_{3}h^2_2 \\ \delta^1_{1}h^2_3 & \delta^1_{2}h^2_3 & \delta^1_{3}h^2_3\\ \delta^1_{1}h^2_4 & \delta^1_{2}h^2_4 & \delta^1_{3}h^2_4\end{bmatrix} = \begin{bmatrix}h^2_1\\h^2_2 \\h^2_3\\ h^2_4\end{bmatrix}\begin{bmatrix}\delta^1_{1}&\delta^1_{2}& \delta^1_{3}\end{bmatrix}=(\mathbf{h}^2)^T \delta^1
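In code, (\mathbf{h}^2)^T \delta^1 for a single sample is just an outer product; the values below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
h2 = rng.normal(size=4)           # hypothetical hidden-layer activations
delta1 = rng.normal(size=3)       # hypothetical gradient at z^3

# (h^2)^T delta^1 is an outer product: entry (i, j) is h^2_i * delta^1_j
grad_W3 = np.outer(h2, delta1)    # same shape as W^3, i.e. (4, 3)
```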

Finding \frac{\partial \mathbf{z}^3}{\partial \mathbf{b}^3}:

Again, since we are finding the change of a vector with respect to another vector we will have a Jacobian, although this one is substantially easier to calculate! As always we start by trying to find \frac{ \partial  \mathbf{z}^3}{ \partial b^3_{1}} then extend it from there:

\frac{\partial \mathbf{z}^3}{\partial b^3_{1}} = \begin{bmatrix}\frac{\partial z^3_1}{\partial b^3_{1}} \\\frac{\partial z^3_2}{\partial b^3_{1}} \\\frac{\partial z^3_3}{\partial b^3_{1}} \end{bmatrix} = \begin{bmatrix}\frac{\partial }{\partial b^3_{1}}(h^2_1w^3_{1,1}+h^2_2w^3_{2,1}+h^2_3w^3_{3,1}+h^2_4w^3_{4,1}+b^3_1) \\\frac{\partial }{\partial b^3_{1}}(h^2_1w^3_{1,2}+h^2_2w^3_{2,2}+h^2_3w^3_{3,2}+h^2_4w^3_{4,2}+b^3_2) \\ \frac{\partial }{\partial b^3_{1}}(h^2_1w^3_{1,3}+h^2_2w^3_{2,3}+h^2_3w^3_{3,3}+h^2_4w^3_{4,3}+b^3_3) \end{bmatrix} = \begin{bmatrix}1 \\0\\0 \end{bmatrix}

It shouldn’t be too hard to convince yourself that:

\frac{\partial \mathbf{z}^3}{\partial b^3_{2}} =  \begin{bmatrix} 0 \\1\\0 \end{bmatrix}


\frac{\partial \mathbf{z}^3}{\partial b^3_{3}} =  \begin{bmatrix} 0 \\0 \\1 \end{bmatrix}

Then we concatenate the vectors together to form \frac{\partial \mathbf{z}^3}{\partial \mathbf{b}^3}:

\frac{\partial \mathbf{z}^3}{\partial \mathbf{b}^3} = \mathbf{I}


\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3}= \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{b}^3} = \delta^1 \mathbf{I}  = \delta^1
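Here is a quick numerical check of the result \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3} = \delta^1, again with hypothetical random values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_from_b3(b3, h2, W3, y):
    # loss as a function of the bias vector b^3
    return -np.sum(y * np.log(softmax(h2 @ W3 + b3)))

rng = np.random.default_rng(0)
h2 = rng.normal(size=4)
W3 = rng.normal(size=(4, 3))
b3 = rng.normal(size=3)
y = np.array([1.0, 0.0, 0.0])

delta1 = softmax(h2 @ W3 + b3) - y   # claimed bias gradient: delta^1 (times identity)

# central finite differences on each bias component
eps = 1e-6
grad_num = np.zeros(3)
for k in range(3):
    d = np.zeros(3)
    d[k] = eps
    grad_num[k] = (loss_from_b3(b3 + d, h2, W3, y)
                   - loss_from_b3(b3 - d, h2, W3, y)) / (2 * eps)
```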

Let’s reiterate what we have derived:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \mathbf{\hat{y}} - \mathbf{y} =\delta^1 = \begin{bmatrix} \delta^1_{1} & \delta^1_{2} & \delta^1_{3} \end{bmatrix}

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}= \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2} = \delta^1 (W^3)^T

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^3} =(\mathbf{h}^2)^T \delta^1

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3}=\delta^1

Hidden layer 1

The first hidden layer produces the inputs \mathbf{h}^2 to the second hidden layer where \mathbf{h}^2 is defined as:

\mathbf{h}^2 = \begin{bmatrix} \sigma(z^2_1) & \sigma(z^2_2)& \sigma(z^2_3)& \sigma(z^2_4) \end{bmatrix}

where: \sigma(\cdot) = \frac{1}{1+exp(-(\cdot))} and \sigma'(\cdot) = \sigma(\cdot)(1-\sigma(\cdot))

The first step is to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2} and express it using previously calculated values, like we did for \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}. To do this we need to first find \frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2}; after we have found it, we can simply compute:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2}= \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{h}^2}\frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2}= \delta^1 (W^3)^T\frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2}

to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2}.

To find \frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2} we will use the same trick we used before and find \frac{\partial \mathbf{h}^2}{\partial z^2_1} first and work from there:

\frac{\partial \mathbf{h}^2}{\partial z^2_1}= \begin{bmatrix} \frac{\partial}{\partial z^2_1}\sigma(z^2_1) \\ \frac{\partial}{\partial z^2_1}\sigma(z^2_2) \\ \frac{\partial}{\partial z^2_1}\sigma(z^2_3) \\ \frac{\partial}{\partial z^2_1}\sigma(z^2_4) \end{bmatrix} = \begin{bmatrix} \sigma(z^2_1) (1 - \sigma(z^2_1)) \\ 0 \\0 \\0 \end{bmatrix}

If we did the same thing for z^2_2, z^2_3, z^2_4 and concatenated their vectors we would get:

\frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2} = \begin{bmatrix} \sigma(z^2_1)(1 - \sigma(z^2_1)) & 0 & 0 & 0  \\ 0 &  \sigma(z^2_2)(1 - \sigma(z^2_2)) & 0 & 0\\0 & 0 & \sigma(z^2_3)(1 - \sigma(z^2_3)) & 0 \\0 & 0 & 0 & \sigma(z^2_4)(1 - \sigma(z^2_4)) \end{bmatrix}


\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2}=  \delta^1 (W^3)^T\frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2} = \delta^1 (W^3)^T \circ \sigma'(\mathbf{z}^2) = \delta^2


where:

\sigma'(\mathbf{z}^2) = \begin{bmatrix} \sigma(z^2_1)(1 - \sigma(z^2_1)) & \sigma(z^2_2)(1 - \sigma(z^2_2))& \sigma(z^2_3)(1 - \sigma(z^2_3))& \sigma(z^2_4)(1 - \sigma(z^2_4)) \end{bmatrix}

Note here that \circ is called the Hadamard product, which is an element-wise multiplication of two vectors/matrices with the same dimension.
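The sketch below shows this Hadamard form in NumPy (element-wise `*`), and also checks that it matches multiplying by the full diagonal Jacobian \frac{\partial \mathbf{h}^2}{\partial \mathbf{z}^2}; all values are random placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
delta1 = rng.normal(size=3)       # hypothetical upstream gradient at z^3
W3 = rng.normal(size=(4, 3))      # hypothetical weights from h^2 to z^3
z2 = rng.normal(size=4)           # hypothetical pre-activations of hidden layer 2

s = sigmoid(z2)
sigma_prime = s * (1.0 - s)       # sigma'(z^2), element-wise

# Hadamard form of the chain rule: delta^2 = delta^1 (W^3)^T o sigma'(z^2)
delta2 = (delta1 @ W3.T) * sigma_prime

# equivalent full-Jacobian form: dh^2/dz^2 is a diagonal matrix
delta2_full = (delta1 @ W3.T) @ np.diag(sigma_prime)
```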

From here on out it’s all rinse and repeat; we just follow the same steps that we went through above! First, let’s define some variables!

\mathbf{z}^2 = \mathbf{h}^1W^2+\mathbf{b}^2


\mathbf{h}^1 = \begin{bmatrix} h^1_1 & h^1_2 \end{bmatrix}

W^2 = \begin{bmatrix} w_{1,1}^2 & w_{1,2}^2 & w_{1, 3}^2 & w_{1, 4}^2 \\ w_{2,1}^2 & w_{2,2}^2 & w_{2,3}^2 & w_{2,4}^2 \end{bmatrix}

\mathbf{b}^2 = \begin{bmatrix} b^2_1 & b^2_2 & b^2_3 & b^2_4 \\\end{bmatrix}

We now know how to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2}, so our goal is to find:

  • \frac{\partial \mathbf{z}^2}{\partial \mathbf{h}^1}
  • \frac{\partial \mathbf{z}^2}{\partial W^2}
  • \frac{\partial \mathbf{z}^2}{\partial \mathbf{b}^2}

If we go through the same steps as we did above then we would get:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^1} = \delta^2 (W^2)^T

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^2} =(\mathbf{h}^1)^T \delta^2

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^2}=\delta^2

I will leave this for you to verify for yourself (as working through it yourself will provide you with much more value than me working through the example for you, though if you get stuck I’m more than happy to help!).
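If you want something to check your own working against, here is a NumPy sketch (with hypothetical random values) that computes the three gradients above via \delta^2 and verifies \delta^2 (W^2)^T against finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=4)
W3, b3 = rng.normal(size=(4, 3)), rng.normal(size=3)
h1 = rng.normal(size=2)           # hypothetical hidden layer 1 activations
y = np.array([0.0, 0.0, 1.0])

def loss(h1_):
    # forward pass from h^1 to the loss
    h2_ = sigmoid(h1_ @ W2 + b2)
    return -np.sum(y * np.log(softmax(h2_ @ W3 + b3)))

# analytic gradients via delta^1 and delta^2
h2 = sigmoid(h1 @ W2 + b2)
delta1 = softmax(h2 @ W3 + b3) - y
delta2 = (delta1 @ W3.T) * h2 * (1.0 - h2)   # h2 * (1 - h2) is sigma'(z^2)
grad_h1 = delta2 @ W2.T
grad_W2 = np.outer(h1, delta2)
grad_b2 = delta2

# finite-difference check of grad_h1
eps = 1e-6
num = np.zeros(2)
for i in range(2):
    d = np.zeros(2)
    d[i] = eps
    num[i] = (loss(h1 + d) - loss(h1 - d)) / (2 * eps)
```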

Input layer

The input layer produces the inputs \mathbf{h}^1 to the first hidden layer where \mathbf{h}^1 is defined as:

\mathbf{h}^1 = \begin{bmatrix} \sigma(z^1_1) & \sigma(z^1_2) \end{bmatrix}

Like last time we have to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^1} and express it using previously calculated values.

Then you should end up with something like this:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^1}=\delta^2 (W^2)^T \circ \sigma'(\mathbf{z}^1) = \delta^3


where:

\sigma'(\mathbf{z}^1) = \begin{bmatrix} \sigma(z^1_1)(1 - \sigma(z^1_1)) & \sigma(z^1_2)(1 - \sigma(z^1_2)) \end{bmatrix}

You are so close to finally finishing one iteration of the backpropagation algorithm! For this last part we will again need to define some variables!


\mathbf{z}^1 = \mathbf{x}W^1+\mathbf{b}^1


\mathbf{x} = \begin{bmatrix} x_1 & x_2 \end{bmatrix}

W^1 = \begin{bmatrix} w_{1,1}^1 & w_{1,2}^1 \\ w_{2,1}^1 & w_{2,2}^1 \end{bmatrix}

\mathbf{b}^1 = \begin{bmatrix} b^1_1 & b^1_2 \end{bmatrix}

As we already know how to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^1}, our goal is to find:

  • \frac{\partial \mathbf{z}^1}{\partial \mathbf{x}}
  • \frac{\partial \mathbf{z}^1}{\partial W^1}
  • \frac{\partial \mathbf{z}^1}{\partial \mathbf{b}^1}

This is what you should be getting:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{x}} = \delta^3 (W^1)^T

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^1} =(\mathbf{x})^T \delta^3

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^1}=\delta^3

Summary of gradients

Output layer:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \mathbf{\hat{y}} - \mathbf{y} =\delta^1

Hidden layer 2:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}=\delta^1 (W^3)^T

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^3} =(\mathbf{h}^2)^T \delta^1

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3}=\delta^1

Hidden layer 1:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^2}=\delta^1 (W^3)^T \circ \sigma'(\mathbf{z}^2) = \delta^2

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^1} = \delta^2 (W^2)^T

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^2} =(\mathbf{h}^1)^T \delta^2

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^2}=\delta^2

Input layer:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^1}=\delta^2 (W^2)^T \circ \sigma'(\mathbf{z}^1) = \delta^3

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{x}} = \delta^3 (W^1)^T

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^1} =(\mathbf{x})^T \delta^3

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^1}=\delta^3

There are a few interesting observations that can be made, assuming that we have a neural network with L layers where layer L is the output layer and layer 1 is the input layer \mathbf{x} (so to clarify \mathbf{input}^1 = \mathbf{x} and \mathbf{input}^2 = \mathbf{h}^1 and so on) then for all layers l:

  1. \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{input}^l}=\delta^{L-l} (W^{l})^T
  2. \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^l}= (\mathbf{input}^l)^T \delta^{L-l}
  3. \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^l}= \delta^{L-l}
  4. The abstraction step is always made for the gradient of the cost function with respect to the output of a layer. For example, \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^1} = \delta^3; note how \mathbf{z}^1 is the output of the input layer. This is because the gradient of the cost function with respect to a layer's output is reused in the expressions for the gradients of the cost function with respect to the weights, biases, and inputs of that layer.
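The whole derivation can be sketched as one forward and backward pass in NumPy, using this post's layer sizes (2 → 2 → 4 → 3). The weights are randomly initialized placeholders, and the sketch ends with a finite-difference check on one entry of W^1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# hypothetical parameters matching the shapes in this post
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(4, 3)), np.zeros(3)

x = rng.normal(size=2)
y = np.array([0.0, 1.0, 0.0])

# forward pass
z1 = x @ W1 + b1
h1 = sigmoid(z1)
z2 = h1 @ W2 + b2
h2 = sigmoid(z2)
z3 = h2 @ W3 + b3
y_hat = softmax(z3)

# backward pass, reusing each delta exactly as in the summary above
delta1 = y_hat - y
grad_W3, grad_b3 = np.outer(h2, delta1), delta1
delta2 = (delta1 @ W3.T) * h2 * (1.0 - h2)
grad_W2, grad_b2 = np.outer(h1, delta2), delta2
delta3 = (delta2 @ W2.T) * h1 * (1.0 - h1)
grad_W1, grad_b1 = np.outer(x, delta3), delta3

# finite-difference check of one weight, w^1_{1,1}
def loss(W1_):
    h1_ = sigmoid(x @ W1_ + b1)
    h2_ = sigmoid(h1_ @ W2 + b2)
    return -np.sum(y * np.log(softmax(h2_ @ W3 + b3)))

eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
grad_num = (loss(W1p) - loss(W1m)) / (2 * eps)
```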

Extension to using more than 1 sample in the backward pass

Extending the backpropagation algorithm to take more than one sample is relatively straightforward; the beauty of using matrix notation is that we don’t really have to change anything! As an example, let’s run the backward pass using 3 samples instead of 1 on the output layer and hidden layer 2.

For \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3}:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{z}^3} = \begin{bmatrix} \mathbf{\hat{y}}_1 - \mathbf{y}_1 \\  \mathbf{\hat{y}}_2 - \mathbf{y}_2  \\\mathbf{\hat{y}}_3 - \mathbf{y}_3 \end{bmatrix}= \begin{bmatrix} \delta^1_{1,1} & \delta^1_{1,2} & \delta^1_{1,3} \\ \delta^1_{2,1} &  \delta^1_{2,2} &  \delta^1_{2,3} \\ \delta^1_{3,1} & \delta^1_{3,2} & \delta^1_{3,3} \end{bmatrix} = \delta^1

For \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{h}^2}=\delta^1 (W^3)^T  =\begin{bmatrix} \delta^1_{1,1} & \delta^1_{1,2} & \delta^1_{1,3} \\ \delta^1_{2,1} &  \delta^1_{2,2} &  \delta^1_{2,3} \\ \delta^1_{3,1} & \delta^1_{3,2} & \delta^1_{3,3} \end{bmatrix}\begin{bmatrix} w_{1,1}^3 & w_{2,1}^3 & w_{3,1}^3 & w_{4,1}^3 \\ w_{1,2}^3 & w_{2,2}^3 & w_{3,2}^3 & w_{4,2}^3 \\ w_{1,3}^3 & w_{2,3}^3 & w_{3,3}^3 & w_{4,3}^3\end{bmatrix}

For \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^3}:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial W^3} =(\mathbf{h}^2)^T \delta^1 =\begin{bmatrix} h^2_{1,1} & h^2_{2,1} & h^2_{3,1} \\ h^2_{1,2} & h^2_{2,2} & h^2_{3,2} \\ h^2_{1,3} & h^2_{2,3} & h^2_{3,3} \\ h^2_{1,4} & h^2_{2,4} & h^2_{3,4} \end{bmatrix}\begin{bmatrix} \delta^1_{1,1} & \delta^1_{1,2} & \delta^1_{1,3} \\ \delta^1_{2,1} &  \delta^1_{2,2} &  \delta^1_{2,3} \\ \delta^1_{3,1} & \delta^1_{3,2} & \delta^1_{3,3} \end{bmatrix}

Here we see that:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial w^3_{1,1}} = h^2_{1, 1} \delta^1_{1, 1} + h^2_{2, 1} \delta^1_{2, 1} + h^2_{3, 1} \delta^1_{3, 1}

If we think about it, this makes sense: each term in the summation is how much the cost function changes with respect to w^3_{1, 1} for one sample. For example, h^2_{1, 1} \delta^1_{1, 1} is how much the cost function changes with respect to w^3_{1, 1} for sample 1. Since we have 3 samples, we sum the change in the cost function with respect to w^3_{1, 1} over all of the samples to find the total change in the cost function.
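We can check this bookkeeping numerically. The snippet below uses shapes matching this example (3 samples, 4 units in hidden layer 2, 3 output units); the variable names and random values are mine. It confirms that the single matrix product equals the sum of per-sample outer products:

```python
import numpy as np

rng = np.random.default_rng(0)
h2 = rng.standard_normal((3, 4))      # rows = samples, columns = hidden units
delta1 = rng.standard_normal((3, 3))  # rows = samples, columns = output units

batched = h2.T @ delta1               # the matrix form used above
per_sample = sum(np.outer(h2[i], delta1[i]) for i in range(3))
print(np.allclose(batched, per_sample))  # → True
```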

For \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3}:

This is the only term whose calculation changes when we extend from 1 sample to more than one sample. Before I show you the answer, let’s first take a look at \delta^1:

\delta^1=\begin{bmatrix} \delta^1_{1,1} & \delta^1_{1,2} & \delta^1_{1,3} \\ \delta^1_{2,1} &  \delta^1_{2,2} &  \delta^1_{2,3} \\ \delta^1_{3,1} & \delta^1_{3,2} & \delta^1_{3,3} \end{bmatrix}

Now, remember from before that \delta^1_{1, 1} is how much the cost function changes with respect to b^3_1 for sample 1; \delta^1_{2, 1} is how much the cost function changes with respect to b^3_1 for sample 2, and so on. Since the bias is shared across all samples, to find \frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial b^3_{1}} we simply sum up the first column of \delta^1; this sums how much the cost function changes with respect to b^3_1 for each sample.

Therefore we end up with:

\frac{\partial H(\mathbf{\hat{y}},\mathbf{y})}{\partial \mathbf{b}^3} = \begin{bmatrix} \sum_{i=1}^{3}\delta^1_{i, 1} & \sum_{i=1}^{3}\delta^1_{i, 2} & \sum_{i=1}^{3}\delta^1_{i, 3} \end{bmatrix}
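In NumPy terms this column sum is a single reduction over the sample axis. A toy example with made-up values for \delta^1:

```python
import numpy as np

delta1 = np.arange(9.0).reshape(3, 3)  # 3 samples x 3 output units, toy values
grad_b3 = delta1.sum(axis=0)           # sum each column over the samples
print(grad_b3)                         # → [ 9. 12. 15.]
```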

From here it should be relatively simple to repeat what we have done for hidden layer 1 and the input layer.
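To make that repetition concrete, here is the whole batched backward pass for the four-layer network written out in NumPy, one line per formula above. The layer sizes, the sigmoid hidden units, and the softmax output are my assumptions for this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 2))          # 3 samples, 2 input features (toy sizes)
y = np.eye(3)                            # one-hot targets, 3 classes
W1, b1 = rng.standard_normal((2, 5)), np.zeros(5)
W2, b2 = rng.standard_normal((5, 4)), np.zeros(4)
W3, b3 = rng.standard_normal((4, 3)), np.zeros(3)

# forward pass
h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
z3 = h2 @ W3 + b3
e = np.exp(z3 - z3.max(axis=1, keepdims=True))
yhat = e / e.sum(axis=1, keepdims=True)  # softmax output

# backward pass, mirroring the deltas in the text
delta1 = yhat - y                          # dH/dz^3
dW3, db3 = h2.T @ delta1, delta1.sum(axis=0)
delta2 = (delta1 @ W3.T) * h2 * (1 - h2)   # dH/dz^2
dW2, db2 = h1.T @ delta2, delta2.sum(axis=0)
delta3 = (delta2 @ W2.T) * h1 * (1 - h1)   # dH/dz^1
dW1, db1 = x.T @ delta3, delta3.sum(axis=0)
```

Note how each `delta` is computed once and reused for the weight, bias, and input gradients of its layer, exactly as in the derivation.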


This post turned out to be a lot longer than I expected, but I hope you have learned how backpropagation works and are able to apply it to a neural network of arbitrary depth! This is my first blog post, so I’m sure there are things I have not explained clearly; please don’t hesitate to tell me so that I can become a better writer 🙂



11 thoughts on “An example of backpropagation in a four layer neural network using cross entropy loss”

  1. You’re quite right, existing on the Net examples of backpropagation are too trivial, while your explanation is clear, detailed, and, as so, really very helpful. Thanks!


  2. Your article is very helpful, especially the great idea to present the functions z as separate columns on the net. I just want to report a typo on page 11/23: the elements of the matrix of weights in the formula at the very top of the page should be weights of layer 3, not layer 2. By the way, what text editor do you use here?


    1. Thanks for your comment, but I don’t know which part you are talking about as I don’t have pages in my UI; would you be able to give me a more detailed description of where the typo is? Also, I didn’t use any special text editor; this was all done in WordPress!

      EDIT: Thanks for your email, I’ve fixed the typos!


  3. Your article was AWESOME!!! I am a 7th grade student and your neural network article helped me a lot. But there are some parts I don’t understand yet. Why did you calculate the derivative of ‘h’? I think we only have to calculate the derivatives of ‘w’ and ‘b’ in order to update the network. (Sorry for my bad English.)


    1. Also, when you calculated the derivative of H (the loss function) with respect to z2, you multiplied (dot product) by the derivative of h2 with respect to z2, but after that why did you change it to a Hadamard product? (Maybe I’m asking a stupid question because I’m new to calculus.)


  4. Hey Eric, you are right that we only have to calculate the derivatives for ‘w’ and ‘b’, as that is what we are interested in. The reason we calculate the derivative of ‘h’ is that we need it to calculate the derivatives for the ‘w’ and ‘b’ of the previous layers. For example, to calculate the derivatives of the loss WRT ‘w_1’ and ‘b_1’ we need the derivative of the loss WRT ‘h_2’, because the former is a function of the latter. Does that make sense? Please tell me if it doesn’t 🙂 and I think it is amazing that you are interested in this stuff in 7th grade! Good luck!


    1. Wow, thanks for the quick reply!! Now I understand why you calculated the derivative of ‘h’! If you don’t mind, could you please reply to my second question too? I would really appreciate it!!!

