Update: I have written another post deriving backpropagation that has more diagrams, and I recommend reading that post first!
The backpropagation algorithm is arguably the most important contribution to the field of deep learning. In fact, it is because of this algorithm, together with the increasing power of GPUs, that the deep learning era we are experiencing today is even possible!
There are many great articles online that explain how backpropagation works (my favorite is Christopher Olah’s post), but not many examples of backpropagation in a non-trivial setting. The examples I found online only showed backpropagation on simple neural networks (1 input layer, 1 hidden layer, 1 output layer), and they only used 1 sample of data during the backward pass.
The problem with using this simple example is two-fold:
- It misses out on the main concept of the backpropagation algorithm: reusing the gradients of the previously calculated layers through matrix multiplications
- In practice, neural networks aren’t trained by feeding in one sample at a time, but rather in batches (usually with sizes that are powers of 2).
As a result, it was a struggle for me to make the mental leap from understanding how backpropagation works in a trivial neural network to how it works in current state-of-the-art networks, which consist of many layers and are trained on batches of examples, not one sample at a time.
So in this post, I will work through the math of the backward pass of a four-layer neural network, and in the process explain how the backpropagation algorithm generalizes to a neural network of arbitrary depth that takes an arbitrary number of samples as input.
Diagram of our neural network
We start off by defining our four-layer network. The input layer produces $z^{(1)}$, hidden layer 1 produces $a^{(1)}$ and $z^{(2)}$, hidden layer 2 produces $a^{(2)}$ and $z^{(3)}$, and the output layer produces the prediction $\hat{y}$:

$$z^{(1)} = W^{(1)} x + b^{(1)}$$
$$a^{(1)} = f(z^{(1)}), \qquad z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$$
$$a^{(2)} = f(z^{(2)}), \qquad z^{(3)} = W^{(3)} a^{(2)} + b^{(3)}$$
$$\hat{y} = \text{softmax}(z^{(3)})$$

where $f$ is the activation function of the hidden layers and each layer has 3 nodes.
I am going to assume that you are comfortable with the forward pass in a neural network, so I’m going to jump straight into the backward pass! We will first start off using only 1 sample in the backward pass; afterward, we will see how to extend it to use more than 1 sample.
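For concreteness, here is a minimal NumPy sketch of the forward pass described above. The sigmoid activation, the 3-unit layer sizes, and the variable names are my assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
# Assumed sizes: 3 units per layer, matching the 3-class softmax below.
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)
W3, b3 = rng.normal(size=(3, 3)), np.zeros(3)

x = rng.normal(size=3)   # a single input sample
z1 = W1 @ x + b1         # output of the input layer
a1 = sigmoid(z1)         # hidden layer 1 activation
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)         # hidden layer 2 activation
z3 = W3 @ a2 + b3        # inputs to the softmax
y_hat = softmax(z3)      # predicted probabilities
```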
The output layer and loss function
The output layer of our neural network is a vector of probabilities from the softmax function, whereby the input of the softmax function is the vector $z^{(3)}$:

$$z^{(3)} = \begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix}$$

The $i$-th element of the output of the softmax function is:

$$\hat{y}_i = \frac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}}$$

As the output of the softmax function (which is also our output layer) is multi-valued, it can be succinctly represented by a vector; we will use the vector $\hat{y}$ for the predicted probabilities:

$$\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \end{bmatrix}$$

The outputs of the softmax function are then used as inputs to our loss function, the cross-entropy loss:

$$L = -\sum_{i=1}^{3} y_i \log \hat{y}_i$$

where $y$ is a one-hot vector.
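In code, the softmax output and the cross-entropy loss for one sample might look like this minimal sketch (the example numbers are made up):

```python
import numpy as np

def cross_entropy(y_hat, y):
    # y is one-hot, so only the true class term survives the sum
    return -np.sum(y * np.log(y_hat))

y_hat = np.array([0.7, 0.2, 0.1])  # example softmax output
y = np.array([1.0, 0.0, 0.0])      # one-hot: class 1 is the true class
print(cross_entropy(y_hat, y))     # -log(0.7) ≈ 0.357
```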
Now we have all the information we need to start the first step of the backpropagation algorithm! Our goal is to find how our loss function $L$ changes with respect to $z^{(3)}$. Since $z^{(3)}$ has 3 variables, we need to find how $L$ changes with each of them. We can simplify this problem by first examining how $L$ changes with $z_1$.
Now our goal is to find:

$$\frac{\partial L}{\partial z_1}$$

By looking at the diagram above we see that this is simply:

$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial \hat{y}_1}\frac{\partial \hat{y}_1}{\partial z_1} + \frac{\partial L}{\partial \hat{y}_2}\frac{\partial \hat{y}_2}{\partial z_1} + \frac{\partial L}{\partial \hat{y}_3}\frac{\partial \hat{y}_3}{\partial z_1}$$
Intuitively, the reason the gradients from all three paths are added is that $L$ is a function of $\hat{y}_1$, $\hat{y}_2$, and $\hat{y}_3$, each of which is a function of $z_1$. Therefore, to calculate the change of $L$ with respect to $z_1$ we have to include all the paths through $\hat{y}_1$, $\hat{y}_2$, and $\hat{y}_3$: if any of them changes, $L$ changes too. This is the reason we add all three paths. That said, please read Christopher Olah’s post on how backpropagation works, as it gives a much clearer explanation.
If we change $z_1$ to $z_2$ or $z_3$, it should be easy to convince yourself that:

$$\frac{\partial L}{\partial z_k} = \sum_{i=1}^{3} \frac{\partial L}{\partial \hat{y}_i}\frac{\partial \hat{y}_i}{\partial z_k}$$
In fact, after staring at it for a while, we can see that all three of these sums are computed at once by a single matrix multiplication:

$$\frac{\partial L}{\partial z^{(3)}} = \frac{\partial \hat{y}}{\partial z^{(3)}} \frac{\partial L}{\partial \hat{y}}$$

where:

- $\frac{\partial L}{\partial \hat{y}}$ is the partial derivative of our loss function with respect to its inputs
- $\frac{\partial \hat{y}}{\partial z^{(3)}}$ is the Jacobian of the softmax function (this might not be immediately obvious, but take it for granted for now; I might do a post on deriving the Jacobian of the softmax function in the future!).
One interesting observation is that the columns of the Jacobian represent the edges leading into a node. For example, the first column of the Jacobian represents the edges leading into $\hat{y}_1$ from $z_1$, $z_2$, and $z_3$.
In fact, this is something we will see often during the derivation of backpropagation: the columns of the Jacobian between layer $l$ and layer $l+1$ represent the edges leading in from layer $l$ to a node in layer $l+1$.
So we are nearly there! We know that:

$$\frac{\partial \hat{y}}{\partial z^{(3)}} = \begin{bmatrix} \hat{y}_1(1-\hat{y}_1) & -\hat{y}_1\hat{y}_2 & -\hat{y}_1\hat{y}_3 \\ -\hat{y}_2\hat{y}_1 & \hat{y}_2(1-\hat{y}_2) & -\hat{y}_2\hat{y}_3 \\ -\hat{y}_3\hat{y}_1 & -\hat{y}_3\hat{y}_2 & \hat{y}_3(1-\hat{y}_3) \end{bmatrix}, \qquad \frac{\partial L}{\partial \hat{y}} = \begin{bmatrix} -\frac{y_1}{\hat{y}_1} \\ -\frac{y_2}{\hat{y}_2} \\ -\frac{y_3}{\hat{y}_3} \end{bmatrix}$$

So after the matrix multiplication we get:

$$\frac{\partial L}{\partial z^{(3)}} = \begin{bmatrix} -y_1 + \hat{y}_1(y_1 + y_2 + y_3) \\ -y_2 + \hat{y}_2(y_1 + y_2 + y_3) \\ -y_3 + \hat{y}_3(y_1 + y_2 + y_3) \end{bmatrix}$$
Since $y$ is a one-hot vector, only one element of the vector will be 1 and the rest will be 0s. Assuming that the class represented by the first element of $y$ is the true class, then:

$$y = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$

The result of the matrix multiplication becomes:

$$\frac{\partial L}{\partial z^{(3)}} = \begin{bmatrix} \hat{y}_1 - 1 \\ \hat{y}_2 \\ \hat{y}_3 \end{bmatrix}$$
Again, it should be easy to convince yourself that if the class represented by the second element of $y$ were true instead, then:

$$\frac{\partial L}{\partial z^{(3)}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 - 1 \\ \hat{y}_3 \end{bmatrix}$$
So after all that hard work, we see that $\frac{\partial L}{\partial z^{(3)}}$ can simply be represented by:

$$\delta^{(3)} = \frac{\partial L}{\partial z^{(3)}} = \hat{y} - y$$
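We can sanity-check this simplification numerically: multiplying the softmax Jacobian by $\frac{\partial L}{\partial \hat{y}}$ should give exactly $\hat{y} - y$. A small sketch with made-up numbers:

```python
import numpy as np

y_hat = np.array([0.7, 0.2, 0.1])   # softmax output
y = np.array([1.0, 0.0, 0.0])       # one-hot label

dL_dyhat = -y / y_hat               # dL/dy_hat for cross-entropy
# Softmax Jacobian: J[i, j] = y_hat[i] * (1[i == j] - y_hat[j])
J = np.diag(y_hat) - np.outer(y_hat, y_hat)

print(J @ dL_dyhat)                 # the chain-rule matrix product
print(y_hat - y)                    # the simplified result; identical
```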
I want to take a bit of time to explain the importance of abstracting away the derivatives of the previous layers using $\delta^{(3)}$. We could avoid introducing $\delta^{(3)}$ and keep writing out $\frac{\partial \hat{y}}{\partial z^{(3)}} \frac{\partial L}{\partial \hat{y}}$ in full; although that seems fine now, as we go deeper backwards into the network we would have an ever longer chain of variables to keep track of. This would be near impossible to implement in code, and the derivation would be a lot harder to understand.
Hidden layer 2
The second hidden layer of our neural network produces the inputs to the softmax function in the output layer, whereby $z^{(3)}$ is defined as:

$$z^{(3)} = W^{(3)} a^{(2)} + b^{(3)}, \qquad a^{(2)} = f(z^{(2)})$$
We know how to find $\delta^{(3)} = \frac{\partial L}{\partial z^{(3)}}$, so now our goal is to find:

$$\frac{\partial z^{(3)}}{\partial a^{(2)}}, \qquad \frac{\partial z^{(3)}}{\partial W^{(3)}}, \qquad \frac{\partial z^{(3)}}{\partial b^{(3)}}$$

After we find these gradients, we can simply multiply them by $\delta^{(3)}$ to get the gradients of $a^{(2)}$, $W^{(3)}$, and $b^{(3)}$ with respect to our loss function $L$.
We first start with trying to find $\frac{\partial z^{(3)}}{\partial a^{(2)}}$; like before, we will make the problem easier by first trying to find $\frac{\partial z^{(3)}}{\partial a_1^{(2)}}$.

We can represent $\frac{\partial z^{(3)}}{\partial a_1^{(2)}}$ as a vector of partial derivatives like so:

$$\frac{\partial z^{(3)}}{\partial a_1^{(2)}} = \begin{bmatrix} \frac{\partial z_1}{\partial a_1^{(2)}} & \frac{\partial z_2}{\partial a_1^{(2)}} & \frac{\partial z_3}{\partial a_1^{(2)}} \end{bmatrix}$$

By replacing $a_1^{(2)}$ with any other node in hidden layer 2, we can see that:

$$\frac{\partial z^{(3)}}{\partial a_2^{(2)}} = \begin{bmatrix} \frac{\partial z_1}{\partial a_2^{(2)}} & \frac{\partial z_2}{\partial a_2^{(2)}} & \frac{\partial z_3}{\partial a_2^{(2)}} \end{bmatrix}, \qquad \frac{\partial z^{(3)}}{\partial a_3^{(2)}} = \begin{bmatrix} \frac{\partial z_1}{\partial a_3^{(2)}} & \frac{\partial z_2}{\partial a_3^{(2)}} & \frac{\partial z_3}{\partial a_3^{(2)}} \end{bmatrix}$$

By concatenating these row vectors we get $\frac{\partial z^{(3)}}{\partial a^{(2)}}$:

$$\frac{\partial z^{(3)}}{\partial a^{(2)}} = \begin{bmatrix} \frac{\partial z_1}{\partial a_1^{(2)}} & \frac{\partial z_2}{\partial a_1^{(2)}} & \frac{\partial z_3}{\partial a_1^{(2)}} \\ \frac{\partial z_1}{\partial a_2^{(2)}} & \frac{\partial z_2}{\partial a_2^{(2)}} & \frac{\partial z_3}{\partial a_2^{(2)}} \\ \frac{\partial z_1}{\partial a_3^{(2)}} & \frac{\partial z_2}{\partial a_3^{(2)}} & \frac{\partial z_3}{\partial a_3^{(2)}} \end{bmatrix}$$

This is the Jacobian between layers $a^{(2)}$ and $z^{(3)}$. Again we can see that the columns of the Jacobian represent the edges from all the nodes in $a^{(2)}$ to a node in $z^{(3)}$: for example, the first column contains the derivatives of $z_1$ with respect to every node of $a^{(2)}$.
All that is left now is to find the derivatives in the matrix. We first start by finding $\frac{\partial z_1}{\partial a_1^{(2)}}$. Since $z_1 = W_{11}^{(3)} a_1^{(2)} + W_{12}^{(3)} a_2^{(2)} + W_{13}^{(3)} a_3^{(2)} + b_1^{(3)}$:

$$\frac{\partial z_1}{\partial a_1^{(2)}} = W_{11}^{(3)}$$

Let’s find a few more to see if we can find a pattern:

$$\frac{\partial z_2}{\partial a_1^{(2)}} = W_{21}^{(3)}, \qquad \frac{\partial z_1}{\partial a_2^{(2)}} = W_{12}^{(3)}$$

Hopefully now it shouldn’t be hard to convince yourself that:

$$\frac{\partial z^{(3)}}{\partial a^{(2)}} = \begin{bmatrix} W_{11}^{(3)} & W_{21}^{(3)} & W_{31}^{(3)} \\ W_{12}^{(3)} & W_{22}^{(3)} & W_{32}^{(3)} \\ W_{13}^{(3)} & W_{23}^{(3)} & W_{33}^{(3)} \end{bmatrix} = W^{(3)T}$$

Again, this is a pattern that will keep emerging as we continue: the Jacobian between a layer’s inputs and its outputs is simply the transpose of the weight matrix connecting them.
Now finally putting it all together:
Note how we are able to express as a function our of previously calculated gradients , there was no need for us to recalculate it from scratch.
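As a sketch of this reuse in NumPy (the weights and the $\delta^{(3)}$ values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
W3 = rng.normal(size=(3, 3))         # hidden layer 2's weight matrix
delta3 = np.array([-0.3, 0.2, 0.1])  # y_hat - y from the output layer

# The Jacobian is W3.T, so propagating the gradient one layer back
# is a single matrix multiplication that reuses delta3.
dL_da2 = W3.T @ delta3
print(dL_da2)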
Next up is $\frac{\partial z^{(3)}}{\partial W^{(3)}}$. At first this might seem very daunting, since we are differentiating a vector with respect to a matrix, but if we just find $\frac{\partial z^{(3)}}{\partial W_{11}^{(3)}}$ first and work from there, we will see that it isn’t as hard as it looks!

$$\frac{\partial z^{(3)}}{\partial W_{11}^{(3)}} = \begin{bmatrix} a_1^{(2)} & 0 & 0 \end{bmatrix}$$
Let’s take 2 more elements of the weight matrix and find their derivatives with respect to $z^{(3)}$:

$$\frac{\partial z^{(3)}}{\partial W_{12}^{(3)}} = \begin{bmatrix} a_2^{(2)} & 0 & 0 \end{bmatrix}, \qquad \frac{\partial z^{(3)}}{\partial W_{21}^{(3)}} = \begin{bmatrix} 0 & a_1^{(2)} & 0 \end{bmatrix}$$

So we can see that for any $W_{ij}^{(3)}$:

$$\frac{\partial z_k}{\partial W_{ij}^{(3)}} = \mathbb{1}[k = i] \, a_j^{(2)}$$

where $\mathbb{1}[\cdot]$ is the indicator function; it simply equals 1 if the condition within the brackets is true, and 0 otherwise.
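A quick numerical sanity check of the indicator formula, with made-up values: perturbing $W_{12}^{(3)}$ should only move $z_1$, and by $a_2^{(2)}$ per unit of perturbation.

```python
import numpy as np

a2 = np.array([0.5, 0.1, 0.9])
W3 = np.ones((3, 3))

def z3(W):
    return W @ a2  # bias omitted since it does not affect dz/dW

eps = 1e-6
W_perturbed = W3.copy()
W_perturbed[0, 1] += eps                 # W_12 in math notation (i=1, j=2)
print((z3(W_perturbed) - z3(W3)) / eps)  # ≈ [a2[1], 0, 0] = [0.1, 0, 0]
```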
Now it might still seem a bit complicated due to the indicator functions, but regardless let’s push on, and later we will see that this simplifies to something very elegant! Let us now try to find $\frac{\partial L}{\partial W^{(3)}}$!
Now let’s try to find $\frac{\partial L}{\partial W_{11}^{(3)}}$ and see if we can see any patterns:

$$\frac{\partial L}{\partial W_{11}^{(3)}} = \sum_{k=1}^{3} \delta_k^{(3)} \frac{\partial z_k}{\partial W_{11}^{(3)}} = \delta_1^{(3)} a_1^{(2)}$$

Let’s try a few more elements of $W^{(3)}$:

$$\frac{\partial L}{\partial W_{12}^{(3)}} = \delta_1^{(3)} a_2^{(2)}, \qquad \frac{\partial L}{\partial W_{21}^{(3)}} = \delta_2^{(3)} a_1^{(2)}$$

Now, if we stare at these examples a bit, we can see that:

$$\frac{\partial L}{\partial W_{ij}^{(3)}} = \delta_i^{(3)} a_j^{(2)}$$
Now we can write out $\frac{\partial L}{\partial W^{(3)}}$ in full:

$$\frac{\partial L}{\partial W^{(3)}} = \begin{bmatrix} \delta_1^{(3)} a_1^{(2)} & \delta_1^{(3)} a_2^{(2)} & \delta_1^{(3)} a_3^{(2)} \\ \delta_2^{(3)} a_1^{(2)} & \delta_2^{(3)} a_2^{(2)} & \delta_2^{(3)} a_3^{(2)} \\ \delta_3^{(3)} a_1^{(2)} & \delta_3^{(3)} a_2^{(2)} & \delta_3^{(3)} a_3^{(2)} \end{bmatrix}$$

Now again, if we stare at it for long enough, we can see that this is simply (well, maybe not that simple) an outer product:

$$\frac{\partial L}{\partial W^{(3)}} = \delta^{(3)} a^{(2)T}$$
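In NumPy this outer product is a one-liner; the numbers below are made up:

```python
import numpy as np

delta3 = np.array([-0.3, 0.2, 0.1])  # dL/dz3 from the output layer
a2 = np.array([0.5, 0.1, 0.9])       # hidden layer 2 activations

dL_dW3 = np.outer(delta3, a2)        # delta3 @ a2^T, a 3x3 matrix
print(dL_dW3[0, 1])                  # equals delta3[0] * a2[1]
```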
The last gradient we need is $\frac{\partial z^{(3)}}{\partial b^{(3)}}$. Again, since we are finding the change of a vector with respect to another vector, we will have a Jacobian, although this one is substantially easier to calculate! As always, we start by trying to find $\frac{\partial z^{(3)}}{\partial b_1^{(3)}}$ and then extend it from there:

$$\frac{\partial z^{(3)}}{\partial b_1^{(3)}} = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$$

It shouldn’t be too hard to convince yourself that:

$$\frac{\partial z^{(3)}}{\partial b_2^{(3)}} = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}, \qquad \frac{\partial z^{(3)}}{\partial b_3^{(3)}} = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}$$

Then we concatenate the vectors together to form $\frac{\partial z^{(3)}}{\partial b^{(3)}}$, which is just the identity matrix, so:

$$\frac{\partial z^{(3)}}{\partial b^{(3)}} = I, \qquad \frac{\partial L}{\partial b^{(3)}} = I \, \delta^{(3)} = \delta^{(3)}$$
Let’s reiterate what we have derived so far:

$$\delta^{(3)} = \hat{y} - y$$
$$\frac{\partial L}{\partial a^{(2)}} = W^{(3)T} \delta^{(3)}, \qquad \frac{\partial L}{\partial W^{(3)}} = \delta^{(3)} a^{(2)T}, \qquad \frac{\partial L}{\partial b^{(3)}} = \delta^{(3)}$$
Hidden layer 1
The first hidden layer produces the inputs to the second hidden layer, where $z^{(2)}$ is defined as:

$$z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, \qquad a^{(1)} = f(z^{(1)})$$
The first step is to find $\frac{\partial L}{\partial z^{(2)}}$ and express it using previously calculated values, like we did for $\delta^{(3)}$. To do this we need to first find $\frac{\partial a^{(2)}}{\partial z^{(2)}}$; after we have found it, we can simply do:

$$\frac{\partial L}{\partial z^{(2)}} = \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial L}{\partial a^{(2)}}$$

to find $\frac{\partial L}{\partial z^{(2)}}$.
To find $\frac{\partial a^{(2)}}{\partial z^{(2)}}$, we will use the same trick we used before and find $\frac{\partial a^{(2)}}{\partial z_1^{(2)}}$ first and work from there. Since $a_i^{(2)} = f(z_i^{(2)})$ element-wise, only the first element survives:

$$\frac{\partial a^{(2)}}{\partial z_1^{(2)}} = \begin{bmatrix} f'(z_1^{(2)}) & 0 & 0 \end{bmatrix}$$

If we did the same thing for $z_2^{(2)}$ and $z_3^{(2)}$ and concatenated their vectors, we would get a diagonal Jacobian, so the matrix multiplication collapses into an element-wise product:

$$\frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial a^{(2)}} \odot f'(z^{(2)})$$
Note that $\odot$ here is called the Hadamard product, which is an element-wise multiplication of two vectors/matrices with the same dimensions.
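Here is how $\delta^{(2)}$ might be computed in NumPy, assuming a sigmoid activation (so $f'(z) = f(z)\,(1 - f(z))$); the values are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W3 = rng.normal(size=(3, 3))
delta3 = np.array([-0.3, 0.2, 0.1])
z2 = rng.normal(size=3)

# f'(z) for the sigmoid is sigmoid(z) * (1 - sigmoid(z))
f_prime = sigmoid(z2) * (1 - sigmoid(z2))
delta2 = (W3.T @ delta3) * f_prime  # '*' on arrays is the Hadamard product
```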
From here on out it’s all rinse and repeat; we just follow the same steps that we went through above! First, let’s define a variable for the gradient we just found:

$$\delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = \left( W^{(3)T} \delta^{(3)} \right) \odot f'(z^{(2)})$$
We know how to find $\delta^{(2)}$, so now our goal is to find:

$$\frac{\partial z^{(2)}}{\partial a^{(1)}}, \qquad \frac{\partial z^{(2)}}{\partial W^{(2)}}, \qquad \frac{\partial z^{(2)}}{\partial b^{(2)}}$$

If we go through the same steps as we did above, we get:

$$\frac{\partial L}{\partial a^{(1)}} = W^{(2)T} \delta^{(2)}, \qquad \frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} a^{(1)T}, \qquad \frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$$
I will leave this for you to verify yourself (working through it will give you much more value than me working through it for you, though if you get stuck I’m more than happy to help!).
Input layer

The input layer produces the inputs to the first hidden layer, where $z^{(1)}$ is defined as:

$$z^{(1)} = W^{(1)} x + b^{(1)}$$
Like last time, we have to find $\frac{\partial L}{\partial z^{(1)}}$ and express it using previously calculated values. You should end up with something like this:

$$\frac{\partial L}{\partial z^{(1)}} = \left( W^{(2)T} \delta^{(2)} \right) \odot f'(z^{(1)})$$
You are so close to finishing one iteration of the backpropagation algorithm! For this last part we will again need to define a variable:

$$\delta^{(1)} = \frac{\partial L}{\partial z^{(1)}}$$
As we already know, our goal is to find:

$$\frac{\partial z^{(1)}}{\partial W^{(1)}}, \qquad \frac{\partial z^{(1)}}{\partial b^{(1)}}$$

This is what you should be getting:

$$\frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} x^T, \qquad \frac{\partial L}{\partial b^{(1)}} = \delta^{(1)}$$
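Putting the whole single-sample derivation together, a backward pass might look like the following sketch. The sigmoid activations, 3-unit layers, and function names are my assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def backward(x, y, W1, b1, W2, b2, W3, b3):
    # Forward pass, keeping the intermediate values needed for the gradients.
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    y_hat = softmax(W3 @ a2 + b3)

    # Backward pass: each delta is built from the one after it.
    # For the sigmoid, f'(z) = a * (1 - a).
    d3 = y_hat - y
    d2 = (W3.T @ d3) * a2 * (1 - a2)
    d1 = (W2.T @ d2) * a1 * (1 - a1)

    return {
        "W3": np.outer(d3, a2), "b3": d3,
        "W2": np.outer(d2, a1), "b2": d2,
        "W1": np.outer(d1, x),  "b1": d1,
    }

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 3)) for _ in range(3)]
bs = [np.zeros(3) for _ in range(3)]
grads = backward(rng.normal(size=3), np.array([1.0, 0.0, 0.0]),
                 Ws[0], bs[0], Ws[1], bs[1], Ws[2], bs[2])
```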
Summary of gradients
Hidden layer 2:

$$\delta^{(3)} = \hat{y} - y, \qquad \frac{\partial L}{\partial W^{(3)}} = \delta^{(3)} a^{(2)T}, \qquad \frac{\partial L}{\partial b^{(3)}} = \delta^{(3)}$$

Hidden layer 1:

$$\delta^{(2)} = \left( W^{(3)T} \delta^{(3)} \right) \odot f'(z^{(2)}), \qquad \frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} a^{(1)T}, \qquad \frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$$

Input layer:

$$\delta^{(1)} = \left( W^{(2)T} \delta^{(2)} \right) \odot f'(z^{(1)}), \qquad \frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} x^T, \qquad \frac{\partial L}{\partial b^{(1)}} = \delta^{(1)}$$
There are a few interesting observations that can be made. Assume that we have a neural network with $L$ layers, where layer $L$ is the output layer and layer 1 is the input layer (so, to clarify, $z^{(1)}$ is the output of the input layer and $z^{(L)}$ is the input to the softmax, and so on). Then for all layers $l$:

- The gradients follow the same pattern: $\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} a^{(l-1)T}$ and $\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$, where $\delta^{(l)} = \left( W^{(l+1)T} \delta^{(l+1)} \right) \odot f'(z^{(l)})$.
- The abstraction step is always made for the gradient of the cost function with respect to the output of a layer. For example, note how $\delta^{(1)} = \frac{\partial L}{\partial z^{(1)}}$, where $z^{(1)}$ is the output of the input layer. This is because the gradient of the cost function with respect to the output of a layer is reused in the expressions for the gradients of the cost function with respect to the weights, biases, and inputs of that layer (see the sketch below).
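To make the generalization concrete, here is a sketch of the delta recursion for an arbitrary number of layers. The function name, argument layout, and indexing convention are all assumptions:

```python
import numpy as np

def backprop_deltas(delta_L, Ws, zs, f_prime):
    """Propagate deltas from the output layer back to layer 1.

    A sketch of the general pattern only: Ws[l] is the weight matrix that
    consumes layer l's activations, zs[l] is layer l's output, and delta_L
    is the gradient of the loss w.r.t. the last pre-softmax output.
    """
    deltas = [delta_L]
    # Walk backwards: delta_l = (W_{l+1}^T delta_{l+1}) ⊙ f'(z_l)
    for W, z in zip(reversed(Ws[1:]), reversed(zs[:-1])):
        deltas.append((W.T @ deltas[-1]) * f_prime(z))
    return list(reversed(deltas))
```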
Extension to using more than 1 sample in the backward pass
Extending the backpropagation algorithm to take more than one sample is relatively straightforward; the beauty of using matrix notation is that we don’t really have to change anything! As an example, let’s run the backward pass using 3 samples instead of 1 on the output layer and hidden layer 2.
With 3 samples we stack the three $\delta^{(3)}$ vectors as the rows of a matrix $\Delta^{(3)}$, and the three $a^{(2)}$ vectors as the rows of a matrix $A^{(2)}$. Here we see that:

$$\frac{\partial L}{\partial W^{(3)}} = \Delta^{(3)T} A^{(2)} = \sum_{s=1}^{3} \delta^{(3,s)} a^{(2,s)T}$$

If we think about it, this makes sense: each term in the summation is how much the cost function would change with respect to $W^{(3)}$ for one sample. For example, $\delta^{(3,1)} a^{(2,1)T}$ is how much the cost function would change with respect to $W^{(3)}$ for sample 1. So since we have 3 samples, we need to sum the change of the cost function with respect to $W^{(3)}$ over all of the samples to find the total change in the cost function.
This is the only term whose form changes when we extend from 1 sample to more than one sample. Before I show you the answer, let’s first take a look at $\Delta^{(3)}$:

$$\Delta^{(3)} = \begin{bmatrix} \delta_1^{(3,1)} & \delta_2^{(3,1)} & \delta_3^{(3,1)} \\ \delta_1^{(3,2)} & \delta_2^{(3,2)} & \delta_3^{(3,2)} \\ \delta_1^{(3,3)} & \delta_2^{(3,3)} & \delta_3^{(3,3)} \end{bmatrix}$$

Now, remember from before that $\delta_1^{(3,1)}$ is how much the cost function changes with respect to $b_1^{(3)}$ for sample 1; $\delta_1^{(3,2)}$ is how much the cost function changes with respect to $b_1^{(3)}$ for sample 2, and so on. Since the bias is shared across all samples, to find $\frac{\partial L}{\partial b_1^{(3)}}$ we simply sum up the first column of $\Delta^{(3)}$; this sums up how much the cost function changes with respect to $b_1^{(3)}$ for each sample.
Therefore we end up with:

$$\frac{\partial L}{\partial b^{(3)}} = \sum_{s=1}^{3} \delta^{(3,s)}$$
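A sketch of the batched versions of these two gradients, with rows as samples and made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
Delta3 = rng.normal(size=(3, 3))  # one row of deltas per sample
A2 = rng.normal(size=(3, 3))      # one row of hidden layer 2 activations per sample

dW3 = Delta3.T @ A2               # equivalent to summing the per-sample outer products
db3 = Delta3.sum(axis=0)          # sum each column of Delta3 across the 3 samples
```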
From here it should be relatively simple to repeat what I have done for hidden layer 1 and the input layer.
This post turned out to be a lot longer than I expected, but I hope you have learned how backpropagation works, and that you are now able to apply it to a neural network of arbitrary depth! This is my first blog post, so I’m sure there are things I haven’t explained clearly; please don’t hesitate to tell me so that I can become a better writer 🙂