This is the third of a short series of posts to help the reader to understand what we mean by neural networks and how they work. Our first post explained what we mean by a neuron and introduced the mathematics of how to calculate the numbers associated with it.
In our second post we gave you some simple code in R that illustrated the topics from the first post.
This post will explain the back propagation algorithm by deriving it mathematically.
This series of posts is accompanied by some example R scripts for you to run. You can download them from The Data Scientists GitHub page.
1 Features of a Neural Network
So far in this series we have considered single neurons. We experimented with one input and then two, and with one test case and then four. Now that we understand neurons, we can introduce neural networks: networks of neurons. To get us started we will consider an extremely simple network, as shown in the diagram below.
2 Explanation of the network
Previously we talked about input layers and output layers, whereas now we have labelled the layers consecutively as $L-2$, $L-1$ and $L$. This is because the above diagram could, in fact, represent neurons anywhere in a network. In other words, the left hand nodes are not necessarily inputs and the right hand nodes are not necessarily outputs. A layer of neurons in the middle of a network is called a hidden layer. The first two nodes on the left therefore show no weights, biases or activation functions. We just know that, for whatever connections and inputs exist to the left, there will be a value $a^{L-2}_k$ coming out of each node $k$ in layer $L-2$.
The remaining three nodes in layers $L-1$ and $L$ have weights (shown on the diagram as $w$), they have biases (shown as $b$) and they have activation functions, which are not shown. Neither do we show the values of $z$ (the weighted input sum) and $a$ (the output of the activation function), but we understand that every node will have both of these variables.
Note that the indexes shown on the weights have a counterintuitive order. The digits refer firstly to the receiving node and then lastly to the sending node. Thus $w_{21}$ is the weight that connects node 1 in layer $L-2$ to node 2 in layer $L-1$. This may seem to be the wrong way round, but most texts on neural networks adopt this approach and it is done deliberately. You will see why when we expand our network in the next section.
Expand the math
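The example above is also accessible from a mathematical point of view. Writing $\sigma$ for the activation function, and using a superscript to show the layer that a weight or bias belongs to, we can expand the output value $a^L$ as follows:

$$a^L = \sigma\left(z^L\right) = \sigma\left(\sum_{k=1}^{2} w^L_{1k}\, a^{L-1}_k + b^L\right) = \sigma\left(\sum_{k=1}^{2} w^L_{1k}\, \sigma\left(\sum_{i=1}^{2} w^{L-1}_{ki}\, a^{L-2}_i + b^{L-1}_k\right) + b^L\right)$$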
3 Simplifying the math using matrices
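The above formula is getting unwieldy, and with all the nested sums, subscripts and superscripts it quickly becomes confusing. Imagine the confusion if we had many more layers, inputs and outputs. Let us write the above equation a different way:

$$a^L = \sigma\left(\begin{pmatrix} w^L_{11} & w^L_{12} \end{pmatrix} \sigma\left(\begin{pmatrix} w^{L-1}_{11} & w^{L-1}_{12} \\ w^{L-1}_{21} & w^{L-1}_{22} \end{pmatrix} \begin{pmatrix} a^{L-2}_1 \\ a^{L-2}_2 \end{pmatrix} + \begin{pmatrix} b^{L-1}_1 \\ b^{L-1}_2 \end{pmatrix}\right) + b^L\right)$$

This is still a bit messy, but we can now see how we can describe the network recursively by one very small and simple equation, where $W^l$ is the matrix of weights and $b^l$ the matrix of biases for layer $l$:

$$a^l = \sigma\left(W^l a^{l-1} + b^l\right)$$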
Starting at $a^1$, where the matrix $a^1$ represents our inputs, we can recursively compute $a^l$ for each layer in a forward pass through the network from input to output. We can also now see why we adopted the counterintuitive numbering. When the weights are placed in a matrix, the numbering represents rows and then columns, as we would expect.
In our diagram $a^L$ is a scalar because we only have one node in the last layer, but we are more likely to have many nodes. For this reason we tend to always use matrix notation, safe in the knowledge that this will also encompass scalars, which are just a special case of the set of matrices (a one by one matrix).
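To make the forward pass concrete, here is a minimal sketch in R of the recursion applied to a network shaped like the one in our diagram (two nodes, then two nodes, then one node). The sigmoid activation, the function name `forward_layer` and all of the weight, bias and input values are assumptions made up for illustration; they are not taken from the example scripts.

```r
# Sigmoid activation (an assumed choice for this sketch)
sigmoid <- function(z) 1 / (1 + exp(-z))

# One step of the recursion: activation of a layer from the layer before it
# (forward_layer is a hypothetical helper name)
forward_layer <- function(W, a_prev, b) sigmoid(W %*% a_prev + b)

# Made-up parameters. Row index = receiving node, column index = sending node.
W_hidden <- matrix(c(0.1, 0.2,
                     0.3, 0.4), nrow = 2, byrow = TRUE)
b_hidden <- matrix(c(0.5, -0.5), ncol = 1)
W_out    <- matrix(c(0.6, -0.7), nrow = 1)
b_out    <- matrix(0.1, nrow = 1, ncol = 1)

a_in     <- matrix(c(1, 0), ncol = 1)                # values leaving the left-hand nodes
a_hidden <- forward_layer(W_hidden, a_in, b_hidden)  # 2 x 1 matrix
a_out    <- forward_layer(W_out, a_hidden, b_out)    # 1 x 1 matrix, i.e. a scalar

# Row 2, column 1 is the weight from sending node 1 to receiving node 2
print(W_hidden[2, 1])
print(dim(a_out))
```

Because the weights are indexed by receiving node first, `W %*% a_prev` performs all of the weighted sums for a layer in one step, and the single output node simply produces a one by one matrix.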
We have already introduced the concept of a cost function, which compares the output ($a^L$) to a known value ($y$) and quantifies how close the two are. The cost function is dependent on both the output and the known value. However, when training, the known value $y$ is a constant, and so for taking derivatives we can consider the cost to be a function of $a^L$ only. Using $C$ to denote the cost function, the network definition can be completed by adding:

$$C = C\left(a^L\right)$$
4 The Back Propagation Algorithm
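We now introduce a new measure, $\delta^L$, to show how the final cost varies when we make changes to the input sum $z^L$ applied to the last neuron in our diagram.

$$\delta^L \equiv \frac{\partial C}{\partial z^L} \qquad (4.1)$$

To see why this new measure is so useful, we can use it to tune the output neuron to minimize the cost by adjusting its weights and bias. Previously we calculated the slope of the cost function with respect to weight and bias in order to use gradient descent. If we repeat that exercise, applying the chain rule to $z^L = \sum_k w^L_{1k}\, a^{L-1}_k + b^L$, we see that both of these slopes are easily obtained from our new measure $\delta^L$.

$$\frac{\partial C}{\partial b^L} = \frac{\partial C}{\partial z^L} \frac{\partial z^L}{\partial b^L} = \delta^L \qquad (4.2)$$

$$\frac{\partial C}{\partial w^L_{1k}} = \frac{\partial C}{\partial z^L} \frac{\partial z^L}{\partial w^L_{1k}} = \delta^L\, a^{L-1}_k \qquad (4.3)$$

Equation 4.2 tells us that the gradient of the cost with respect to the bias is equal to our new metric.
Equation 4.3 tells us that the gradient of the cost with respect to a weight is proportional to our new metric: we just need to multiply by $a^{L-1}_k$, which is a value we calculated during the forward recursive pass.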
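As a quick sanity check, the sketch below computes the bias and weight gradients of a single output neuron from the new metric and compares them with slopes estimated numerically. It assumes a sigmoid activation and a quadratic cost, and every number and name in it (`a_prev`, `cost_at`, `eps` and so on) is made up for illustration.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
cost    <- function(a, y) 0.5 * (a - y)^2   # assumed quadratic cost

a_prev <- c(0.8, 0.3)    # made-up activations from the previous layer
w      <- c(0.6, -0.7)   # made-up weights into the output neuron
b      <- 0.1            # made-up bias
y      <- 1              # known value

z <- sum(w * a_prev) + b
a <- sigmoid(z)

# The new metric: dC/dz = dC/da * sigma'(z), with sigma'(z) = a * (1 - a) for the sigmoid
delta  <- (a - y) * a * (1 - a)
grad_b <- delta            # equation 4.2
grad_w <- delta * a_prev   # equation 4.3, one value per weight

# Numerical slopes by central differences, for comparison
cost_at <- function(w, b) cost(sigmoid(sum(w * a_prev) + b), y)
eps <- 1e-6
num_grad_b  <- (cost_at(w, b + eps) - cost_at(w, b - eps)) / (2 * eps)
num_grad_w1 <- (cost_at(w + c(eps, 0), b) - cost_at(w - c(eps, 0), b)) / (2 * eps)

print(c(grad_b, num_grad_b))
print(c(grad_w[1], num_grad_w1))
```

The two numbers on each printed line should agree to several decimal places, which is a useful habit whenever you derive gradients by hand.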
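If we have more than one output neuron then the metric $\delta^L$ will no longer be a scalar. There will be a value $\delta^L_j$ for each neuron $j$ in the output layer. Because $a^L_j = \sigma(z^L_j)$, the chain rule gives $\delta^L_j = \frac{\partial C}{\partial a^L_j}\, \sigma'(z^L_j)$. Writing $\nabla_a C$ for the matrix of partial derivatives $\partial C / \partial a^L_j$, our equations in matrix form become:

$$\delta^L = \nabla_a C \odot \sigma'\left(z^L\right) \qquad (BP1)$$

$$\frac{\partial C}{\partial b^L} = \delta^L \qquad (BP2)$$

$$\frac{\partial C}{\partial W^L} = \delta^L \left(a^{L-1}\right)^T \qquad (BP3)$$

Where $\odot$ is the Hadamard product, or element-wise multiplication.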
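In R the Hadamard product is simply the ordinary `*` operator, whereas `%*%` is the matrix product. The following sketch shows the metric for an output layer of three neurons, assuming a sigmoid activation and a quadratic cost; the values of `z_out` and `y` are made up for illustration.

```r
sigmoid       <- function(z) 1 / (1 + exp(-z))
sigmoid_prime <- function(z) sigmoid(z) * (1 - sigmoid(z))

z_out <- matrix(c(0.2, -1.0, 0.5), ncol = 1)   # made-up input sums for three output neurons
y     <- matrix(c(1, 0, 0), ncol = 1)          # made-up known values
a_out <- sigmoid(z_out)

# For the assumed quadratic cost 0.5 * sum((a - y)^2), the gradient with respect to a is (a - y)
grad_a <- a_out - y

# Hadamard product: `*` multiplies element by element, giving one value per output neuron.
# (grad_a %*% sigmoid_prime(z_out) would fail here: two 3 x 1 matrices are non-conformable.)
delta_out <- grad_a * sigmoid_prime(z_out)

print(delta_out)
```

The result has the same shape as the output layer, one row per neuron, which is exactly what we need in order to tune each neuron's bias and weights separately.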
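Now that we have a very simple calculation for tuning our output neuron, let us think about the neurons in prior layers, specifically layer $L-1$.
The cost will be the total of the costs from each output neuron, and each output neuron is influenced by all the neurons in the previous layer. To help us analyse this, let us imagine that we have $n$ neurons in layer $L$ and $m$ neurons in layer $L-1$. For neuron $k$ in layer $L-1$, the chain rule applied to $z^L_j = \sum_{i=1}^{m} w^L_{ji}\, \sigma(z^{L-1}_i) + b^L_j$ gives:

$$\delta^{L-1}_k = \frac{\partial C}{\partial z^{L-1}_k} = \sum_{j=1}^{n} \frac{\partial C}{\partial z^L_j} \frac{\partial z^L_j}{\partial z^{L-1}_k} = \sum_{j=1}^{n} \frac{\partial C}{\partial z^L_j} \sum_{i=1}^{m} w^L_{ji}\, \sigma'\left(z^{L-1}_i\right) \frac{\partial z^{L-1}_i}{\partial z^{L-1}_k}$$

Where $i = k$ the final partial derivative becomes one, and conversely where $i \neq k$ it is zero. We can therefore simplify the second sum.

$$\delta^{L-1}_k = \sum_{j=1}^{n} \frac{\partial C}{\partial z^L_j}\, w^L_{jk}\, \sigma'\left(z^{L-1}_k\right)$$

But equation 4.1 allows us to substitute $\delta^L_j$ for $\partial C / \partial z^L_j$:

$$\delta^{L-1}_k = \sigma'\left(z^{L-1}_k\right) \sum_{j=1}^{n} w^L_{jk}\, \delta^L_j \qquad (4.4)$$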
The Heart of Back Propagation
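This equation has expressed $\delta^{L-1}$ in terms of $\delta^L$. This is again a recursive format, which means that once we have the $\delta$ for the last layer we can then compute the $\delta$ for each earlier layer recursively, from the output of the network back to the input of the network. This is the back propagation algorithm, which helps to speed up machine learning in neural networks because each layer reuses the values already computed for the layer after it. This will be the final equation in the set of four which describe back propagation. Our other equations have all been written in matrix format so let us now do the same for 4.4.

$$\begin{pmatrix} \delta^{L-1}_1 \\ \vdots \\ \delta^{L-1}_m \end{pmatrix} = \begin{pmatrix} w^L_{11} & \cdots & w^L_{n1} \\ \vdots & & \vdots \\ w^L_{1m} & \cdots & w^L_{nm} \end{pmatrix} \begin{pmatrix} \delta^L_1 \\ \vdots \\ \delta^L_n \end{pmatrix} \odot \begin{pmatrix} \sigma'\left(z^{L-1}_1\right) \\ \vdots \\ \sigma'\left(z^{L-1}_m\right) \end{pmatrix}$$

From here we can now see a convenient matrix format for our recursive formula, because the matrix of weights shown is the transpose of $W^L$:

$$\delta^{L-1} = \left(\left(W^L\right)^T \delta^L\right) \odot \sigma'\left(z^{L-1}\right) \qquad (BP4)$$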
We are very close to having the final back propagation equations. However, BP2 and BP3 refer to gradients in the final layer $L$. What about the other layers? In fact the metric $\delta$ works in the same way in every layer as it does in the final layer, and so these two equations can be applied to any layer.
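Putting the pieces together, the sketch below runs one forward pass and one backward pass through a small network and checks one of the resulting gradients numerically. It is only an illustration under stated assumptions: a sigmoid activation, a quadratic cost, a made-up network with two inputs, three hidden neurons and two outputs, and random weights. None of the names or values come from the example scripts.

```r
sigmoid       <- function(z) 1 / (1 + exp(-z))
sigmoid_prime <- function(z) sigmoid(z) * (1 - sigmoid(z))

set.seed(1)
W1 <- matrix(rnorm(6), nrow = 3, ncol = 2)   # hidden layer: 3 receiving nodes, 2 sending nodes
b1 <- matrix(rnorm(3), ncol = 1)
W2 <- matrix(rnorm(6), nrow = 2, ncol = 3)   # output layer: 2 receiving nodes, 3 sending nodes
b2 <- matrix(rnorm(2), ncol = 1)
x  <- matrix(c(0.5, -0.2), ncol = 1)         # made-up inputs
y  <- matrix(c(1, 0), ncol = 1)              # made-up known values

# Cost of the whole network as a function of its parameters (assumed quadratic cost)
cost_for <- function(W1, b1, W2, b2) {
  a1 <- sigmoid(W1 %*% x + b1)
  a2 <- sigmoid(W2 %*% a1 + b2)
  0.5 * sum((a2 - y)^2)
}

# Forward pass, keeping the input sums z and the outputs a of every layer
z1 <- W1 %*% x + b1
a1 <- sigmoid(z1)
z2 <- W2 %*% a1 + b2
a2 <- sigmoid(z2)

# Backward pass
delta2 <- (a2 - y) * sigmoid_prime(z2)             # metric for the output layer
delta1 <- (t(W2) %*% delta2) * sigmoid_prime(z1)   # recursion back to the hidden layer

# The bias and weight gradients (BP2 and BP3) applied to both layers
grad_b2 <- delta2
grad_W2 <- delta2 %*% t(a1)
grad_b1 <- delta1
grad_W1 <- delta1 %*% t(x)

# Numerical check of one hidden-layer weight by central differences
eps <- 1e-6
W1_plus  <- W1; W1_plus[2, 1]  <- W1_plus[2, 1] + eps
W1_minus <- W1; W1_minus[2, 1] <- W1_minus[2, 1] - eps
num_grad <- (cost_for(W1_plus, b1, W2, b2) - cost_for(W1_minus, b1, W2, b2)) / (2 * eps)

print(c(grad_W1[2, 1], num_grad))
```

Notice that each gradient matrix has the same shape as the weights or biases it belongs to, so a gradient descent step is just a subtraction of a small multiple of the gradient from the parameter.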