Introduction
This is the third in a short series of posts to help the reader understand what we mean by neural networks and how they work. Our first post explained what we mean by a neuron and introduced the mathematics for calculating the numbers associated with it.
In our second post we gave you some simple code in R that illustrated the topics from the first post.
This post will explain the back propagation algorithm by deriving it mathematically.
This series of posts is accompanied by some example R scripts for you to run. You can download them from The Data Scientists GitHub page.
1 Features of a Neural Network
So far in this series we have considered single neurons. We have experimented with having just one input and then two. We also experimented with having one test case and then four. Now that we understand neurons, we can introduce neural networks – networks of neurons. To get us started we will consider an extremely simple network, as shown in the diagram below.
\begin{tikzpicture}
[+preamble]
\usepackage{tikz}
\usetikzlibrary{positioning}
\tikzset{%
every neuron/.style={
circle,
draw,
minimum size = 1.2cm
},
neuron missing/.style={
draw=none,
scale=3,
text height = 0.333cm,
execute at begin node=\color{black}$\vdots$
},
}
[/preamble]
\foreach \m [count=\y] in {1,2}
\node [every neuron/.try, neuron \m/.try] (input-\m) at (0,5-\y*4) {$a_\m^{L-2}$};
\foreach \m [count=\y] in {1,2}
\node [every neuron/.try, neuron \m/.try ] (hidden-\m) at (4,2-\y*2) {$b_\m^{L-1}$};
\foreach \m [count=\y] in {1}
\node [every neuron/.try, neuron \m/.try ] (output-\m) at (8,2-\y*3) {$b^{L}$};
\foreach \l [count=\i] in {1,2}
\draw [<-] (input-\i) -- ++(-4,0) node [midway, fill=white] {$I_{\l}$};
\foreach \l [count=\i] in {1} \draw [->] (output-\i) -- ++(2,0)
node [above, midway] {};
\foreach \i in {1,2}
\foreach \j in {1,2}
\draw [->] (input-\i) -- (hidden-\j) node [midway, fill=white] {$w_{\j\i}^{L-1}$};
\foreach \i in {1,2}
\foreach \j in {1}
\draw [->] (hidden-\i) -- (output-\j) node [midway, fill=white]{$w_\i^L$};
\foreach \l [count=\x from 0] in {L-2, L-1, L}
\node [align=center, above] at (\x*4,2) {Layer \\ $\l$};
\end{tikzpicture}
2 Explanation of the network
Previously we talked about input layers and output layers, whereas now we have labelled the layers consecutively. This is because the above diagram could, in fact, represent neurons anywhere in a network. In other words, the left hand nodes are not necessarily inputs and the right hand nodes are not necessarily outputs. A layer of neurons in the middle of a network is called a hidden layer. The first two nodes on the left therefore show no weights, biases or activation functions. We just know that, whatever connections and inputs exist to the left, there will be a value $a^{L-2}$ entering the nodes in layer $L-2$.
The remaining three nodes in layers $L-1$ and $L$ have weights (shown on the diagram as $w$), biases (shown as $b$) and activation functions $a=\sigma(z)$, which are not shown. Neither do we show the values of $z$ and $a$, but we understand that every node has both of these variables.
Note that the indices shown on the weights have a counter-intuitive order: the first index refers to the receiving node and the second to the sending node. Thus $w^{L-1}_{21}$ is the weight that connects node 1 in layer $L-2$ to node 2 in layer $L-1$. This may seem to be the wrong way round, but texts on neural networks adopt this convention deliberately. You will see why when we expand our network in the next section.
Expand the math
The example above is also accessible from a mathematical point of view. We can easily expand the output value $a^L$ as follows:
\begin{align*}
a^L&=\sigma(z^L) \\
&=\sigma(b^L+\sum_{i=1}^2w^L_ia^{L-1}_i) \\
&=\sigma(b^L+\sum_{i=1}^2w^L_i\sigma(z^{L-1}_i)) \\
&=\sigma(b^L+\sum_{i=1}^2w^L_i\sigma(b^{L-1}_i+\sum_{j=1}^2w^{L-1}_{ij}a^{L-2}_j)) \\
\end{align*}
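To see what this expansion means in practice, here is a minimal R sketch that evaluates the nested formula directly with loops. The weights, biases and incoming activations are invented purely for illustration and are not taken from the earlier scripts in this series.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Invented values for the small network in the diagram
a_in <- c(0.5, 0.9)                    # a^{L-2}, the activations entering from the left
w1   <- matrix(c(0.1, 0.4,
                -0.3, 0.8),
               nrow = 2, byrow = TRUE) # w^{L-1}_{ij}: rows index the receiving node i
b1   <- c(0.2, -0.1)                   # b^{L-1}_i
w2   <- c(0.7, -0.5)                   # w^L_i
b2   <- 0.3                            # b^L

# Evaluate the nested expansion of a^L term by term
z1 <- numeric(2)
for (i in 1:2) {
  z1[i] <- b1[i]
  for (j in 1:2) z1[i] <- z1[i] + w1[i, j] * a_in[j]
}
aL <- sigmoid(b2 + sum(w2 * sigmoid(z1)))
aL
```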
3 Simplifying the math using matrices
The above formula is getting unwieldy, and with all the nested sums, subscripts and superscripts it is becoming confusing! Imagine the confusion if we had many more layers, inputs and outputs. Let us write the above equation a different way:
\begin{align*}
a^L&=\sigma(z^L) \\
&=\sigma\Bigg(b^L +
\begin{bmatrix}
w^L_1 & w^L_2
\end{bmatrix} \cdot
\begin{bmatrix}
a^{L-1}_1 \\ a^{L-1}_2
\end{bmatrix}
\Bigg)\\
&=\sigma\Bigg(b^L +
\begin{bmatrix}
w^L_1 & w^L_2
\end{bmatrix} \cdot \sigma \Bigg(
\begin{bmatrix}
z^{L-1}_1 \\ z^{L-1}_2
\end{bmatrix} \Bigg)
\Bigg)\\
&=\sigma\Bigg(b^L +
\begin{bmatrix}
w^L_1 & w^L_2
\end{bmatrix} \cdot \sigma \Bigg(
\begin{bmatrix}
b^{L-1}_1 \\ b^{L-1}_2
\end{bmatrix} +
\begin{bmatrix}
w^{L-1}_{11} & w^{L-1}_{12} \\
w^{L-1}_{21} & w^{L-1}_{22}
\end{bmatrix}
\begin{bmatrix}
a^{L-2}_1 \\ a^{L-2}_2
\end{bmatrix}
\Bigg)
\Bigg)\\
\end{align*}
This is still a bit messy, but we can now see how to describe the network recursively with one very small and simple equation:
\begin{equation}\tag{3.1}\label{eq:3.1}
\bold a^{l+1}=\sigma(\bold b^{l+1}+\bold w^{l+1}\cdot\bold a^{l})\text{ where }0\leqslant l \leqslant L-1
\end{equation}
Starting at $l=0$, where the matrix $\bold a^0$ represents our inputs, we can recursively compute $\bold a^1, \bold a^2, \hdots, \bold a^L$ in a forward pass through the network from input to output. We can also now see why we adopted the counter-intuitive numbering: when the weights are placed in a matrix, the indices represent rows and then columns, as we would expect.
In our diagram $a^L$ is a scalar because we only have one output node, but we are more likely to have many nodes. For this reason we tend to always use matrix notation, safe in the knowledge that this also encompasses scalars, which are just a special case of the set of matrices (a one by one matrix).
We have already introduced the concept of a cost function, which compares the output ($a^L$) to a known value ($y$) and quantifies how close the two are. The cost function depends on both the output and the known value; however, during training the known value, $y$, is a constant, so for taking derivatives we can consider the cost to be a function of $a^L$ only. Using $f$ to denote the cost function, the network definition can be completed by adding:
\begin{equation}\tag{3.2}\label{eq:3.2}
c=f(\bold a^L,\bold y)
\end{equation}
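To make the forward pass concrete, here is a minimal R sketch of equation \ref{eq:3.1} together with a quadratic cost as one choice of $f$ in equation \ref{eq:3.2}. It assumes a sigmoid activation throughout; the layer sizes, weights and biases are made up purely for illustration and are not the ones from our earlier posts.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Illustrative network matching the diagram: 2 inputs -> 2 hidden -> 1 output
set.seed(1)
weights <- list(matrix(rnorm(4), nrow = 2),   # w for the hidden layer (2 x 2)
                matrix(rnorm(2), nrow = 1))   # w for the output layer (1 x 2)
biases  <- list(matrix(rnorm(2), nrow = 2),   # b for the hidden layer
                matrix(rnorm(1), nrow = 1))   # b for the output layer

# Forward pass: a^{l+1} = sigma(b^{l+1} + w^{l+1} . a^l), i.e. equation 3.1
forward <- function(a, weights, biases) {
  zs <- list()
  activations <- list(a)
  for (l in seq_along(weights)) {
    z <- biases[[l]] + weights[[l]] %*% a
    a <- sigmoid(z)
    zs[[l]] <- z
    activations[[l + 1]] <- a
  }
  list(zs = zs, activations = activations)
}

a0  <- matrix(c(0.5, 0.9), nrow = 2)             # inputs a^0
y   <- matrix(1, nrow = 1)                       # known value
fwd <- forward(a0, weights, biases)
cost <- 0.5 * sum((fwd$activations[[3]] - y)^2)  # quadratic cost (equation 3.2)
```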
4 The Back Propagation Algorithm
We now introduce a new measure $\delta^L$ to show how the final cost varies when we make changes to the weighted input sum $z^L$ applied to the last neuron in our diagram.
More precisely:
\[\delta^L=\frac{\partial c}{\partial z^L}\]
Since $c=f(a^L)$ and $a^L=\sigma(z^L)$, applying the chain rule gives:
\begin{equation} \tag{4.1}\label{eq:4.1}
\delta^L=f'(a^L)\sigma'(z^L)
\end{equation}
To see why this new measure is so useful, we can use it to tune the output neuron to minimize the cost by adjusting its weights and bias. Previously we calculated the slope of the cost function with respect to weight and bias in order to use gradient descent. If we repeat that exercise we see that both of these slopes are easily obtained from our new measure $\delta^L$.
\begin{align*}\tag{4.2}\label{eq:4.2}
\frac{\partial c}{\partial b^L} &= \frac{\partial c}{\partial z^L} \times \frac{\partial z^L}{\partial b^L} \\
&=\delta^L \times 1 \\
&=\delta^L
\end{align*}
Equation \ref{eq:4.2} tells us that the gradient of the cost with respect to the bias is equal to our new metric.
\begin{align*}\tag{4.3}\label{eq:4.3}
\frac{\partial c}{\partial w_j^L} &= \frac{\partial c}{\partial z^L} \times \frac{\partial z^L}{\partial w_j^L} \\
&=\delta^L \times a_j^{L-1} \\
&=\delta^L a_j^{L-1}
\end{align*}
Equation \ref{eq:4.3} tells us that the gradient of the cost with respect to the weight is proportional to our new metric; we just need to multiply by $a_j^{L-1}$, which is a value we calculated during the forward pass.
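Continuing the forward-pass sketch above, the R below computes $\delta^L$ and the two gradients for our single output neuron. It assumes the same quadratic cost, for which $f'(a^L)=a^L-y$, and the sigmoid activation, for which $\sigma'(z)=\sigma(z)(1-\sigma(z))$; the function and variable names are ours, not from the earlier scripts.

```r
sigmoid_prime <- function(z) sigmoid(z) * (1 - sigmoid(z))

aL <- fwd$activations[[3]]   # output activation a^L
zL <- fwd$zs[[2]]            # weighted input z^L to the output neuron

# Equation 4.1: delta^L = f'(a^L) * sigma'(z^L); quadratic cost gives f'(a^L) = a^L - y
deltaL <- (aL - y) * sigmoid_prime(zL)

# Equation 4.2: gradient of the cost with respect to the bias
grad_bL <- deltaL

# Equation 4.3: gradient with respect to the weights, reusing a^{L-1} from the forward pass
grad_wL <- deltaL %*% t(fwd$activations[[2]])
```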
More Outputs
If we have more than one output neuron then the metric $\boldsymbol{\delta}^L$ will no longer be a scalar. There will be a value $\delta_i^L$ for each neuron $i$ in the output layer.
\begin{tikzpicture}
[+preamble]
\usepackage{tikz}
\usetikzlibrary{positioning}
\tikzset{%
every neuron/.style={
circle,
draw,
minimum size = 1.2cm
},
}
[/preamble]
\foreach \m [count=\y] in {1,2} {
\node [every neuron/.try, neuron \m/.try] (input-\m) at (0,5-\y*4) {$a_\m^{L-1}$};
\node [every neuron/.try, neuron \m/.try ] (hidden-\m) at (4,5-\y*4) {$b_\m^{L}$};
}
\foreach \i in {1,2}
\foreach \j in {1,2}
\draw [->] (input-\i) -- (hidden-\j) node [pos=.25, fill=white] {$w_{\j\i}^{L}$};
\foreach \l [count=\x from 0] in {L-1, L}
\node [align=center, above] at (\x*4,2) {Layer \\ $\l$};
\end{tikzpicture}
This is not an issue at all, and we can write the three equations (\ref{eq:4.1}, \ref{eq:4.2}, \ref{eq:4.3}) in matrix form.
\[ \begin{bmatrix}
\delta^L_1\\[6pt]
\delta^L_2\\[6pt]
\vdots \\[6pt]
\delta^L_K
\end{bmatrix} =
\begin{bmatrix}
f'(a^L_1)\sigma'(z^L_1)\\[6pt]
f'(a^L_2)\sigma'(z^L_2)\\[6pt]
\vdots\\[6pt]
f'(a^L_K)\sigma'(z^L_K)
\end{bmatrix} = \begin{bmatrix}
f'(a^L_1)\\[6pt]
f'(a^L_2)\\[6pt]
\vdots\\[6pt]
f'(a^L_K)
\end{bmatrix} \odot
\begin{bmatrix}
\sigma'(z^L_1)\\[6pt]
\sigma'(z^L_2)\\[6pt]
\vdots\\[6pt]
\sigma'(z^L_K)
\end{bmatrix}\]
\begin{equation} \tag{BP1}
\boldsymbol{\delta}^L=f'(\bold a^L) \odot \sigma'(\bold z^L)
\end{equation}
where $\odot$ is the Hadamard product, i.e. element-wise multiplication.
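In R the Hadamard product of two equally sized vectors or matrices is simply the `*` operator, so equation (BP1) translates directly. The sketch below again assumes a quadratic cost, so that $f'(\bold a^L)=\bold a^L-\bold y$, and reuses `sigmoid_prime()` from the earlier sketch; the example vectors are invented for illustration.

```r
# Invented output layer with K = 3 neurons
aL_out <- c(0.2, 0.7, 0.9)    # activations a^L
zL_out <- c(-1.4, 0.8, 2.2)   # weighted inputs z^L
y_out  <- c(0, 1, 1)          # known values

# BP1: delta^L = f'(a^L) (Hadamard) sigma'(z^L); '*' is element-wise in R
deltaL_out <- (aL_out - y_out) * sigmoid_prime(zL_out)
```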
\[
\begin{bmatrix}
\frac{\partial c}{\partial b_1^L} \\[6pt]
\vdots\\[6pt]
\frac{\partial c}{\partial b_K^L}
\end{bmatrix} =
\begin{bmatrix}
\frac{\partial c}{\partial z_1^L} \times \frac{\partial z_1^L}{\partial b_1^L}\\[6pt]
\vdots\\[6pt]
\frac{\partial c}{\partial z_K^L} \times \frac{\partial z_K^L}{\partial b_K^L}
\end{bmatrix} =
\begin{bmatrix}
\delta_1^L\\[6pt]\vdots\\[6pt]\delta_K^L
\end{bmatrix}\odot
\begin{bmatrix}
1\\[6pt]\vdots\\[6pt]1
\end{bmatrix}\]
\begin{align*}
\frac{\partial c}{\partial \bold b^L} &= \frac{\partial c}{\partial \bold z^L} \times \frac{\partial \bold z^L}{\partial \bold b^L} \\
&=\boldsymbol{\delta}^L \odot 1 \\
&=\boldsymbol{\delta}^L
\end{align*}
\begin{equation} \tag{BP2}\label{eq:BP2}
\nabla_{\bold b}^L c=\boldsymbol {\delta}^L
\end{equation}
\begin{align*}
\begin{bmatrix}
\frac{\partial c}{\partial w_{11}^L} & \frac{\partial c}{\partial w_{12}^L} \\[8pt]
\frac{\partial c}{\partial w_{21}^L} & \frac{\partial c}{\partial w_{22}^L}
\end{bmatrix} &=
\begin{bmatrix}
\frac{\partial c}{\partial z_1^L} \times
\frac{\partial z_1^L}{\partial w_{11}^L} &
\frac{\partial c}{\partial z_1^L} \times
\frac{\partial z_1^L}{\partial w_{12}^L} \\[8pt]
\frac{\partial c}{\partial z_2^L} \times
\frac{\partial z_2^L}{\partial w_{21}^L} &
\frac{\partial c}{\partial z_2^L} \times
\frac{\partial z_2^L}{\partial w_{22}^L}
\end{bmatrix} \\&=
\begin{bmatrix}
\delta_1^L a_1^{L-1} &
\delta_1^L a_2^{L-1} \\[8pt]
\delta_2^L a_1^{L-1} & \delta_2^L a_2^{L-1}
\end{bmatrix}\\&=
\begin{bmatrix}
\delta_1^L \\[8pt]
\delta_2^L
\end{bmatrix} \cdot
\begin{bmatrix}
a_1^{L-1} & a_2^{L-1}
\end{bmatrix}
\end{align*}
\begin{align*}
\frac{\partial c}{\partial \bold w^L}
&=\boldsymbol{\delta}^L \cdot [\bold a^{L-1}]^T \\
\end{align*}
\begin{equation} \tag{BP3}\label{eq:BP3}
\nabla_{\bold w}^Lc=\boldsymbol{\delta}^L \cdot [\bold a^{L-1}]^T
\end{equation}
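In R, \ref{eq:BP2} and \ref{eq:BP3} are equally brief: the bias gradient is the delta vector itself, and the weight gradient is the outer product of the delta vector with the previous layer's activations. This continues the invented three-output example above, with a made-up $\bold a^{L-1}$.

```r
# Invented activations from the previous layer (J = 2 neurons)
a_prev <- c(0.4, 0.6)

# BP2: gradient of the cost with respect to the biases of layer L
grad_b_out <- deltaL_out

# BP3: gradient with respect to the weight matrix, the outer product delta^L . [a^{L-1}]^T
grad_w_out <- deltaL_out %*% t(a_prev)   # a K x J matrix, here 3 x 2
```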
Now that we have a very simple calculation for tuning our output neuron let us think about the neurons in prior layers, specifically layer $L-1$.
\[\delta^{L-1}_i=\frac {\partial c}{\partial z^{L-1}_i}\]
The cost will be the total of the costs from each output neuron, and each output neuron is influenced by all of the neurons in the previous layer. To help us analyse this, let us imagine that we have $K$ neurons in layer $L$ and $J$ neurons in layer $L-1$.
\[c=\sum^K_{k=1} f(a^L_k)
=\sum^K_{k=1} f(\sigma(z^L_k))\]
\[\delta^{L-1}_i=\sum^K_{k=1} f'(a^L_k)
\sigma'(z^L_k)
\frac{\partial z^L_k}{\partial z^{L-1}_i}\]
\[\delta^{L-1}_i=\sum^K_{k=1} f'(a^L_k)\sigma'(z^L_k)\frac{\partial}{\partial z^{L-1}_i}\bigg[b^L_k+\sum_{j=1}^J w^L_{kj}a^{L-1}_j\bigg]\]
\[\delta^{L-1}_i=\sum^K_{k=1} f'(a^L_k)\sigma'(z^L_k)\sum_{j=1}^J\Big[w^L_{kj}\sigma'(z^{L-1}_j)\frac{\partial z^{L-1}_j}{\partial z^{L-1}_i}\Big]\]
Where $i=j$ the final partial derivative becomes one and conversely where $i \ne j$ it is zero. We can therefore simplify the second sum.
\[\delta^{L-1}_i=\sum^K_{k=1} f'(a^L_k)\sigma'(z^L_k)w^L_{ki}\sigma'(z^{L-1}_i)\]
But equation \ref{eq:4.1} allows us to substitute $\delta_k^L$ for $f'(a^L_k)\sigma'(z^L_k)$:
\begin{equation} \tag{4.4} \label{eq:4.4}
\delta^{L-1}_i=\sum^K_{k=1} \delta^L_kw^L_{ki}\sigma'(z^{L-1}_i)
\end{equation}
The Heart of Back Propagation
This equation expresses $\delta^{L-1}$ in terms of $\delta^L$. This is again a recursive relationship: once we have $\delta^L$ for the last layer, we can compute the deltas recursively from the output of the network back towards the input. This is the back propagation algorithm, which helps to speed up machine learning in neural networks. It gives us the final equation in the set of four which describe back propagation. Our other equations have all been written in matrix format, so let us now do the same for \ref{eq:4.4}.
\[
\begin{bmatrix}
\delta^{L-1}_1 \\[8pt]
\delta^{L-1}_2 \\[8pt]
\vdots \\[8pt]
\delta^{L-1}_J
\end{bmatrix} =
\begin{bmatrix}
\delta^L_1w^L_{11} \sigma'(z^{L-1}_1) + \delta^L_2w^L_{21} \sigma'(z^{L-1}_1) + \hdots +
\delta^L_Kw^L_{K1} \sigma'(z^{L-1}_1) \\[8pt]
\delta^L_1w^L_{12} \sigma'(z^{L-1}_2) +
\delta^L_2w^L_{22} \sigma'(z^{L-1}_2) + \hdots +
\delta^L_Kw^L_{K2} \sigma'(z^{L-1}_2) \\[8pt]
\vdots \\[8pt]
\delta^L_1w^L_{1J}\sigma'(z^{L-1}_J) + \delta^L_2w^L_{2J}\sigma'(z^{L-1}_J) + \hdots +
\delta^L_Kw^L_{KJ}\sigma'(z^{L-1}_J)
\end{bmatrix}\]
\[
\begin{bmatrix}
\delta^{L-1}_1 \\[8pt]
\delta^{L-1}_2 \\[8pt]
\vdots \\[8pt]
\delta^{L-1}_J
\end{bmatrix} =
\begin{bmatrix}
\delta^L_1w^L_{11}+\delta^L_2w^L_{21}+\hdots+\delta^L_Kw^L_{K1} \\[8pt]
\delta^L_1w^L_{12}+\delta^L_2w^L_{22}+\hdots+\delta^L_Kw^L_{K2} \\[8pt]
\vdots \\[8pt]
\delta^L_1w^L_{1J}+\delta^L_2w^L_{2J}+\hdots+\delta^L_Kw^L_{KJ}
\end{bmatrix} \odot
\begin{bmatrix}
\sigma'(z^{L-1}_1) \\[8pt]
\sigma'(z^{L-1}_2) \\[8pt]
\vdots \\[8pt]
\sigma'(z^{L-1}_J)
\end{bmatrix}\]
\[\begin{bmatrix}
\delta^{L-1}_1 \\[8pt]
\delta^{L-1}_2 \\[8pt]
\vdots \\[8pt]
\delta^{L-1}_J
\end{bmatrix} =
\begin{bmatrix}
w^L_{11} & w^L_{21} & \hdots & w^L_{K1} \\[8pt]
w^L_{12} & w^L_{22} & \hdots & w^L_{K2} \\[8pt]
\vdots & \vdots & \ddots & \vdots \\[8pt]
w^L_{1J} & w^L_{2J} & \hdots & w^L_{KJ}
\end{bmatrix}
\begin{bmatrix}
\delta^L_1 \\[8pt]
\delta^L_2 \\[8pt]
\vdots \\[8pt]
\delta^L_K
\end{bmatrix} \odot
\begin{bmatrix}
\sigma'(z^{L-1}_1) \\[8pt]
\sigma'(z^{L-1}_2) \\[8pt]
\vdots \\[8pt]
\sigma'(z^{L-1}_J)
\end{bmatrix}\]
From here we can now see a convenient matrix format for our recursive formula:
\begin{equation} \tag{BP4} \label{eq:BP4}
\boldsymbol{\delta}^{L-1}=\Big([\bold w^{L}]^T \cdot \boldsymbol{\delta}^{L}\Big) \odot \sigma'(\bold z^{L-1})
\end{equation}
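In R, one step of \ref{eq:BP4} is the matrix product of the transposed weight matrix with the delta vector, followed by an element-wise multiplication. Continuing the invented example, with a made-up $3 \times 2$ weight matrix connecting layers $L-1$ and $L$:

```r
# Invented 3 x 2 weight matrix w^L (K = 3 output neurons, J = 2 neurons in layer L-1)
wL_out <- matrix(c( 0.7, -0.5,
                    0.2,  0.9,
                   -0.4,  0.1), nrow = 3, byrow = TRUE)
z_prev <- c(0.3, -0.5)   # weighted inputs z^{L-1}

# BP4: delta^{L-1} = ([w^L]^T . delta^L) (Hadamard) sigma'(z^{L-1})
delta_prev <- (t(wL_out) %*% deltaL_out) * sigmoid_prime(z_prev)   # a J x 1 column
```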
Final Thoughts
We are very close to having the final back propagation equations. However, \ref{eq:BP2} and \ref{eq:BP3} refer to gradients in the final layer $L$. What about the other layers? In fact the metric $\delta$ behaves in the same way in every layer, so these two equations work in any layer, not just the last.
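Putting the four equations together, a complete backward pass might look like the sketch below. It reuses `forward()`, `sigmoid_prime()`, `weights`, `biases` and `fwd` from the earlier sketches and again assumes a quadratic cost; it is meant to show the shape of the algorithm rather than to be a polished implementation.

```r
backward <- function(fwd, weights, y) {
  L <- length(weights)
  grads_b <- vector("list", L)
  grads_w <- vector("list", L)

  # BP1 at the output layer (quadratic cost, so f'(a^L) = a^L - y)
  delta <- (fwd$activations[[L + 1]] - y) * sigmoid_prime(fwd$zs[[L]])

  for (l in L:1) {
    grads_b[[l]] <- delta                              # BP2
    grads_w[[l]] <- delta %*% t(fwd$activations[[l]])  # BP3
    if (l > 1) {                                       # BP4: carry delta back one layer
      delta <- (t(weights[[l]]) %*% delta) * sigmoid_prime(fwd$zs[[l - 1]])
    }
  }
  list(b = grads_b, w = grads_w)
}

# One backward pass for the small 2-2-1 network defined earlier
grads <- backward(fwd, weights, y = matrix(1, nrow = 1))
```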