# Introduction

This is the first of a short series of posts to help the reader to understand what we mean by neural networks and how they work. In this first post we will explain what we mean by a neuron and we will introduce the mathematics of how we calculate the numbers associated with it. We will explain some of the terms used too.

#### Other posts in this series:

- In the second of the series we will run some code to build on the concepts in this post.
- The third post in the series will explain the back propagation algorithm by deriving it mathematically.

You can find some simple code written in R that will help to explain some of the techniques in these posts. It can be downloaded from JTA The Data Scientist’s GitHub page.

## 1 Features of a Neuron

Let us start with a neuron. Firstly we will explain some terms and functions and then we will see how we can train the neuron to produce the output (or prediction) we require.

Here is a picture of the neuron:

\begin{tikzpicture}
[+preamble]
\usepackage{tikz}
\usetikzlibrary{positioning}
\tikzset{%
every neuron/.style={
circle,
draw,
minimum size=1cm
},
neuron missing/.style={
draw=none,
scale=3,
text height=0.333cm,
execute at begin node=\color{black}$\vdots$
},
}
[/preamble]
[x=2cm, >=stealth]
% Draw 2 input nodes
\foreach \m [count=\y] in {1,2}
\node [every neuron/.try, neuron \m/.try] (input-\m) at (0, 3 - \y * 1.5) {};
% Draw an output node
\node [every neuron/.try, neuron /.try ] (output) at (4, 0.75) {$b$};
% Label the input nodes
\foreach \l [count=\i] in {1,2}
\draw [<-] (input-\i) -- ++(-1,0) node [above, midway] {$i_\l$};
% Draw connections from input to output and label them
\draw [->] (input-1) -- (output) node [above, midway] {$w_1$};
\draw [->] (input-2) -- (output) node [above, midway] {$w_2$};
% Add an arrow showing the output
\draw [->] (output) -- ++(1,0) node [above, midway] {$z$};
% Name the layers
\foreach \l [count=\x from 0] in {Input, Output}
\node [align=center, above] at (\x*4,2) {\l \\ layer};
\end{tikzpicture}

On the left we have two input nodes. These nodes are not neurons in themselves but are a representation of input values. On the right we have our neuron. We have also introduced the idea of input layers and output layers. Let us now introduce the four features (weight, bias, activation and cost) that define how this system calculates, predicts and learns.

### 1.1 Weight and Bias in Neural Networks

Each neuron is a simple mathematical function that transforms one or more input values into an output value. At its most basic, the neuron has a weight ($w_k$) for each input and a bias ($b$). The neuron calculates its input sum, which we will denote by $z$, by multiplying each input value by its weight, summing the results and adding the bias.

\[z=\sum_{k=1}^{I} w_k i_k+b\] where $I$ is the number of inputs.

In the simple diagram above we only have 2 inputs giving us:

\[z=w_1i_1+w_2i_2+b\]
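The weighted sum can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_input` and the example numbers are made up for this sketch and do not come from the R code that accompanies the series.

```python
def weighted_input(inputs, weights, bias):
    """Return z = sum_k w_k * i_k + b for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * i for w, i in zip(weights, inputs)) + bias


# Two inputs, as in the diagram. The values are arbitrary examples.
z = weighted_input(inputs=[0.5, -1.0], weights=[2.0, 1.0], bias=0.25)
print(z)  # 2.0*0.5 + 1.0*(-1.0) + 0.25 = 0.25
```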

### 1.2 Activation in Neural Networks

If we were to only have weights and a bias then the neuron could only represent a linear model. In order to predict more complex things we would like our neuron to display some non-linear behavior. We take the value of z and apply a function to it to give the final output. The function is called the activation function. The final value we denote as $a=\sigma(z)$.

Imagine an activation function that transformed the input sum into a binary value that we could use to denote YES and NO.

$\sigma(z)=\begin{cases}1&\text{ when }z \geqslant 0\\0&\text{ when }z < 0\end{cases}$

This is perhaps the simplest of functions and will result in a decision or classification of yes or no. We can intuitively see that a larger value for $w_1$ makes the system more sensitive to the input $i_1$. Similarly a negative value for b means that the neuron will only decide yes in more extreme cases. Neurons which use this particular function are a special case and are called perceptrons. Networks of perceptrons can be connected to make complex decisions.

Unfortunately this function switches from 0 to 1 instantaneously at $z=0$, and this presents a problem. It is very hard to train a network of perceptrons using machine learning, because as we alter the weights and biases in the network the outputs make sudden jumps. What we need is a system where small changes to the parameters produce small, smooth changes in the output. This is easier to train. We will now investigate some of the common activation functions which give continuous outputs that can be trained.
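To make the sudden jump concrete, here is a minimal Python sketch of a perceptron. The names `step` and `perceptron`, and the weights chosen so that $z$ equals the bias, are assumptions made purely for illustration.

```python
def step(z):
    """Perceptron activation: 1 when z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias):
    z = sum(w * i for w, i in zip(weights, inputs)) + bias
    return step(z)


# The weighted inputs cancel out, so z is equal to the bias.
# A tiny change in the bias around zero flips the output completely.
for bias in (-0.001, 0.0, 0.001):
    print(bias, perceptron([1.0, 1.0], [0.5, -0.5], bias))
# prints outputs 0, 1, 1
```

The output never takes a value between 0 and 1, so a small change in a parameter either does nothing or flips the decision.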

#### 1.2.1 ReLU Activation Function

We start by introducing one of the most common functions. It is called the Rectified Linear Unit, or ReLU for short. This function works very well in practice, although the reasons are not fully understood. On the face of it, the function should not work well because it is not smooth: it is continuous, but it has a kink at zero where its slope changes abruptly. The kink, however, does not cause problems in practice, and the function is extremely cheap to calculate. This helps to speed up processing and, consequently, learning when we use ReLU.

##### ReLU Function:

\[\sigma(x)=\begin{cases} 0,&\text{if } x<0\\x, &\text{if } x \geqslant 0 \end{cases}\]

##### ReLU Chart:

##### ReLU Derivative:

It is clear from the definition that the ReLU function is not differentiable at zero, because its slope jumps from 0 to 1 there. However, we can push on and just say that $\sigma'(x)$ is 1 for positive values of $x$ and 0 for negative values.

\[\sigma'(x)= \begin{cases}0&\text{when } x<0\\1&\text{when }x > 0\end{cases}\]
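A minimal Python sketch of ReLU and its derivative follows. The derivative is undefined at exactly zero; returning 0 there is an assumption of this sketch (a common convention), not something stated in the text.

```python
def relu(x):
    """Rectified Linear Unit: 0 for negative x, x otherwise."""
    return x if x >= 0 else 0.0


def relu_derivative(x):
    """Slope of ReLU: 1 for x > 0 and 0 for x < 0.

    Assumption: at x == 0, where the slope is undefined, we return 0.
    """
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_derivative(x))
```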

#### 1.2.2 Parametric ReLU

This function is a modification of ReLU that helps us with negative numbers. Instead of returning zero for negative inputs it returns a small fraction of the input, so negative values are allowed to leak through the function.

\[\sigma(x)=\begin{cases} 0.01x &\text{when } x < 0\\x &\text{when } x \geqslant 0 \end{cases}\]

With a fixed slope such as 0.01 the function is usually called Leaky ReLU. The slope can instead be treated as a parameter, hence the name Parametric ReLU.
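As a rough Python sketch, the slope for negative inputs can be exposed as an argument. The name `alpha` is an assumption of this example. The negative branch is `alpha * x`, so a negative input gives a small negative output, which is the usual convention.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky / Parametric ReLU: x for x >= 0, alpha * x for x < 0."""
    return x if x >= 0 else alpha * x


def leaky_relu_derivative(x, alpha=0.01):
    """Slope: 1 for x > 0, alpha for x < 0.

    Assumption: at x == 0 we return alpha.
    """
    return 1.0 if x > 0 else alpha


print(leaky_relu(-2.0))             # small negative value, about -0.02
print(leaky_relu(-2.0, alpha=0.2))  # larger leak, about -0.4
print(leaky_relu(3.0))              # positive inputs pass through unchanged
```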

#### 1.2.3 Logistic Activation Function

When we considered the perceptron above it provided us with a step function which we noted was hard to train. The logistic function is a way that we can approximate the perceptron but with a continuous function. The logistic function is often called the sigmoid.

##### The Logistic Function:

\[\sigma(z)= \frac 1 {1+e^{-z}}\]

##### The Logistic Chart:

In the chart, the solid black line shows the perceptron's step function and the dashed line is the logistic function. This helps to show how they are similar. You can also see that the logistic curve is continuous and so avoids the sudden jump of the perceptron.

##### The Logistic Derivative:

\[\sigma'(z)=-(1+e^{-z})^{-2}\cdot(-e^{-z})=\frac{e^{-z}}{(1+e^{-z})^2}\]

\[\sigma'(z)=\frac{1}{(1+e^{-z})}\cdot \frac{e^{-z}}{(1+e^{-z})} = \frac{1}{(1+e^{-z})}\cdot \Big[1-\frac{1}{(1+e^{-z})}\Big]\]

\[\sigma'(z)=\sigma(z)(1-\sigma(z))\]
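The result $\sigma'(z)=\sigma(z)(1-\sigma(z))$ can be checked numerically. The Python sketch below compares it with a central finite difference. The helper names and the step size `h` are assumptions of this example.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_derivative(z):
    """Derivative via the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = logistic(z)
    return s * (1.0 - s)


def central_difference(f, z, h=1e-6):
    """Numerical estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, logistic_derivative(z), central_difference(logistic, z))
```

The two columns should agree to several decimal places, and both are positive everywhere because the logistic function is always increasing.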

#### 1.2.4 Hyperbolic Tangent Activation Function

##### The Hyperbolic Tangent Function

\[\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}\]

Adding 1 to both sides gives us:

\[1+\tanh(z)=1+\frac{e^z-e^{-z}}{e^z+e^{-z}}=\frac{e^z+e^{-z}+e^z-e^{-z}}{e^z+e^{-z}}=\frac{2e^z}{e^z+e^{-z}}\]

Dividing top and bottom by $e^z$ gives us:

\[1+\tanh(z)=\frac2{1+e^{-2z}}=2\sigma(2z)\]

where $\sigma$ is the logistic function. The hyperbolic tangent is therefore just a scaled and shifted logistic function: $\tanh(z)=2\sigma(2z)-1$.

##### The Hyperbolic Tangent Chart

The hyperbolic tangent can be seen on the chart along with the logistic function as a dashed line. The hyperbolic tangent returns negative numbers, which can be useful in certain cases.

##### The Hyperbolic Tangent Derivative

\[\tanh'(z)=1-\tanh^2(z)\]
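Both the relationship between the hyperbolic tangent and the logistic function and the derivative formula can be checked numerically. This is a small Python sketch. The helper names are assumptions of this example, and `math.tanh` is the standard library implementation.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh_derivative(z):
    """Derivative of tanh via 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


def identity_gap(z):
    """Absolute difference between 1 + tanh(z) and 2 * logistic(2z)."""
    return abs((1.0 + math.tanh(z)) - 2.0 * logistic(2.0 * z))


def central_difference(f, z, h=1e-6):
    """Numerical estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(z, identity_gap(z), tanh_derivative(z), central_difference(math.tanh, z))
```

The gap should be zero up to floating point rounding, and the last two columns should agree closely.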

#### 1.2.5 Soft Max

The Soft Max activation function is mostly found in the output layer of networks that are built to classify data. The definition only makes sense when we consider a vector of values, one for each output neuron.

Suppose we have a network that decides if an email is spam. We would most likely have output neurons for ‘Spam’ and ‘Ham’. If the network calculated two numbers {2.3, 0.7}, we would choose the largest value to say which class the network chose. In this case it is detecting ‘Spam’. It is more intuitive to normalize the outputs so that they sum to 1, because we can then treat the output vector as a probability distribution. If we had only positive numbers we could just divide each element by the total to give {0.77, 0.23}; however, negative values would be a problem. With Soft Max we first take the exponential of each value before normalizing, because the exponential is always positive.

\[SoftMax(\hat z)_i=\frac{e^{z_i}}{\sum_j e^{z_j}}\]

\[SoftMax(\{2.3, 0.7\})=\{0.83, 0.17\}\]
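The Spam/Ham numbers can be reproduced with a short Python sketch. Subtracting the largest value before taking exponentials is an addition of this example (a common trick to avoid overflow). It does not change the result, because the common factor cancels in the ratio.

```python
import math


def softmax(z):
    """Exponentiate each value, then normalize so the results sum to 1."""
    m = max(z)  # assumption of this sketch: shift by the maximum for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


print([round(p, 2) for p in softmax([2.3, 0.7])])  # [0.83, 0.17]

# Negative inputs are handled without any special treatment.
print(softmax([-1.0, -3.0]))
```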

### 1.3 Example Neuron

Let us consider a neuron where $w_1=w_2=-1$ and $b=1$ and the activation function is ReLU. We now present inputs of either zero or one and look at the output from our neuron:

\[
\begin{matrix}
i_1 & i_2 & z & \sigma(z) \\
0 & 0 & +1 & 1 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 1 & -1 & 0
\end{matrix}
\]

The above truth table shows us that, when presented with only the binary values 0 and 1, the neuron behaves as a nor gate. All other logic gates can be built from nor gates. Writing $\cup$ for OR, $\cap$ for AND and an overline for NOT, De Morgan’s laws give:

\begin{align*}
\overline {A \cup A} &= \overline A &\text{ Giving a not gate or inverter} \\
\overline{ (\overline{A \cup B})\cup(\overline{A \cup B})}&=A \cup B & \text{ Giving an or gate} \\
\overline{(\overline{A\cup A}) \cup (\overline{B\cup B})} &=A \cap B & \text{ Giving an and gate}
\end{align*}

It is interesting to see that neurons can have the behavior of logic gates and so, in theory, any physical computer could be modeled using a neural network. In the second post in this series we show some simple code in R that teaches a random neuron to behave like a nor gate.
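The truth table and the gate constructions above can be sketched in Python as follows. The weights and bias are the ones from the example ($w_1=w_2=-1$, $b=1$, ReLU activation). The function names are made up for this illustration.

```python
def relu(z):
    return z if z >= 0 else 0


def nor_neuron(i1, i2):
    """Example neuron: w1 = w2 = -1, b = 1, ReLU activation."""
    z = -1 * i1 + -1 * i2 + 1
    return relu(z)


# Other gates built only from the nor neuron.
def not_gate(a):
    return nor_neuron(a, a)


def or_gate(a, b):
    n = nor_neuron(a, b)
    return nor_neuron(n, n)


def and_gate(a, b):
    return nor_neuron(nor_neuron(a, a), nor_neuron(b, b))


print("i1 i2 nor or and")
for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, nor_neuron(i1, i2), or_gate(i1, i2), and_gate(i1, i2))
```

Note that this only holds for inputs that are exactly 0 or 1, as the text says. For other input values the ReLU output is not restricted to 0 and 1.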

### 1.4 Cost Functions

If we have a set of observations and we wish to use them to train the network then it will help to be able to measure how close the network is to giving the correct answer. This transformation of actual outputs and known correct outputs to a value is called the cost function. It is important to realize that if we are using the network as a classifier then just stating the proportion of correctly identified states is not a good approach. This is because this measure will only have a finite number of states.

In the example of the nor gate the percentage of correctly identified outputs could only be 0%, 25%, 50%, 75% or 100%. We need a continuous metric that tells us how close to a solution we are because a function with a finite number of states prevents our use of calculus. For this reason we will use cost functions which are continuous and can be differentiated.

Although the proportion of correctly identified states is not useful for training, it is often computed and is called the accuracy. The accuracy is a useful metric for comparing models and can be computed for both the training data and an independent random data set called the test data.

When we have many examples in our training data then the cost is either summed or averaged over all the inputs to give us a single numeric value.
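The difference between accuracy and a continuous cost can be seen on the nor gate data. The Python sketch below is illustrative only: it averages half the squared error (the quadratic cost of section 1.4.1) over the four examples. The 0.5 decision threshold and all function names are assumptions of this example.

```python
def relu(z):
    return z if z >= 0 else 0.0


# The four nor gate examples: ((i1, i2), expected output)
NOR_DATA = [((0, 0), 1), ((0, 1), 0), ((1, 0), 0), ((1, 1), 0)]


def neuron(inputs, w1, w2, b):
    i1, i2 = inputs
    return relu(w1 * i1 + w2 * i2 + b)


def accuracy(w1, w2, b, threshold=0.5):
    """Proportion of examples classified correctly (assumed 0.5 threshold)."""
    hits = 0
    for inputs, y in NOR_DATA:
        predicted = 1 if neuron(inputs, w1, w2, b) >= threshold else 0
        hits += int(predicted == y)
    return hits / len(NOR_DATA)


def mean_quadratic_cost(w1, w2, b):
    """Half the squared error, averaged over all examples."""
    costs = [0.5 * (neuron(inputs, w1, w2, b) - y) ** 2 for inputs, y in NOR_DATA]
    return sum(costs) / len(costs)


for b in (0.4, 0.8, 0.9):
    print(b, accuracy(-1, -1, b), mean_quadratic_cost(-1, -1, b))
```

Moving the bias from 0.8 to 0.9 leaves the accuracy at 1.0, while the averaged cost still falls, which is the signal a training procedure needs.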

#### 1.4.1 Quadratic Cost

There are a few cost functions that we can use but we will start with the simplest, which is based on the squared Euclidean distance. If we have a known observation $y$ and the neuron is giving output $a$ then the cost is:

\[c=\frac 1 2 \| a-y \|^2=\frac 1 2 (a-y)^2\]

You may wonder why we have the factor of one half: This makes the derivative cleaner as we would otherwise have factors of 2 when we differentiate.
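As a small Python sketch, here are the quadratic cost and its derivative with respect to the output $a$. The function names are assumptions of this example. The factor of one half cancels the 2 that appears on differentiation, leaving simply $a-y$.

```python
def quadratic_cost(a, y):
    """c = 1/2 * (a - y)^2 for a single output a and target y."""
    return 0.5 * (a - y) ** 2


def quadratic_cost_derivative(a, y):
    """dc/da = a - y; the 1/2 cancels the factor of 2."""
    return a - y


print(quadratic_cost(0.5, 1.0))             # 0.125
print(quadratic_cost_derivative(0.5, 1.0))  # -0.5
```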

#### 1.4.2 Log Probability

In section 1.2.5 we introduced the SoftMax function which aims to give a vector of positive numbers that sum to one. In this way we treat the output vectors as probabilities. The logarithm of 1 is 0 and so where we seek to identify a class using a probability vector we can consider the simple logarithm as a cost function which approaches zero as the probability approaches 1.

We know the SoftMax vector sums to 1 and so any value not equal to 1 must be below 1. This means that as the network converges the logarithm will be a negative number.

For this reason we define the cost function as the negated logarithm:

\[c=-\ln(a)\]

Imagine that the network gave very strong prediction values of {10,1} when we know the true output class is {1,0}. The SoftMax function turns this into {0.999877, 0.000123}. The cost for the first term is of the order of $10^{-4}$ and around $9$ for the second term. Clearly the value for the first term is correctly measuring the fact that the output is a good indicator of $\{1, 0\}$ but the second term is large and implies that we are not close to the solution. For this reason we only compute the log probability cost for the known class and we ignore the other values.

$c=-y\ln(a)$, where $y$ is either 0 or 1 depending on the expected output.
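The numbers in the example above can be reproduced with a short Python sketch. The function names are assumptions of this example. Terms with $y=0$ are skipped, so only the known class contributes to the cost.

```python
import math


def softmax(z):
    m = max(z)  # shift by the maximum to avoid overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def log_probability_cost(a, y):
    """c = -sum_k y_k * ln(a_k), where y is a 0/1 vector marking the true class."""
    return -sum(y_k * math.log(a_k) for a_k, y_k in zip(a, y) if y_k != 0)


a = softmax([10, 1])
print([round(p, 6) for p in a])            # [0.999877, 0.000123]
print(log_probability_cost(a, [1, 0]))     # of the order of 1e-4
print(log_probability_cost(a, [0, 1]))     # about 9, if the true class were the second
```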

#### 1.4.3 Cross-Entropy Cost

We use the gradient of the cost function to train our neurons. Unfortunately some cost functions have gradients that can be very small. We saw in the chart of the logistic function that when $|z|>5$ the slope becomes very small as the function approaches its asymptotes. This can make learning very slow until the values enter more reasonable territory. Cross-entropy cost attempts to solve that issue.

\[c= - \frac{1}{n} \sum [y \ln(a)+(1 - y) \ln(1 - a)]\]
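The following Python sketch computes the cross-entropy cost averaged over $n$ examples, and compares gradients for a single neuron with a logistic activation. The gradient formulas used are the standard results for that case: with respect to $z$, the quadratic cost gives $(a-y)\,\sigma'(z)$ and the cross-entropy cost gives $a-y$. The clipping constant `eps` and all function names are assumptions of this example.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy_cost(outputs, targets, eps=1e-12):
    """c = -(1/n) * sum[y ln(a) + (1 - y) ln(1 - a)] over n examples."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        a = min(max(a, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
        total += y * math.log(a) + (1 - y) * math.log(1 - a)
    return -total / n


def quadratic_gradient_wrt_z(z, y):
    """d/dz of 1/2 (a - y)^2 with a = logistic(z): (a - y) * a * (1 - a)."""
    a = logistic(z)
    return (a - y) * a * (1 - a)


def cross_entropy_gradient_wrt_z(z, y):
    """d/dz of the cross-entropy cost with a = logistic(z): a - y."""
    return logistic(z) - y


print(cross_entropy_cost([0.9, 0.1], [1, 0]))

# A badly wrong neuron: z = -6 when the target is 1.
print(quadratic_gradient_wrt_z(-6.0, 1))      # tiny, so learning is slow
print(cross_entropy_gradient_wrt_z(-6.0, 1))  # close to -1, so learning is fast
```

In this sketch the small slope of the logistic function cancels out of the cross-entropy gradient, which is why a neuron that is badly wrong can still learn quickly.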

We will explain more in the second post in this series, with some live examples written in R that you can run.

We hope you found this introduction interesting and please go on to study more in the next posts.