Generally speaking, NNs are trained on examples to produce predictions based on some input values.
The example data (input + desired output) draws a curve that the NN is trying to fit.
In a nutshell, NNs are, like most ML tools, a fancy way to fit the curves inherently drawn by the examples used to train them.
The more inputs it needs, and the more outputs it produces, the higher the dimension of the curve.
The simplest curve we can fit is... a line!
Fitting a line is known to Mathematicians & Statisticians as LINEAR REGRESSION.
The equation of a line is:
𝐲 = 𝐱𝐰 + 𝐛
where:
🔸 𝐰: Slope
🔸 𝐛: Y-intercept
Fitting a line means finding the 𝐰 & 𝐛 of the line that best fits the input data!
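If you are curious what finding 𝐰 & 𝐛 looks like in practice, here is a minimal NumPy sketch using the textbook least-squares formulas (the name `fit_line` and the data are my own, purely illustrative):

```python
import numpy as np

def fit_line(x, y):
    # Least-squares estimates for y ≈ x*w + b:
    # slope w = cov(x, y) / var(x), intercept b = mean(y) - w * mean(x).
    w = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    b = y.mean() - w * x.mean()
    return w, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])  # noisy samples of roughly y = 2x + 1
w, b = fit_line(x, y)
print(w, b)  # ~1.94, ~1.09
```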
To find which line better fits the training data, we need to define what "better" means first.
There are many ways to measure "linear fitness", and they all take into account how close each point is to the line.
The RMSE (Root Mean Square Error) is a very popular metric.
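As a quick sketch (reusing the made-up data from above), RMSE is just three operations: square the residuals, average them, take the root:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Square each point's distance from the line, average, take the square root.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(rmse(y, 2.0 * x + 1.0))  # error of the candidate line y = 2x + 1
```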
On top of the "traditional" algebraic form (𝐲 = 𝐱𝐰 + 𝐛), let's introduce a more "visual" way to represent equations.
💡 NETWORKS allow us to better see the relationships between the parts.
It will be important later, trust me!
LINEAR regression, however, only works well with LINEAR data.
❓ What if our data is "binary" instead? 🤔
This is common with many decision-making problems.
For instance:
🔸 𝐱: the room temperature 🌡️
🔸 𝐲: either 0 or 1, to turn the fan ON/OFF
If we try to naively use LINEAR REGRESSION to fit binary data, we will likely get a line that passes through both sets of points.
The example below shows the "best" fitting line, according to RMSE.
It's a bad fit. ❌
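You can reproduce the problem with a hypothetical version of the fan dataset (the temperatures here are invented for illustration):

```python
import numpy as np

# Hypothetical training data: room temperature (°C) -> fan OFF (0) / ON (1).
temp = np.array([15.0, 17.0, 19.0, 25.0, 27.0, 29.0])
fan = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w, b = np.polyfit(temp, fan, 1)   # degree-1 least-squares fit
print(np.round(w * temp + b, 2))  # gradual values that even overshoot 0 and 1
```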
❓ Can we "fix" LINEAR REGRESSION? 🤔
In *this* special case, we can!
Let's find a DIFFERENT line. Not the one that BEST FITS the data, but the one that BEST SEPARATES the data.
So that:
🔹 When 𝐲 ≤ 0, we return 0 (turn fan OFF 🔴)
🔹 When 𝐲 > 0, we return 1 (turn fan ON 🔵)
To do that, we need to update our MODEL:
𝐲 = 𝐬(𝐱𝐰 + 𝐛)
where 𝐬() is the HEAVISIDE STEP function.
That will be the ACTIVATION FUNCTION of our network.
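Here is what that model can look like in code, a minimal sketch where the weights (switch ON above 22 °C) are hand-picked by me, not trained:

```python
import numpy as np

def heaviside(z):
    # Step activation: 0 when z <= 0, 1 when z > 0.
    return np.where(z > 0, 1.0, 0.0)

def perceptron(x, w, b):
    # y = step(x*w + b): the line is squashed into a hard 0/1 decision.
    return heaviside(x * w + b)

temp = np.array([15.0, 19.0, 25.0, 29.0])
print(perceptron(temp, 1.0, -22.0))  # [0. 0. 1. 1.] -> fan ON above 22 °C
```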
Other commonly used AFs are:
🔹 Sigmoid
🔹 Tanh
🔹 Rectified Linear Unit (ReLU)
🔹 Leaky ReLU
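For reference, here is a sketch of those four AFs in NumPy (these are the standard textbook definitions, nothing specific to this thread):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # smooth squash into (0, 1)

def tanh(z):
    return np.tanh(z)                     # smooth squash into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)             # 0 below zero, identity above

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope instead of a flat 0
```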
Ultimately, the ACTIVATION FUNCTION is where the magic happens, because it adds NON-LINEARITY to our MODEL. ✨
This gives us the power to fit virtually any type of data! 🔮
This is (more or less!) what a PERCEPTRON is: the grandfather of modern Neural Networks. 🧠
Now that we have PERCEPTRONs, let's see how we can use them as the building blocks of more complex networks.
For instance, let's imagine more complex training data ("ternary" data? 🤔).
A perceptron can only fit 2/3 of the data.
So, why not use... THREE of them?
The FIRST perceptron fits the first 2/3 of the data:
1️⃣ 🔴🔵 ⚫️
The SECOND perceptron fits the last 2/3 of the data:
2️⃣ ⚫️ 🔵🔴
What's left to do now is to use a THIRD perceptron to merge the first two (whose outputs are 𝐲₁ & 𝐲₂):
3️⃣ 𝐲 = (𝐲₁ + 𝐲₂)/2 - 0.5
This is a better view of the resulting network, with each colour indicating a different perceptron.
Pretty neat, right?
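Here is the whole three-perceptron trick as a runnable sketch; the thresholds (20 and 26) are hand-picked by me for illustration, not trained:

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1.0, 0.0)

def bump_network(x):
    y1 = step(x * 1.0 - 20.0)         # 1️⃣ ON for x > 20
    y2 = step(x * -1.0 + 26.0)        # 2️⃣ ON for x < 26
    return step((y1 + y2) / 2 - 0.5)  # 3️⃣ merges the two (an AND)

x = np.array([15.0, 22.0, 24.0, 30.0])
print(bump_network(x))  # [0. 1. 1. 0.]: ON only in the middle band
```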
Training a network like this requires finding the 7 PARAMETERS so that our model fits the training data best.
Modern NNs can have MILLIONS of parameters. 🤯
If we translate that network back into its equation, you can immediately see how messy that looks.
You probably would have never come up with this yourself. But when you think in terms of curve fitting, it becomes much easier to understand.
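To make "finding some numbers" concrete, here is a deliberately naive random search over the 7 parameters (real training uses gradient-based optimisers, but the goal, minimising the error over the examples, is exactly the same):

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1.0, 0.0)

def network(x, p):
    w1, b1, w2, b2, v1, v2, c = p  # the 7 parameters
    y1 = step(x * w1 + b1)
    y2 = step(x * w2 + b2)
    return step(y1 * v1 + y2 * v2 + c)

# Toy "ternary" data: ON only in the middle band.
x = np.array([15.0, 18.0, 22.0, 24.0, 28.0, 31.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])

rng = np.random.default_rng(0)
best_p, best_err = None, np.inf
for _ in range(20000):
    p = rng.uniform(-40.0, 40.0, size=7)              # guess all 7 numbers
    err = np.sqrt(np.mean((network(x, p) - y) ** 2))  # RMSE again
    if err < best_err:
        best_p, best_err = p, err
print(best_err)  # shrinks as the search runs; hopeless with millions of params
```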
At this point, you might wonder...
❓ What does all of this have to do with WHY Neural Networks are so effective? 🤔
Because we have just built an AND gate!
Likewise, we can also build OR and NOT gates, de facto proving that NNs can emulate ANY logic circuit; add a way to store state (i.e. recurrence) and they become TURING COMPLETE! 🖥️
This means they can perform ANY computation that a more "traditional" computer can. 🖥️
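Here is a sketch of those gates as single perceptrons, with the usual hand-picked textbook weights:

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1.0, 0.0)

def AND(a, b): return step(a + b - 1.5)  # 1 only when both inputs are 1
def OR(a, b):  return step(a + b - 0.5)  # 1 when at least one input is 1
def NOT(a):    return step(0.5 - a)      # inverts its input

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", int(AND(a, b)), int(OR(a, b)), int(NOT(a)))
```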
To continue our analogy with CURVE FITTING, it means that Neural Networks have the potential to fit ANY curve in ANY number of dimensions, with as much precision as you want. 🤯
Any arbitrary 2D curve can potentially be recreated by a NN, in just three steps:
1️⃣ Slice the original shape into thin sections 🔪
2️⃣ Fit each section with a perceptron (AND)
3️⃣ Use a perceptron to merge all sections (OR)
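Below is a toy version of that recipe (my own construction, with sin(2πx) as an arbitrary example curve): each slice becomes a rectangular "bump" built from two steps and an AND, and summing the bumps plays the role of the OR:

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1.0, 0.0)

def bump(x, lo, hi, height):
    # 2️⃣ AND of two step perceptrons: a rectangle of the given height on (lo, hi).
    return height * step(step(x - lo) + step(hi - x) - 1.5)

def approximate(x, f, n=50):
    # 1️⃣ slice [0, 1] into n thin sections; 3️⃣ merge all the bumps.
    edges = np.linspace(0.0, 1.0, n + 1)
    mids = (edges[:-1] + edges[1:]) / 2
    return sum(bump(x, lo, hi, f(m))
               for lo, hi, m in zip(edges[:-1], edges[1:], mids))

f = lambda t: np.sin(2 * np.pi * t)    # any curve you like
x = np.arange(0.07, 1.0, 0.14)         # a few sample points
print(np.round(approximate(x, f), 2))  # a staircase version of sin(2πx)
```

Increasing n makes the slices thinner and the staircase closer to the true curve: that's the "as much precision as you want" part.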
You can see here that very same principle applied to the design of a Neural Network.
This NN now has 33 parameters to fit, meaning that our search problem takes place in a 33-dimensional space.
That is nothing compared to the millions of parameters many modern NNs have.
This is, in a nutshell, what Machine Learning is really about.
Making decisions...
...by learning from examples...
...by fitting a curve...
...by finding some numbers...
...that minimise the error of our model over a set of examples.