Have you ever thought about why neural networks are so powerful?

Why is it that no matter the task, you can find an architecture that knocks the problem out of the park?

One answer is that they can approximate any continuous function with arbitrary precision!

Let's see how!

🧵 👇🏽
From a mathematical viewpoint, machine learning is function approximation.

If you are given data points 𝑥 with observations 𝑦, learning essentially means finding a function 𝑓 such that 𝑓(𝑥) approximates the given 𝑦-s as accurately as possible.
Approximation is a very natural idea in mathematics.

Let's see a simple example!

You probably know the exponential function well. Do you also know how to calculate it?

The definition itself doesn't really help you: computing 𝑒ˣ directly is tough when 𝑥 is not an integer.
The solution is to approximate the exponential function with one that is easier to calculate.

Here, we use a polynomial for this purpose.

The larger 𝑛 is, the closer the approximation is to the true value.
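The classical polynomial for this purpose is the degree-𝑛 Taylor polynomial, 𝑒ˣ ≈ 1 + 𝑥 + 𝑥²/2! + … + 𝑥ⁿ/𝑛!. A minimal Python sketch of the idea:

```python
import math

def exp_taylor(x, n):
    """Approximate e^x with the degree-n Taylor polynomial sum_{k=0}^{n} x^k / k!."""
    return sum(x ** k / math.factorial(k) for k in range(n + 1))

# The larger n is, the closer the approximation is to the true value.
for n in (2, 5, 10):
    print(f"n = {n:2d}: {exp_taylor(1.5, n):.6f}  (exact: {math.exp(1.5):.6f})")
```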
The central problem of approximation theory is to provide a mathematical framework for these problems.

So, let's formulate this in terms of neural networks!

Suppose that there is an underlying function 𝑔(𝑥) that describes the true relation between data and observations.
Our job is to find an approximating function 𝑓(𝑥) that

• generalizes the knowledge from the data,
• and is computationally feasible.

If we assume that all of our data points lie in the set 𝑋, we would like a function 𝑓 for which the 𝑠𝑢𝑝𝑟𝑒𝑚𝑢𝑚 𝑛𝑜𝑟𝑚 of the difference 𝑓 − 𝑔 on 𝑋, that is, the largest value of |𝑓(𝑥) − 𝑔(𝑥)|, is as small as possible.
You can imagine this quantity by plotting both functions, coloring the area enclosed between the two graphs, and measuring the maximum spread of that area along the 𝑦 axis.
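In symbols, the quantity we want to keep small is

```latex
\| f - g \|_{\infty} = \sup_{x \in X} \, \lvert f(x) - g(x) \rvert .
```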
Mathematically speaking, a neural network with a single hidden layer is defined by the formula below, where φ can be any activation function, like the famous sigmoid.
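Written out in the standard notation (the same one Cybenko uses), a network with a single hidden layer of 𝑁 neurons computes

```latex
f(x) = \sum_{i=1}^{N} \alpha_i \, \varphi\!\left( w_i^{T} x + b_i \right),
```

where the 𝑤ᵢ are the weights, the 𝑏ᵢ the biases, the αᵢ the output coefficients, and φ the activation function.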

It satisfies our second criterion: it is easy to compute.
What about the other one? Is this function family enough to approximate any reasonable function?

This question is answered by the 𝑢𝑛𝑖𝑣𝑒𝑟𝑠𝑎𝑙 𝑎𝑝𝑝𝑟𝑜𝑥𝑖𝑚𝑎𝑡𝑖𝑜𝑛 𝑡ℎ𝑒𝑜𝑟𝑒𝑚, straight from 1989.

Let's see what it says!
(The result appeared in the paper

Cybenko, G. (1989) “Approximation by superpositions of a sigmoidal function”, Mathematics of Control, Signals, and Systems, 2(4), 303–314.

You can find it here: citeseerx.ist.psu.edu/viewdoc/downlo…)
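Paraphrasing the statement: if φ is a continuous sigmoidal function, then for every continuous function 𝑔 on the unit cube [0, 1]ⁿ and every ε > 0, there exist 𝑁, αᵢ, 𝑤ᵢ, and 𝑏ᵢ such that the network 𝑓 above satisfies

```latex
\sup_{x \in [0,1]^n} \lvert f(x) - g(x) \rvert < \varepsilon .
```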
Essentially, this means that as long as the activation function is sigmoid-like and the function to be approximated is continuous, a neural network with a single hidden layer can approximate it as precisely as you want.

(Or 𝑙𝑒𝑎𝑟𝑛 it in machine learning terminology.)
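To make this concrete, here is a minimal NumPy sketch of such a one-hidden-layer sigmoid network, fitted by plain gradient descent to a continuous target (the sine function); the width 𝑁, the learning rate, and the step count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a compact set, here sin on [-pi, pi].
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

# Single hidden layer: f(x) = sum_i a_i * sigmoid(w_i * x + b_i)
N = 50                                  # number of hidden neurons
w = rng.normal(size=(1, N))             # hidden weights
b = rng.normal(size=N)                  # hidden biases
a = rng.normal(size=(N, 1)) * 0.1       # output coefficients

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(10_000):
    h = sigmoid(x @ w + b)              # hidden activations, shape (200, N)
    f = h @ a                           # network output, shape (200, 1)
    err = f - y
    # Gradient descent on the mean squared error.
    grad_a = h.T @ err / len(x)
    d_h = (err @ a.T) * h * (1 - h)     # backprop through the sigmoid
    grad_w = x.T @ d_h / len(x)
    grad_b = d_h.mean(axis=0)
    a -= lr * grad_a
    w -= lr * grad_w
    b -= lr * grad_b

print("max |f(x) - sin(x)| on the grid:", float(np.abs(f - y).max()))
```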
Notice that the theorem is about neural networks with a single hidden layer.

This is actually not a restriction. The general case with multiple hidden layers follows easily from this one.

However, proving the theorem is easier in this simple situation.
There are certain caveats.

The main issue is that the theorem doesn't say anything about 𝑁, that is, the number of hidden neurons.

If you want a close approximation, 𝑁 can be really large.

From a computational perspective, this is highly suboptimal.
But this is not the only problem.

In practice, we have incomplete information about the function to be approximated. (This information is represented by our dataset.)

If we make our approximation too precise, we are overfitting on our dataset.
This is the same problem you might have when trying to fit a polynomial curve to the dataset.
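A rough sketch of that effect: fit polynomials of two different degrees to the same noisy sample (the trend, noise level, and degrees below are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Fifteen noisy samples of a simple underlying trend.
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# A degree-14 polynomial passes (almost) exactly through every noisy point...
overfit = np.polynomial.Polynomial.fit(x, y, deg=14)
# ...while a degree-3 polynomial captures the trend without chasing the noise.
sensible = np.polynomial.Polynomial.fit(x, y, deg=3)

grid = np.linspace(0, 1, 200)
trend = np.sin(2 * np.pi * grid)
print("degree 14, max deviation from the true trend:", float(np.abs(overfit(grid) - trend).max()))
print("degree  3, max deviation from the true trend:", float(np.abs(sensible(grid) - trend).max()))
```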
Overall, the universal approximation theorem is a theoretical result, providing a solid foundation for neural networks.

Without this, they wouldn't be as powerful as they are.
