Convolution is not the easiest operation to understand: it involves functions, sums, and two moving parts.

However, there is an illuminating explanation — with probability theory!

There is a whole new aspect of convolution that you (probably) haven't seen before.

🧵 👇🏽
In machine learning, convolutions are most often applied to images, but to make our job easier, let's take a step back and work in one dimension.

There, convolution is defined as follows.
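For two sequences 𝑓 and 𝑔,

(𝑓 ∗ 𝑔)(𝑛) = Σₖ 𝑓(𝑘) 𝑔(𝑛 − 𝑘),

where the sum runs over all integers 𝑘.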
Now, let's forget about this formula for a while and talk about a simple probability distribution: we toss two 6-sided dice and study the resulting values.

To formalize the problem, let 𝑋 and 𝑌 be two random variables, describing the outcome of the first and second toss.
Just for fun, let's also assume that the dice are not fair, so they don't follow a uniform distribution.

The distributions might be completely different.

We only have a single condition: 𝑋 and 𝑌 are independent.
What is the distribution of the sum 𝑋 + 𝑌?

Let's see a simple example. What is the probability that the sum is 4?

That can happen in three ways:

𝑋 = 1 and 𝑌 = 3,
𝑋 = 2 and 𝑌 = 2,
𝑋 = 3 and 𝑌 = 1.
Since 𝑋 and 𝑌 are independent, the joint probability can be calculated by multiplying the individual probabilities together.

Moreover, the three possible ways are disjoint, so the probabilities can be summed up.
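Putting the two observations together:

P(𝑋 + 𝑌 = 4) = P(𝑋 = 1) P(𝑌 = 3) + P(𝑋 = 2) P(𝑌 = 2) + P(𝑋 = 3) P(𝑌 = 1).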
In the general case, the formula is the following.
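P(𝑋 + 𝑌 = 𝑛) = Σₖ P(𝑋 = 𝑘) P(𝑌 = 𝑛 − 𝑘)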

(Don't worry about the index 𝑘 running from minus infinity to infinity. All but a few terms are zero, so they simply drop out.)

Does it look familiar?
This is convolution.

We can immediately see this by denoting the individual distributions with 𝑓 and 𝑔.
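With 𝑓(𝑘) = P(𝑋 = 𝑘) and 𝑔(𝑘) = P(𝑌 = 𝑘), the sum above becomes

P(𝑋 + 𝑌 = 𝑛) = Σₖ 𝑓(𝑘) 𝑔(𝑛 − 𝑘) = (𝑓 ∗ 𝑔)(𝑛).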

The same explanation works when the random variables are continuous, or even multidimensional.

The only thing that is required is independence.
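For continuous random variables, the sum turns into an integral: if 𝑋 and 𝑌 are independent with density functions 𝑓 and 𝑔, then the density of 𝑋 + 𝑌 is

(𝑓 ∗ 𝑔)(𝑥) = ∫ 𝑓(𝑡) 𝑔(𝑥 − 𝑡) d𝑡,

with the integral taken over the whole real line.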
Even though images are not probability distributions, this viewpoint helps us make sense of the moving parts and the everyone-with-everyone sum.
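To make this concrete, here is a minimal sketch in Python. (NumPy and the particular loaded-dice probabilities are my own additions for illustration; they are not from the original thread.) It computes the distribution of 𝑋 + 𝑌 by the brute-force everyone-with-everyone sum, then checks that the result equals the convolution of the two probability mass functions.

```python
import numpy as np

# Two loaded six-sided dice: arbitrary non-uniform probabilities for faces 1..6.
# These numbers are made up for illustration; any valid distributions would work.
f = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])  # P(X = 1), ..., P(X = 6)
g = np.array([0.30, 0.20, 0.15, 0.15, 0.10, 0.10])  # P(Y = 1), ..., P(Y = 6)

# Brute force: P(X + Y = n) is the sum of P(X = i) * P(Y = j) over all pairs with i + j = n.
# The possible sums run from 2 to 12, so the result has 11 entries.
brute_force = np.zeros(11)
for i in range(1, 7):
    for j in range(1, 7):
        brute_force[i + j - 2] += f[i - 1] * g[j - 1]

# The same distribution, computed as the convolution of the two mass functions.
convolved = np.convolve(f, g)

print(np.allclose(brute_force, convolved))  # True
```

Both arrays have one entry for each possible sum from 2 to 12, and the convolution reproduces the brute-force result exactly.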

If you would like to see an even simpler visualization, just think about this:

• • •
