What you see below is a 2D representation of the MNIST dataset.
It was produced by t-SNE, a completely unsupervised algorithm. The labels were unknown to it, yet it almost perfectly separates the classes. The result is amazing.
This is how the magic is done!
🧵 👇🏽
Even though real-life datasets can have several thousand features, often the data itself lies on a lower-dimensional manifold.
Dimensionality reduction aims to find these manifolds to simplify data processing down the line.
So, we have data points 𝑥ᵢ in a high-dimensional space and are looking for lower-dimensional representations 𝑦ᵢ.
We want the 𝑦ᵢ-s to preserve as many properties of the original as possible.
For instance, if 𝑥ᵢ is close to 𝑥ⱼ, we want 𝑦ᵢ to be close to 𝑦ⱼ as well.
The t-Distributed Stochastic Neighbor Embedding algorithm (t-SNE) achieves this by modeling the dataset with a dimension-agnostic probability distribution, then finding a lower-dimensional representation whose distribution closely matches it.
Let's unravel this a bit.
Since we want to capture an underlying cluster structure, we define a probability distribution on the 𝑥ᵢ-s that reflects this.
For each data point 𝑥ⱼ, we model the probability of 𝑥ᵢ belonging to the same class ("being neighbors") with

p(𝑖|𝑗) = exp(−‖𝑥ᵢ − 𝑥ⱼ‖² / 2σⱼ²) / Σₖ≠ⱼ exp(−‖𝑥ₖ − 𝑥ⱼ‖² / 2σⱼ²).
Essentially, we model each point's neighbors with a Gaussian distribution.
The variance σⱼ is a hyperparameter, but we don't set it directly. Instead, we specify the expected number of neighbors, called the perplexity, and each σⱼ is computed to match it.
To make the optimization easier, these probabilities are symmetrized: 𝑝ᵢⱼ = (p(𝑖|𝑗) + p(𝑗|𝑖)) / 2𝑁, where 𝑁 is the number of data points. With these symmetric probabilities, we form the distribution 𝑃.
This represents our high-dimensional data.
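A minimal NumPy sketch of how 𝑃 could be built, assuming the σⱼ-s are already given. (Real t-SNE implementations find each σⱼ by binary search so that p(·|𝑗) matches the user-specified perplexity; that step is omitted here.)

```python
import numpy as np

def conditional_probs(X, sigmas):
    """p(i|j): Gaussian neighbor probabilities around each point x_j.

    Sketch only: real t-SNE tunes each sigma_j by binary search
    so that the perplexity of p(.|j) matches the chosen value.
    """
    # Pairwise squared Euclidean distances D[i, j] = ||x_i - x_j||^2.
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-D / (2 * sigmas[None, :] ** 2))
    np.fill_diagonal(P, 0.0)           # a point is not its own neighbor
    P /= P.sum(axis=0, keepdims=True)  # normalize each column: p(i|j)
    return P

def joint_probs(X, sigmas):
    """The symmetrized joint distribution P used by t-SNE."""
    n = X.shape[0]
    P_cond = conditional_probs(X, sigmas)
    return (P_cond + P_cond.T) / (2 * n)
```

The resulting matrix is symmetric and sums to 1, so it is a genuine probability distribution over pairs of points.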
Similarly, we define a distribution for the 𝑦ᵢ-s, our (soon to be identified) lower-dimensional representation. This distribution is denoted with 𝑄.
Here, we model the "neighborhood-relation" with the Student t-distribution (with one degree of freedom): 𝑞ᵢⱼ ∝ (1 + ‖𝑦ᵢ − 𝑦ⱼ‖²)⁻¹. Its heavier tails keep moderately distant points from crowding together in the low-dimensional map.
This is where the t in t-SNE comes from.
Our goal is to find the 𝑦ᵢ-s through optimization such that 𝑃 and 𝑄 are as close as possible (in a distributional sense).
This closeness is expressed with the Kullback-Leibler divergence.
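For discrete distributions, KL(𝑃 ‖ 𝑄) = Σᵢⱼ 𝑝ᵢⱼ log(𝑝ᵢⱼ / 𝑞ᵢⱼ). A quick numeric sanity check in plain NumPy (a generic version, not the t-SNE-specific pairwise one):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum p * log(p / q); eps guards against log(0)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

# Identical distributions have zero divergence...
p = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, p))  # 0.0
# ...and the divergence grows as Q drifts away from P.
q = np.array([0.2, 0.3, 0.5])
print(kl_divergence(p, q))  # ≈ 0.2749
```

Note that KL divergence is not symmetric: KL(𝑃 ‖ 𝑄) ≠ KL(𝑄 ‖ 𝑃) in general.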
We have successfully formulated the dimensionality reduction problem as optimization!
From here, we just calculate the gradient of KL divergence with respect to the 𝑦ᵢ-s and find an optimum with gradient descent.
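In practice, you rarely implement this loop yourself. A minimal sketch with scikit-learn (assuming it is installed), run on its small 8×8 digits dataset as a stand-in for MNIST:

```python
# scikit-learn's TSNE performs the KL-gradient descent for us.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 8x8 digit images, a small MNIST stand-in
X, y = X[:500], y[:500]               # subsample to keep the run quick

Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Y.shape)  # (500, 2): one 2D point per digit, ready to scatter-plot by label
```

Coloring the resulting 2D points by their (unused!) labels reproduces the kind of picture shown in this thread.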
The results are something like below, where you can see t-SNE's output on the MNIST handwritten digits dataset.
I have already shown this at the beginning of the thread. Now you understand how it was made.
This is one of the most powerful illustrations in machine learning.
Interactive visualizations and learning resources!
1) If you want to explore how t-SNE works on MNIST, here is a little app where you can play around with the 3D point cloud it generates.
2) You should also read the original paper "Visualizing High-Dimensional Data Using t-SNE" by Laurens van der Maaten and Geoffrey Hinton. This is where the method was first introduced.
There is a mathematical formula so beautiful that it is almost unbelievable.
Euler's identity combines the famous numbers 𝑒, 𝑖, π, 0, and 1 in a single constellation. At first sight, most people doubt that it is true. Surprisingly, it is.
This is why.
🧵 👇🏽
Let's talk about the famous exponential function 𝑒ˣ first.
Have you ever thought about how this is calculated in practice? After all, raising an irrational number to an arbitrary power is not trivial.
It turns out that the function can be written as an infinite sum: 𝑒ˣ = Σₙ xⁿ/n!
In fact, this can be done with many other functions.
For those that are differentiable infinitely many times, there is a recipe to find the infinite sum form. This form is called the Taylor expansion.
It does not always yield the original function, but it works for 𝑒ˣ.
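A tiny sketch of the idea: summing the first few terms of the series already gives an excellent approximation of 𝑒ˣ.

```python
import math

def exp_taylor(x, terms=30):
    """Approximate e^x by a partial sum of its Taylor series:
    e^x = 1 + x + x^2/2! + x^3/3! + ...
    """
    total, term = 0.0, 1.0  # term starts at x^0 / 0! = 1
    for n in range(terms):
        total += term
        term *= x / (n + 1)  # turns x^n/n! into x^(n+1)/(n+1)!
    return total

print(exp_taylor(1.0))  # ≈ 2.718281828..., matching math.exp(1.0)
```

Each new term is obtained from the previous one by a single multiplication, which is why this converges so cheaply in practice.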
Creative abuse of rules can lead to game-changing discoveries.
In high school, you learned that -1 has no square roots. Yet, by ignoring this, you'll soon discover something that changed mathematics forever: complex numbers.
Follow along, and you'll see how!
🧵 👇🏽
Let's start with a very simple equation:
𝑥² + 1 = 0
Can we solve this? Not at first glance, since the left side of the equation is always at least one, so it can never be zero. This is equivalent to solving
𝑥² = -1,
which is (apparently) not possible.
But let's disregard this and imagine a number whose square is -1.
Let's appropriately name it the 𝑖𝑚𝑎𝑔𝑖𝑛𝑎𝑟𝑦 𝑛𝑢𝑚𝑏𝑒𝑟 and denote it with 𝑖.
So, 𝑖² = -1.
Now that we have this strange entity, what can we do?
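Quite a lot, it turns out. Complex numbers are so useful that Python ships with them built in (the literal `j` plays the role of 𝑖):

```python
# Python's built-in complex type: 1j is the imaginary unit i.
i = 1j
print(i ** 2)                # (-1+0j): the defining property i^2 = -1

# Arithmetic just works, following the usual rules plus i^2 = -1:
print((2 + 3j) * (1 - 1j))   # (5+1j)
```

Expanding the product by hand, (2 + 3𝑖)(1 − 𝑖) = 2 − 2𝑖 + 3𝑖 − 3𝑖² = 2 + 𝑖 + 3 = 5 + 𝑖, which matches.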
One of the biggest misconceptions regarding education is that its main purpose is to give knowledge you can immediately use.
It is not.
The best thing education can give you is the mental agility to obtain knowledge at the speed of light.
Let's unpack this idea a bit!
1/7
Consider a course where you build a custom neural network framework with NumPy.
This is hardly usable in practice: working with a custom library is insane.
However, if you know how they are built, you only need to learn the interface to master an actual framework!
2/7
By understanding how the framework is built and how the underlying algorithms work, you'll be able to do much more: experiment with custom optimizers, implement your own layers, etc.
3/7