Principal Component Analysis is one of the most fundamental techniques in data science.

Despite its simplicity, it has several equivalent forms that you might not have seen.

In this thread, we'll explore what PCA is really doing!

🧵 👇🏽
PCA is most commonly introduced as an algorithm that iteratively finds vectors in the feature space that are

• orthogonal to the previously identified vectors,
• and maximize the variance of the data projected onto them.

These vectors are called the principal components.
The idea behind this is that we want features that convey as much information as possible.

Low variance means that the feature is concentrated around a single value, so its value is easy to predict and it carries little information.

Features with low enough variance can even be omitted.
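
To make "maximizing the variance of the projected data" concrete, here is a minimal sketch in Python (the toy dataset and the candidate directions are made up for illustration): projecting the data onto a unit vector gives a single new feature, and its variance depends on which direction we pick.

```python
import numpy as np

rng = np.random.default_rng(0)
# a made-up dataset: large spread along the first axis, tiny spread along the second
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 0.0], [0.0, 0.2]], size=1000)

def projected_variance(X, w):
    w = w / np.linalg.norm(w)   # normalize to a unit vector
    return np.var(X @ w)        # variance of the 1-dimensional projection

print(projected_variance(X, np.array([1.0, 0.0])))  # ≈ 3: high variance, informative direction
print(projected_variance(X, np.array([0.0, 1.0])))  # ≈ 0.2: almost constant, easy to predict
```

PCA picks the direction where this number is largest, then repeats the search among the directions orthogonal to it.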
However, there is an alternative approach.

Check out our simple dataset below. The features are not only suboptimal in terms of variance, but they are also correlated!

If 𝑥₁ is small, 𝑥₂ is large. If 𝑥₁ is large, 𝑥₂ is small. One holds information about the other!
This is suboptimal. In real datasets with thousands of features, getting rid of the ones that contain no new information makes our job easier.
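
Here is a toy version of such a dataset (the numbers are made up; only the negative correlation matters). The redundancy shows up directly as a large negative off-diagonal entry in the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
x1 = rng.normal(size=1000)
x2 = -x1 + 0.3 * rng.normal(size=1000)   # x2 is large when x1 is small, and vice versa
X = np.column_stack([x1, x2])

print(np.cov(X, rowvar=False))
# the off-diagonal entries are strongly negative: each feature holds information about the other
```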

So, let's decorrelate the features!
Since the covariance matrix Σ is real and symmetric, the spectral decomposition theorem says that we can diagonalize it with an orthogonal matrix: Σ = 𝑈ᵀ𝐷𝑈, where 𝐷 is diagonal and 𝑈 is orthogonal.

(See the spectral theorem: en.wikipedia.org/wiki/Symmetric…)
Due to the properties of covariance, we can see that the diagonal matrix 𝐷 is itself a covariance matrix: it is the covariance matrix of the transformed dataset 𝑌 = 𝑋𝑈ᵀ!

Moreover, it turns out that the row vectors of 𝑈 are the principal components!
This is how the dataset looks after the transformation.

Due to its construction, the features of 𝑌 are uncorrelated. The spectral decomposition theorem also guarantees that the k-th principal component is orthogonal to the previous ones and maximizes the variance of the projected data among the remaining directions, exactly as in the iterative description we started with.
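
Here is a minimal sketch of the whole construction with NumPy, on a made-up correlated dataset (not the one from the original thread). It uses np.linalg.eigh for the spectral decomposition; the rows of U play the role of the principal components, and Y is the transformed dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
x1 = rng.normal(size=1000)
X = np.column_stack([x1, -x1 + 0.3 * rng.normal(size=1000)])   # two correlated features

Xc = X - X.mean(axis=0)                    # center the data
cov = np.cov(Xc, rowvar=False)             # real, symmetric covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)     # spectral decomposition: cov = eigvecs @ diag(eigvals) @ eigvecs.T
order = np.argsort(eigvals)[::-1]          # sort components by decreasing variance
U = eigvecs[:, order].T                    # rows of U are the principal components

Y = Xc @ U.T                               # the transformed dataset
print(np.cov(Y, rowvar=False).round(3))    # (numerically) diagonal: the new features are uncorrelated
```

The diagonal of cov(Y) consists of the eigenvalues, i.e. the variances along the principal components, sorted from largest to smallest.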
This is PCA in broad strokes. If you are interested in the finer details, I have written a blog post about it. Check it out!

towardsdatascience.com/understanding-…
If you enjoyed this explanation, consider following me and hitting a like/retweet on the first tweet of the thread!

I regularly post simple explanations of seemingly complicated concepts in machine learning, so make sure you don't miss out on the next one!


More from @TivadarDanka

27 Apr
Have you ever wondered why we include the logarithm in the definition of the log-likelihood?

The answer is simple: the logarithm makes differentiation of products easier.

Let's see why!

🧵 👇🏽
Although the derivative of a sum is the sum of derivatives, a similar property cannot be stated about the product of functions.

The derivative of a product is slightly more complicated: it is a sum of products.
The formula gets even more complicated when we have more functions in the product.

When potentially hundreds of terms are present, like in the likelihood function, computing this is not feasible.
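
As a minimal illustration (a made-up coin-flip example, not part of the original thread): the likelihood is a product over all observations, but its logarithm is a sum, so its derivative is just a sum of simple per-observation terms, and the maximum is easy to find.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=500)   # observed coin flips, true probability of heads p = 0.7

def log_likelihood(p, x):
    # log of prod_i p^x_i (1 - p)^(1 - x_i)  =  sum_i [x_i log p + (1 - x_i) log(1 - p)]
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def d_log_likelihood(p, x):
    # the derivative of a sum is the sum of derivatives: one simple term per observation
    return np.sum(x / p - (1 - x) / (1 - p))

# setting the derivative to zero gives the maximum likelihood estimate p = mean(x)
print(x.mean(), d_log_likelihood(x.mean(), x))   # derivative ≈ 0 at the sample mean
```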
26 Apr
Machine learning has enabled scientific breakthroughs in several fields.

Biotechnology is one of the most fascinating, as researchers can now perform mind-blowing tasks with the new tools.

Here are my favorite problems that machine learning helps to solve!

🧵 👇🏽
These are the topics we are going to talk about:

1. Predicting protein structure from amino acid sequences.
2. Accelerating high-throughput screening for drug discovery.
3. Mapping out the human cell atlas.
4. Precision medicine.

Let's dive in!
1. Predicting protein structure from amino acid sequences.

Proteins are the workhorses of biology. In our body, countless processes are controlled by proteins. They enable life. Yet compared to their importance, we know very little about them!
19 Apr
Softmax is one of the most commonly used functions in machine learning.

It is used to transform high-level features into probabilities. Judging from the formula alone, it is hard to imagine how this is done exactly.

Softmax might not be what you think it is. Let's find out why!

🧵 👇🏽
First, we start with the exponential function eˣ, which transforms a real number into a positive one.

It has a feature that shows the geometry of this transformation: it turns addition into multiplication.

In particular, eᵃ⁺ᵇ = eᵃeᵇ holds.
The input x = (x₁, x₂, ..., xₙ) consists of the highest level features: the class scores.

For two vectors x and y, xᵢ - yᵢ expresses the difference between features.

After the exponential function, this is transformed into their ratio.
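
A small sketch of this behavior (my own illustration, not from the thread): softmax exponentiates the scores and normalizes them, so only the differences between scores matter. Shifting every score by the same constant leaves the output unchanged, and a score difference of d turns into a probability ratio of eᵈ.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())   # subtracting the max doesn't change the result, but avoids overflow
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)

print(p)                                            # probabilities summing to 1
print(softmax(scores + 100.0))                      # identical output: only score differences matter
print(p[0] / p[1], np.exp(scores[0] - scores[1]))   # ratio of probabilities = exp of the score difference
```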
16 Apr
In the last 24 hours, more than 400 of you decided to follow me. Thank you, I am honored!

As you probably know, I love explaining complex machine learning concepts simply. I have collected some of my past threads for you to make sure you don't miss out on them.

Enjoy!
15 Apr
In machine learning, the inner product (or dot product) of vectors is often used to measure similarity.

However, the formula is far from revealing. What does the sum of coordinate products have to do with similarity?

There is a very simple geometric explanation!

🧵 👇🏽
There are two key things to observe.

First, the inner product is linear in both variables. This property is called bilinearity.
Second, the inner product is zero if the vectors are orthogonal.
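
A quick sketch of the geometric reading (the toy vectors are my own): the inner product equals ‖x‖ ‖y‖ cos θ, so after dividing by the norms, what remains is exactly the cosine of the angle between the vectors, which is the usual similarity score.

```python
import numpy as np

def cosine_similarity(x, y):
    # inner product of the normalized vectors = cos(angle between x and y)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))    #  1.0: same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))    #  0.0: orthogonal
print(cosine_similarity(np.array([1.0, 1.0]), np.array([-1.0, -1.0])))  # -1.0: opposite directions
```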
13 Apr
Convolution is not the easiest operation to understand: it involves functions, sums, and two moving parts.

However, there is an illuminating explanation — with probability theory!

There is a whole new aspect of convolution that you (probably) haven't seen before.

🧵 👇🏽
In machine learning, convolutions are most often applied to images, but to make our job easier, we shall take a step back and go to one dimension.

There, the convolution of two sequences 𝑓 and 𝑔 is defined as (𝑓 ∗ 𝑔)(n) = Σₖ 𝑓(k) 𝑔(n − k).
Now, let's forget about this formula for a while, and talk about a simple probability distribution: we toss two 6-sided dice and study the resulting values.

To formalize the problem, let 𝑋 and 𝑌 be two random variables describing the outcomes of the first and second toss.
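
Here is a minimal sketch of that setup (my own code, not from the thread): the distribution of the sum X + Y is exactly the convolution of the two individual distributions, which np.convolve computes directly.

```python
import numpy as np

pmf = np.full(6, 1 / 6)            # P(X = k) for k = 1..6, a fair die
sum_pmf = np.convolve(pmf, pmf)    # P(X + Y = s) for s = 2..12

for s, p in enumerate(sum_pmf, start=2):
    print(f"P(X + Y = {s}) = {p:.4f}")
# the familiar triangular distribution: 7 is the most likely sum
```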
