Principal Component Analysis is one of the most fundamental techniques in data science.
Despite its simplicity, it has several equivalent forms that you might not have seen.
In this thread, we'll explore what PCA is really doing!
🧵 👇🏽
PCA is most commonly introduced as an algorithm that iteratively finds vectors in the feature space that
• are orthogonal to the previously identified vectors,
• and maximize the variance of the data projected onto them.
These vectors are called the principal components.
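If you like to see ideas in code, here is a minimal NumPy sketch of this variance-maximization view on a made-up 2D dataset (the numbers below are just illustrative assumptions, not from the thread): sweeping over unit directions and keeping the one with the largest projected variance recovers the top eigenvector of the covariance matrix, i.e., the first principal component.

```python
import numpy as np

# Made-up, correlated 2D toy data (illustrative assumption only).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, -1.5], [-1.5, 1.0]], size=1000)
X = X - X.mean(axis=0)  # center the data

# Candidate unit directions, swept over the plane.
angles = np.linspace(0, np.pi, 1000)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Variance of the data projected onto each direction.
projected_variance = ((X @ directions.T) ** 2).mean(axis=0)
best = directions[np.argmax(projected_variance)]

# The winner matches the top eigenvector of the covariance matrix (up to sign).
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
print(best, eigenvectors[:, -1])
```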
The idea behind this is that we want features that convey as much information as possible.
Low variance means that the feature is concentrated, so in principle, its value is easier to predict.
Features with low enough variances can even be omitted.
However, there is an alternative approach.
Check out our simple dataset below. The features are not only suboptimal in terms of variance, but they are also correlated!
If 𝑥₁ is small, 𝑥₂ is large. If 𝑥₁ is large, 𝑥₂ is small. One holds information about the other!
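You can see this numerically as well. On a hypothetical anti-correlated toy dataset (made-up numbers, not the one in the figure), the off-diagonal entries of the covariance matrix are far from zero and the correlation is close to -1.

```python
import numpy as np

# Toy anti-correlated features: when x1 is small, x2 tends to be large.
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = -0.9 * x1 + 0.3 * rng.normal(size=500)
X = np.stack([x1, x2], axis=1)

print(np.cov(X, rowvar=False))    # large negative off-diagonal entries
print(np.corrcoef(x1, x2)[0, 1])  # close to -1
```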
This is suboptimal. In real datasets with thousands of features, getting rid of the ones that contain no new information makes our job easier.
So, let's decorrelate the features!
Since the covariance matrix is real and symmetric, the spectral decomposition theorem says that we can diagonalize it with orthogonal matrices.
Due to the properties of covariance, we can see that the diagonalized covariance matrix is the covariance matrix of a transformed dataset!
Moreover, it turns out that the row vectors of 𝑈 are the principal components!
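In formulas, this looks roughly like the following (a sketch, assuming the data matrix 𝑋 is centered with samples as rows, and writing Σ for its covariance matrix):

```latex
\Sigma = U^{\top} \Lambda U,
\qquad U U^{\top} = U^{\top} U = I,
\qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)

Y = X U^{\top}
\quad \Longrightarrow \quad
\operatorname{cov}(Y) = U \, \operatorname{cov}(X) \, U^{\top}
= U U^{\top} \Lambda \, U U^{\top} = \Lambda
```

The rows of 𝑈 are the eigenvectors of Σ, and the eigenvalues on the diagonal of Λ are the variances of the transformed features.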
This is how the dataset looks after the transformation.
Due to its construction, the features of 𝑌 are uncorrelated. The spectral decomposition theorem also guarantees that the k-th principal component is orthogonal to the previous ones and maximizes the variance of the data projected onto it.
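If you prefer code over formulas, here is the same decorrelation step in a few lines of NumPy (again a sketch on made-up data, using the row-vector convention above):

```python
import numpy as np

# Made-up, correlated toy data (illustrative assumption only).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, -1.5], [-1.5, 1.0]], size=1000)
X = X - X.mean(axis=0)                   # center the data

cov = np.cov(X, rowvar=False)            # covariance matrix of the original features
eigenvalues, eigenvectors = np.linalg.eigh(cov)
U = eigenvectors.T                       # rows of U are the principal components

Y = X @ U.T                              # transform the dataset
print(np.cov(Y, rowvar=False).round(3))  # ≈ diagonal, with the eigenvalues on the diagonal
```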
This is PCA in broad strokes. If you are interested in the finer details, I have written a blog post about it. Check it out!
Machine learning has enabled scientific breakthroughs in several fields.
Biotechnology is one of the most fascinating, as researchers can now perform mind-blowing tasks with the new tools.
Here are my favorite problems that machine learning helps to solve!
🧵 👇🏽
These are the topics we are going to talk about:
1. Predicting protein structure from amino acid sequences.
2. Accelerating high-throughput screening for drug discovery.
3. Mapping out the human cell atlas.
4. Precision medicine.
Let's dive in!
1. Predicting protein structure from amino acid sequences.
Proteins are the workhorses of biology. In our body, myriads of processes are controlled by proteins. They enable life. Yet compared to their importance, we know so little about them!
In the last 24 hours, more than 400 of you decided to follow me. Thank you, I am honored!
As you probably know, I love explaining complex machine learning concepts simply. I have collected some of my past threads for you to make sure you don't miss out on them.
Convolution is not the easiest operation to understand: it involves functions, sums, and two moving parts.
However, there is an illuminating explanation — with probability theory!
There is a whole new aspect of convolution that you (probably) haven't seen before.
🧵 👇🏽
In machine learning, convolutions are most often applied to images, but to make our job easier, we shall take a step back and go to one dimension.
There, convolution is defined as below.
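For two sequences 𝑓 and 𝑔, the standard discrete form is the one below (the continuous case replaces the sum with an integral):

```latex
(f * g)(n) = \sum_{k=-\infty}^{\infty} f(k) \, g(n - k)
```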
Now, let's forget about these formulas for a while and talk about a simple probability distribution: we toss two 6-sided dice and study the resulting values.
To formalize the problem, let 𝑋 and 𝑌 be two random variables, describing the outcome of the first and second toss.
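If you want to follow along in code, here is a tiny NumPy sketch of this setup (assuming fair 6-sided dice): the distribution of the sum 𝑋 + 𝑌 pops out of np.convolve applied to the two probability mass functions.

```python
import numpy as np

# Probability mass functions of X and Y: a fair 6-sided die each (assumption).
pmf_x = np.full(6, 1 / 6)   # P(X = 1), ..., P(X = 6)
pmf_y = np.full(6, 1 / 6)   # P(Y = 1), ..., P(Y = 6)

# The distribution of the sum X + Y is the convolution of the two PMFs.
pmf_sum = np.convolve(pmf_x, pmf_y)   # P(X + Y = 2), ..., P(X + Y = 12)

for total, p in zip(range(2, 13), pmf_sum):
    print(f"P(X + Y = {total}) = {p:.4f}")
```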