PCA (Principal Component Analysis) is an unsupervised learning algorithm used to reduce the dimensionality of large datasets.

For this reason, it's commonly known as a dimensionality reduction algorithm.

PCA is one of those useful things that is not talked about enough. But there is a good reason to care 👇
PCA's ability to reduce the dimensionality of a dataset motivates several other use cases.

Below are some:

◆ To visualize high-dimensional datasets, since plotting more than 2 or 3 dimensions directly is impractical.
◆ To select the most useful features while discarding redundant features and useless information.

But not always: sometimes useful information will be lost too, especially if the original data was already clean and didn't contain noise.
◆ PCA can be used before training a typical machine learning model, simply to speed up training, since the reduced data is smaller and no longer contains redundant features.

A speed-up is not guaranteed, but in some cases it can help.
In many ML resources, you will find PCA in the category of unsupervised learning algorithms.

Below is a simple reason 👇
PCA reduces the dimensionality of a dataset without any instructions (labels, in other words) on how that should be done, other than the number of principal components to keep, much like specifying the number of clusters in KMeans clustering.
This thread is practical. The above was only a high-level overview.

Let's apply PCA to a dataset you may have heard of: the wine dataset.

We will load it from Scikit-Learn's built-in datasets.
It contains 178 data points, 13 features, and 1 target with 3 classes (0, 1, 2).
Now, let's apply PCA to the wine dataset. As a best practice, it's always good to scale the input data first.

So, first, I will standardize the data, that is, rescale the values to have zero mean and unit standard deviation.

Scikit-Learn will take care of that.
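A minimal sketch of that step, using Scikit-Learn's built-in load_wine and StandardScaler (variable names here are my own, not the original code):

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load the wine dataset: 178 samples, 13 features, 3 classes
wine = load_wine()
X, y = wine.data, wine.target

# Standardize: rescale each feature to zero mean and unit standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```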
In order to reduce the dimensionality of the dataset, we have to specify the number of principal components.

Think of principal components as new axes (coordinates) onto which we project the data.

Or as reduced features that hold most of the information in the dataset.
Usually, you will choose 2 or 3 components, and in most cases, 2 will be enough.

Below, I create a PCA object, choose the number of principal components, and fit it to the data.
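Something like this, reusing X_scaled from the previous snippet (a sketch, not the exact original code):

```python
from sklearn.decomposition import PCA

# Reduce the 13 features to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_scaled.shape)  # (178, 13)
print(X_pca.shape)     # (178, 2)
```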
After we have applied PCA to the dataset, you can see that the dimensionality has been reduced from 13 to 2.
As you can see below, visualizing the 2 principal components shows the 3 wine classes well separated.

In just 2 components! As you would guess, there would be no other way to fit the entire 13-dimensional dataset into a single scatterplot.
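One possible way to draw that scatterplot with matplotlib, assuming X_pca and y from the snippets above:

```python
import matplotlib.pyplot as plt

# Plot the 2 principal components, colored by wine class
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')
ax.legend(*scatter.legend_elements(), title='Class')
plt.show()
```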
There is one issue with PCA.

Interpreting the components is hard. You can look at the raw values of the components below, but on their own they say very little.
One way to interpret the results is to use a heatmap showing how much each original feature contributes to each principal component.

See below (though it may not be clear enough).
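One way to draw such a heatmap, assuming seaborn is available (each row of pca.components_ is one principal component, each column is one original feature):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Show how much each original feature weighs in each component
plt.figure(figsize=(12, 3))
sns.heatmap(pca.components_,
            cmap='coolwarm',
            xticklabels=wine.feature_names,
            yticklabels=['PC 1', 'PC 2'])
plt.tight_layout()
plt.show()
```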
We can also display the explained variance ratio to see the percentage of the dataset variance explained by each principal component.

pca.explained_variance_ shows the absolute amount of variance, whereas pca.explained_variance_ratio_ shows it as a fraction of the total.
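For example, with the fitted pca object from above:

```python
# Absolute variance captured by each component
print(pca.explained_variance_)

# Same information as a fraction of the total variance
print(pca.explained_variance_ratio_)
```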
The explained variance ratio in our case is [0.99809123, 0.00173592].

It means that 99.8% of the dataset variance lies on the first component, and 0.17% lies on the second component.

If you look back at the heatmap above, these ratios make sense when you read along the y-axis.
As a bonus, let's also use PCA to visualize the digits dataset. It has 64 dimensions: each digit image is 8×8 pixels.

We can use PCA to project those 64 dimensions into 2 components.
Like in the first example, we load the data, apply PCA with 2 principal components, and then visualize the information contained in those components.

Below, I load the digits data, apply PCA (to the unscaled pixels this time), and visualize the reduced digits.
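A compact sketch of all three steps (again, variable names are my own):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the digits dataset: 1797 images of 8x8 pixels = 64 features
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

# Project the 64 dimensions down to 2 components (no scaling this time)
pca_digits = PCA(n_components=2)
X_digits_pca = pca_digits.fit_transform(X_digits)

# Scatter the reduced digits, colored by digit label (0-9)
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(X_digits_pca[:, 0], X_digits_pca[:, 1],
                     c=y_digits, cmap='tab10')
fig.colorbar(scatter, ax=ax, label='Digit')
ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')
plt.show()
```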
This is so fantastic.

Imagine: we are able to visualize all 10 digit classes in one plot, just because we have reduced their dimensionality from 64 to 2.
This is the end of the thread.

The thread was about PCA. There is a whole lot of math behind it, but much of the time, a high-level understanding of things like this is enough to make them work.
Here are the key takeaways:

PCA is a dimensionality reduction algorithm. It reduces the dimensions of a dataset while preserving as much information as possible in fewer components.
It can also be used to:

◆ Visualize large datasets
◆ Remove redundant features
◆ Speed up model training (in some cases) when applied to the input data before training
Thank you for reading!

I am actively writing about machine learning techniques, concepts, and ideas.

You can support me by following @Jeande_d and sharing the first tweet with your friends who are interested in ML content.

More content to come 🙌🏻
