Principal Component Analysis is a commonly used method for dimensionality reduction.
It's a good example of how fairly complex math can have an intuitive explanation and be easy to use in practice.
Let's start with the application of PCA 👇
Dimensionality Reduction
This is one of the common uses of PCA in machine learning.
Imagine you want to predict house prices. You get a large table of many houses and different features for them like size, number of rooms, location, age, etc.
Some features seem correlated 👇
Correlated features
For example, the size of the house is correlated with the number of rooms. Bigger houses tend to have more rooms.
Another example could be the age and the year the house was built - they give us pretty much the same information.
We don't want that 👇
Curse of Dimensionality
In general, we want to have fewer features, because of the Curse of Dimensionality.
The amount of data required to fit a model increases exponentially with the number of features. Therefore, having many features telling us the same thing is bad.
👇
Remove features
Then, let's remove the features we don't need. But how do you select these?
Take a look at the code. Using scikit-learn and PCA you can take a dataset containing many features and transform it into a dataset with fewer features (5 in the example).
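Here is a minimal sketch of what that could look like, assuming your features are already in a NumPy array X (the array and its 20 columns below are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy dataset: 100 houses, 20 (partly redundant) numeric features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))

# Ask PCA for 5 new features that compress the original 20 as well as possible
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (100, 20)
print(X_reduced.shape)  # (100, 5)
```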
👇
What you can use PCA for is essentially to create new features that "compress" the information of the full feature set as well as possible.
In this way, redundant features will effectively be removed. This is what we call dimensionality reduction.
Why does it work? 👇
To understand better what happens, let's look at a specific example using the awesome visualization by @vicapow.
I'll post images and videos in this thread, but I encourage you to click the link and experiment yourself a bit.
Guess what, you are now doing dimensionality reduction!
You see, you are looking at a 3D model on a 2D screen. While every point in the model is defined by 3 features (its coordinates), you only ever see a 2D projection of it on the screen.
👇
As the model rotates, you can observe that some views display the data better than others.
Look at the screenshots I took below. In the first one, you can clearly see that there are 3 clusters in the data, while in the second you only see 2.
Can PCA help? 👇
I wrote above that PCA will find a new, reduced set of features that optimally encodes the full dataset.
So, if we apply it to the 3D model, we get a 2D view of the data that clearly shows the 3 clusters. The optimal view!
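Roughly what happens, in code (synthetic 3D clusters standing in for the data in the visualization):

```python
import numpy as np
from sklearn.decomposition import PCA

# Three synthetic clusters in 3D
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [5, 5, 0], [0, 5, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

# PCA picks the 2D "view" of the data with the highest variance,
# which keeps the three clusters visibly separated
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (150, 2)
```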
👇
Now imagine that the 3 dimensions are features describing a house.
And the good thing is that PCA doesn't care how many dimensions you have or how many you want to reduce your dataset to.
Now a bit about the math 👇
What happens is that PCA rotates your dataset so that the first dimension (the first principal component) captures the highest variability in the data, the second dimension the second-highest, and so on.
The reduction then happens by simply dropping the last dimensions.
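A small NumPy sketch of that rotation (the data here is synthetic; in practice scikit-learn's PCA does this for you):

```python
import numpy as np

# Synthetic 3D data with most of its variance along one direction
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.2, 0.1, 0.1]])

# Center the data
Xc = X - X.mean(axis=0)

# The principal components are the eigenvectors of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
components = eigvecs[:, order]

# "Rotate" the dataset onto the principal components
X_rotated = Xc @ components

# Variance is now concentrated in the first columns;
# dropping the last columns is the dimensionality reduction
print(X_rotated.var(axis=0))
```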
👇
Now, go back to the visualization and play with the 2D example.
Move the points on the left around and observe how the PCA plot changes. Put all the points on a diagonal line (highly correlated) and you will see that after PCA the second dimension almost disappears.
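If you prefer code over dragging points, here is a sketch of the same effect with synthetic, highly correlated 2D points:

```python
import numpy as np
from sklearn.decomposition import PCA

# Points almost exactly on a diagonal line: x and y are highly correlated
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
y = x + rng.normal(scale=0.1, size=200)   # tiny noise off the diagonal
X = np.column_stack([x, y])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
# The first component explains ~99.9% of the variance;
# the second dimension almost disappears
```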
For regression problems you can use one of several loss functions:
▪️ MSE
▪️ MAE
▪️ Huber loss
But which one is best? When should you prefer one over the others?
Thread 🧵
Let's first quickly recap what each of the loss functions does. After that, we can compare them and see the differences based on some examples.
👇
Mean Square Error (MSE)
For every sample, MSE takes the difference between the ground truth and the model's prediction and computes its square. Then, the average over all samples is computed.
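In code, that is simply (a NumPy sketch with made-up numbers):

```python
import numpy as np

y_true = np.array([3.0, 2.5, 4.0])   # ground truth
y_pred = np.array([2.5, 2.5, 5.0])   # model predictions

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.25 + 0.0 + 1.0) / 3 ≈ 0.417
```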
The Cross-Entropy Loss function is one of the most used losses for classification problems. It tells us how well a machine learning model classifies a dataset compared to the ground truth labels.
The Binary Cross-Entropy Loss is a special case when we have only 2 classes.
👇
The most important part to understand is this one - this is the core of the whole formula!
Here, Y denotes the ground-truth label, while Ŷ is the predicted probability of the classifier.
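For reference, the binary cross-entropy averages -[Y * log(Ŷ) + (1 - Y) * log(1 - Ŷ)] over all samples, and the Y * log(Ŷ) part is exactly the core term described above. A small NumPy sketch with made-up labels and probabilities:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])           # ground-truth labels Y
y_prob = np.array([0.9, 0.2, 0.6, 0.3])   # predicted probabilities Y-hat

bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(bce)  # ≈ 0.51; lower means the predictions match the labels better
```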
Let's look at a simple example before we talk about the logarithm... 👇
When machine learning met crypto art... they fell in love ❤️
The Decentralized Autonomous Artist (DAA) is a concept that is uniquely enabled by these technologies.
Meet my favorite DAA - Botto.
Let me tell you how it works 👇
Botto uses a popular technique to create images - VQGAN+CLIP
In simple terms, it uses an image-generating neural network (VQGAN) guided by the powerful CLIP model, which can relate images to text.
This method can create stunning visuals from a simple text prompt!
👇
Creating amazing images, though, requires finding the right text prompt.
Botto is programmed by its creator - artist Mario Klingemann (@quasimondo), but it creates all art itself. There is no human intervention in the creation of the images!