Today, let's talk about two key data transformations we constantly use in machine learning:
▫️ Label encoding
▫️ One-hot encoding
But instead of just defining them, let's build some intuition about why they matter.
Grab a coffee, and let's start! ☕️🧵👇
Imagine we have a dataset with two features:
▫️ "temperature" — a numeric value.
▫️ "weather" — a string value.
You should feel uncomfortable with this dataset right off the bat: machine learning algorithms usually don't like to work with non-numerical data.
[2 / 15]
To set the record straight, some algorithms don't mind non-numerical data.
For example, certain Decision Tree implementations will be fine with the "weather" feature from our example.
But a lot of them can only work with numbers.
[3 / 15]
Let's look closely at the weather feature. You'll notice there are only three possible values: sunny, overcast, and rainy.
We call features like this "categorical features." Basically, they are features that can only take one of a limited set of possible values.
[4 / 15]
Alright, so we want to change this feature, and we know there are only three possible values that it can take.
Let's do the obvious thing: replace each string value with a number.
Now the whole dataset is numeric, so algorithms shouldn't complain.
[5 / 15]
Converting a categorical feature to a numerical one by replacing each category with an integer is called "label encoding."
Here is a Python 🐍 example that encodes our "weather" feature.
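A minimal sketch of label encoding in plain Python. The sample `weather` values are made up for illustration, and `classes` and `class_to_code` are hypothetical names, not part of any library:

```python
# Made-up sample values for the "weather" feature (assumption for illustration).
weather = ["sunny", "overcast", "rainy", "sunny", "rainy"]

# Sort the unique classes alphabetically and give each a consecutive integer.
classes = sorted(set(weather))
class_to_code = {name: code for code, name in enumerate(classes)}

# Replace every string with its integer code.
encoded = [class_to_code[value] for value in weather]

print(class_to_code)  # {'overcast': 0, 'rainy': 1, 'sunny': 2}
print(encoded)        # [2, 0, 1, 2, 1]
```

scikit-learn's `LabelEncoder` should give the same mapping for this input, since it also sorts the classes before numbering them.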
[6 / 15]
Why that specific order? Why is "overcast" the value 0, "rainy" the value 1, and "sunny" the value 2?
Label encoding is usually done by assigning consecutive values to an alphabetically sorted list of classes.
[7 / 15]
Alright, so now that every feature is numeric, we should be good to go, right?
Well, maybe.
It turns out that a lot of machine learning algorithms are excellent at extracting subtle relationships from the data.
[8 / 15]
For example, if we are dealing with ratings, a 4-star movie is twice as good as a 2-star one.
Or, in the case of the temperature feature in our example, there's a clear relationship between the values: 80 is warmer than 79, and 75 is the coldest one.
[9 / 15]
Algorithms could pick up these relationships and exploit them!
What does this mean for our weather feature? Is "sunny" (value 2) twice as good as "rainy" (value 1)?
Of course not. We don't want the algorithm to "make up" a nonexistent relationship.
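To make the problem concrete, here is a small sketch of what an algorithm that compares values numerically would "see" after label encoding. The `gap` helper is hypothetical, introduced only for this illustration:

```python
# Integer codes produced by label encoding (alphabetical order).
codes = {"overcast": 0, "rainy": 1, "sunny": 2}

def gap(a, b):
    """Absolute difference between the codes of two weather values."""
    return abs(codes[a] - codes[b])

print(gap("overcast", "rainy"))  # 1
print(gap("rainy", "sunny"))     # 1
print(gap("overcast", "sunny"))  # 2
```

Nothing about the weather says "overcast" is twice as far from "sunny" as "rainy" is. That gap comes purely from alphabetical order.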
[10 / 15]
If we don't want this to happen, we can't label encode the weather feature.
But there's something else we can do: we can one-hot encode it.
If that name seems gibberish to you, you aren't alone. Let's see how this works.
[11 / 15]
We know there are three possible classes for our weather feature. One-hot encoding turns that feature into three new binary features, one per class: each row gets a 1 in the column for its own class and a 0 in the other two.
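A minimal sketch of that transformation in plain Python, again with made-up sample values. The column order (alphabetical) is an assumption chosen to match the label encoding earlier in the thread:

```python
# Made-up sample values for the "weather" feature (assumption for illustration).
weather = ["sunny", "overcast", "rainy", "sunny"]

# One new binary column per class, in alphabetical order.
classes = sorted(set(weather))  # ['overcast', 'rainy', 'sunny']

# Each row has a 1 in the column matching its class and 0 everywhere else.
one_hot = [[1 if value == name else 0 for name in classes] for value in weather]

for value, row in zip(weather, one_hot):
    print(value, row)
# sunny [0, 0, 1]
# overcast [1, 0, 0]
# rainy [0, 1, 0]
# sunny [0, 0, 1]
```

No class is numerically "bigger" than another here: every pair of classes differs in exactly two columns. In practice, `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would typically do this for you.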
There is a price to pay for increasing the dimensionality of a dataset. Keep that in mind when deciding whether one-hot encoding is the right choice.
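A rough sketch of how quickly that price grows. The `one_hot_width` helper and the zip-code-like feature are hypothetical, made up for this illustration:

```python
def one_hot_width(values):
    """Number of columns one-hot encoding creates: one per distinct value."""
    return len(set(values))

weather = ["sunny", "overcast", "rainy", "sunny"]

# Hypothetical high-cardinality feature: 10,000 distinct zip-code-like strings.
zip_codes = [f"{n:05d}" for n in range(10_000)]

print(one_hot_width(weather))    # 3
print(one_hot_width(zip_codes))  # 10000
```

Three extra columns are harmless. Ten thousand mostly-zero columns from a single feature are a different story.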
Imagine you have a ton of data, but most of it isn't labeled. Even worse: labeling is very expensive. 😑
How can we get past this problem?
Let's talk about a different—and pretty cool—way to train a machine learning model.
☕️👇
Let's say we want to classify videos in terms of maturity level. We have millions of them, but only a few have labels.
Labeling a video takes a long time (you have to watch it in full!). We also don't know how many labeled videos we need to build a good model.
[2 / 9]
In a traditional supervised approach, we don't have a choice: we need to spend the time and come up with a large dataset of labeled videos to train our model.
But this isn't always an option.
In some cases, this may be the end of the project. 😟
25 popular libraries and frameworks for building machine and deep learning applications.
Covering:
▫️ Data analysis and processing
▫️ Visualizations
▫️ Computer Vision
▫️ Natural Language Processing
▫️ Reinforcement Learning
▫️ Optimization
A mega-thread.
🐍 🧵👇
(1 / 25) TensorFlow
TensorFlow is an end-to-end platform for machine learning. It has a comprehensive, flexible ecosystem of tools and libraries to build and deploy machine learning-powered applications.
(2 / 25) Keras
Keras is a highly productive deep learning interface running on top of TensorFlow. It provides essential abstractions and building blocks for developing and shipping machine learning solutions with high iteration velocity.