Santiago Profile picture
25 Feb, 17 tweets, 5 min read
Today, let's talk about two key data transformations we constantly use in machine learning:

▫️ Label encoding
▫️ One-hot encoding

But instead of just defining them, let's try to build some intuition about why they are important.

Grab a coffee, and let's start! ☕️🧵👇
Imagine we have a dataset with two features:

▫️ "temperature" — a numeric value.
▫️ "weather" — a string value.

You should feel uncomfortable with this dataset right off the bat: machine learning algorithms usually don't like to work with non-numerical data.

[2 / 15]
To set the record straight, some algorithms don't mind non-numerical data.

For example, certain Decision Tree implementations will be fine with the "weather" feature from our example.

But a lot of them can only work with numbers.

[3 / 15]
Let's look closely at the weather feature. You'll notice there are only three possible values: sunny, overcast, and rainy.

We call these "categorical features." Basically, they are features that can only take a limited number of possible values.

[4 / 15]
Alright, so we want to change this feature, and we know there are only three possible values that it can take.

Let's do the obvious thing: replace each string value with a number.

Now the whole dataset is numeric, so algorithms shouldn't complain.

[5 / 15]
The process of converting a categorical feature to a numerical feature is called "label encoding."

Here is a Python 🐍 example that encodes our "weather" feature.
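The original code screenshot isn't preserved in this unroll; a minimal sketch using scikit-learn's `LabelEncoder` (with the dataset values assumed from the thread's example) could look like this:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical dataset matching the thread's example.
df = pd.DataFrame({
    "temperature": [80, 79, 75],
    "weather": ["sunny", "rainy", "overcast"],
})

# Replace each string value with an integer label.
encoder = LabelEncoder()
df["weather"] = encoder.fit_transform(df["weather"])

print(df)
print(encoder.classes_)  # classes, sorted alphabetically
```

`LabelEncoder` sorts the classes alphabetically before assigning labels, which is exactly the ordering discussed in the next tweet.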

[6 / 15]
Why that specific order? Why is "overcast" the value 0, "rainy" the value 1, and "sunny" the value 2?

Label encoding is usually done by assigning consecutive values to an alphabetically sorted list of classes.

[7 / 15]
Alright, so now that every feature is numeric, we should be good to go, right?

Well, maybe.

It turns out that many machine learning algorithms are excellent at extracting subtle relationships from the data.

[8 / 15]
For example, if we are dealing with ratings, a 4-star movie is twice as good as a 2-star one.

Or, in the case of the temperature feature in our example, there's a clear relationship between the values: 80 is warmer than 79, and 75 is the coldest one.

[9 / 15]
Algorithms could pick up these relationships and exploit them!

What does this mean for our weather feature? Is "sunny" (value 2) twice as good as "rainy" (value 1)?

Of course not. We don't want the algorithm to "make up" a nonexistent relationship.

[10 / 15]
If we don't want this to happen, we can't label encode the weather feature.

But there's something else we can do: we can one-hot encode it.

If that name seems gibberish to you, you aren't alone. Let's see how this works.

[11 / 15]
We know there are three possible classes for our weather feature. One-hot encoding will turn that feature into three new binary features:

▫️ weather_overcast
▫️ weather_rainy
▫️ weather_sunny

Each column will hold a 0 or a 1, and exactly one of them will be 1 in each row (hence the name "one-hot").

[12 / 15]
A different way to look at it is through the mapping below.

We turned our weather feature into a vector where the position of the 1 indicates the class:

▫️ overcast: [1, 0, 0]
▫️ rainy: [0, 1, 0]
▫️ sunny: [0, 0, 1]
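As a quick sketch, scikit-learn's `OneHotEncoder` produces exactly these vectors (the class names are assumed from the thread's example):

```python
from sklearn.preprocessing import OneHotEncoder

classes = [["overcast"], ["rainy"], ["sunny"]]

# fit_transform returns a sparse matrix; toarray() makes the vectors visible.
encoder = OneHotEncoder()
vectors = encoder.fit_transform(classes).toarray()

# Each row is the one-hot vector for the corresponding class.
print(vectors)
```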

[13 / 15]
The advantage of this method is that we don't inadvertently introduce a relation between different values.

Our problems are no more! 😎

Here is the Python 🐍 code to get our new table with the new three weather features.
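The code screenshot isn't preserved in this unroll; one common way to do this, using pandas' `get_dummies` (dataset values assumed from the thread's example), is:

```python
import pandas as pd

# Hypothetical dataset matching the thread's example.
df = pd.DataFrame({
    "temperature": [80, 79, 75],
    "weather": ["sunny", "rainy", "overcast"],
})

# One new binary column per class, named weather_<class>.
encoded = pd.get_dummies(df, columns=["weather"])

print(encoded)
```

The original "weather" column is dropped and replaced by the three new binary columns.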

[14 / 15]
Here is the deal:

Writing these threads takes a lot of time, but it's all worth it if you like/retweet them so others can benefit as well.

And while you are at it, follow me for more threads exploring the little things about machine learning.

Love ya! ✌️

[15 / 15]

Great question!

For high cardinality, I'd look into hashing encoding as a strong candidate.

There are many other options as well:

- Frequency Encoding
- Mean Encoding
- Weight of Evidence Encoding
- Probability Ratio Encoding
...
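As one example from that list, frequency encoding can be sketched in a few lines of pandas (the values here are assumed for illustration):

```python
import pandas as pd

weather = pd.Series(["sunny", "rainy", "sunny", "overcast"])

# Replace each category with its relative frequency in the column.
frequencies = weather.value_counts(normalize=True)
encoded = weather.map(frequencies)

print(encoded)
```

Unlike one-hot encoding, this produces a single numeric column no matter how many categories there are, which is why it scales to high-cardinality features.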

Too much of anything is unhealthy 😎

There is a price to pay for increasing the dimensionality of a dataset. You need to keep that in mind when deciding whether one-hot encoding is the right choice.


