Santiago
20 Apr, 15 tweets, 4 min read
Yesterday, @PrasoonPratham posted a step-by-step guide to solve the Titanic challenge on Kaggle.

I thought it'd be fun to engineer some features that can help build an even better model.

Here are some ideas worth considering.

↓ 1/13
Attached you can find the original set of input variables that come with the data to solve the problem.

We are going to transform some of these into features that should help our model produce better results.

This is what Feature Engineering is all about.

↓ 2/13
Keep in mind that these are just hypotheses that you'll have to try and validate.

Some of these suggestions might not improve the results or could even make the model perform worse.

This is an exercise to try and think creatively about the data we are getting.

↓ 3/13
1. Let's start with "Age."

It makes sense for a passenger's age to influence whether they survived the sinking.

Young children and elderly passengers probably had different chances than everyone else.

But there's really no meaningful difference between a 25-year-old and a 27-year-old.

↓ 4/13
We can consolidate the information we want our model to learn by turning "Age" into a one-hot encoded feature with three possible values:

• [1, 0, 0]: Anyone younger than 10.
• [0, 1, 0]: Anyone between 10 and 64.
• [0, 0, 1]: Anyone older than 64.

↓ 5/13
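A minimal sketch of this encoding with pandas. It assumes the column is named "Age" as in Kaggle's train.csv, and the output column names (age_child, age_adult, age_senior) are made up for illustration. A passenger who is exactly 10 lands in the middle group here, and a missing age becomes [0, 0, 0]; both are choices of this sketch, not something the thread prescribes.

```python
import pandas as pd


def add_age_group(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three 0/1 columns derived from Age."""
    out = df.copy()
    age = out["Age"]
    # Comparisons against NaN are False, so a missing age yields 0 in all three.
    out["age_child"] = (age < 10).astype(int)
    out["age_adult"] = ((age >= 10) & (age <= 64)).astype(int)
    out["age_senior"] = (age > 64).astype(int)
    return out


# Toy rows, made up for illustration only.
toy = pd.DataFrame({"Age": [4, 10, 27, 64, 70, None]})
print(add_age_group(toy))
```

Whether these cut-offs help is exactly the kind of hypothesis to validate against a baseline that uses the raw age.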
2. We are given a "ticket" variable that represents the ticket number of the passenger.

I don't see how this feature would influence the model's outcome. We can drop it.

What you don't feed to your model is as important as what you give to it.

↓ 6/13
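Dropping the column is a one-liner in pandas. This sketch assumes the column is named "Ticket" as in Kaggle's train.csv; the rows are invented.

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({"Ticket": ["A/5 21171", "113803"], "Fare": [7.25, 53.10]})

# drop() returns a new frame; errors="ignore" makes it safe to re-run
# even if the column has already been removed.
features = toy.drop(columns=["Ticket"], errors="ignore")
print(list(features.columns))
```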
3. Let's work now with the "cabin" number.

The actual number here is not important. We mostly care about whether or not the passenger had a cabin.

Where you were on the ship probably affected whether you had time to save yourself.

↓ 7/13
We also know that affluent people had a better chance of surviving, and having a cabin is definitely correlated with that.

We can transform "cabin" into a 0/1 feature.

↓ 8/13
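One possible way to build that 0/1 feature, assuming the column is named "Cabin" and that passengers without a recorded cabin show up as missing values after loading the CSV. The name has_cabin is hypothetical.

```python
import pandas as pd


def add_has_cabin(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 0/1 flag for whether a cabin is recorded."""
    out = df.copy()
    # notna() is True wherever a cabin string is present.
    out["has_cabin"] = out["Cabin"].notna().astype(int)
    return out


# Toy rows, made up for illustration only.
toy = pd.DataFrame({"Cabin": ["C85", None, "E46", None]})
print(add_has_cabin(toy))
```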
Another way to process the "cabin" variable is by extracting the floor (deck) of the ship where the cabin was located. (Assuming it's provided.)

I'd guess that people on the top floors had a different chance of survival than those on the bottom floors.

↓ 9/13
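A rough sketch of the floor extraction. It assumes cabin values start with a letter identifying the floor (for example "C85"), which should be checked against the actual data first. The column name deck and the "unknown" placeholder are choices made for this example.

```python
import pandas as pd


def add_deck(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the first character of Cabin as a category."""
    out = df.copy()
    # .str[0] takes the leading character; missing cabins stay missing
    # and are then replaced with an explicit "unknown" label.
    out["deck"] = out["Cabin"].str[0].fillna("unknown")
    return out


# Toy rows, made up for illustration only.
toy = pd.DataFrame({"Cabin": ["C85", None, "E46", "B57 B59"]})
print(add_deck(toy))
```

The resulting column is categorical, so it would still need to be encoded (one-hot, for instance) before most models can use it.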
4. The "sibsp" and "parch" variables tell us the number of siblings plus spouses and parents plus children, respectively.

Does this matter?

Here is a theory: people with family on the ship might have had a different chance of survival than those traveling alone.

↓ 10/13
Maybe some people couldn't save themselves because they had to protect others.

Maybe some people survived because others protected them.

Either way, we could combine these two variables into a 0/1 feature indicating whether the passenger traveled alone.

↓ 11/13
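Combining the two counts could look like this. The sketch assumes the columns are named "SibSp" and "Parch" as in Kaggle's train.csv; is_alone is a made-up name.

```python
import pandas as pd


def add_is_alone(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 0/1 flag for passengers with no family aboard."""
    out = df.copy()
    # A passenger traveled alone when both family counts are zero.
    family_size = out["SibSp"] + out["Parch"]
    out["is_alone"] = (family_size == 0).astype(int)
    return out


# Toy rows, made up for illustration only.
toy = pd.DataFrame({"SibSp": [1, 0, 0], "Parch": [0, 0, 2]})
print(add_is_alone(toy))
```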
You can probably take it from here.

The Titanic challenge is a great exercise to explore different hypotheses and practice feature engineering.

↓ 12/13
Two more things:

First, be careful not to introduce biases while engineering features. Use the data you have as much as you can to drive these decisions.

Second, Pratham's thread walks through the steps to solve the challenge.

13/13
If you found this thread helpful, follow me @svpino for weekly posts touching on machine learning and how to use it to build real-life systems.

It’s all about building value, and that’s way more fun if we do it together!
That's a good insight. Probably a good idea to explore the following and see if either of them (or both) works:

• has_a_cabin: 0 | 1
• side_of_ship: [0, 0] | [1, 0] | [0, 1]
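A sketch of both features under a strong assumption: the data has no explicit "side" column, so this example derives it from the parity of the cabin number (odd means one side, even means the other). That parity rule is hypothetical and only here to show the encoding; it needs to be verified before relying on it. The column names side_a and side_b are also made up.

```python
import pandas as pd


def add_cabin_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with has_a_cabin and a two-column side encoding."""
    out = df.copy()
    out["has_a_cabin"] = out["Cabin"].notna().astype(int)

    # First run of digits in the cabin string, e.g. "C85" -> 85.
    # Rows without a cabin (or without digits) end up as NaN.
    digits = out["Cabin"].str.extract(r"(\d+)", expand=False)
    number = pd.to_numeric(digits, errors="coerce")

    # HYPOTHETICAL rule: odd number -> side A, even number -> side B.
    # NaN compares as False, so unknown cabins are encoded as [0, 0].
    out["side_a"] = (number % 2 == 1).astype(int)
    out["side_b"] = (number % 2 == 0).astype(int)
    return out


# Toy rows, made up for illustration only.
toy = pd.DataFrame({"Cabin": ["C85", None, "E46"]})
print(add_cabin_features(toy))
```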
