Imagine you need to load a very large (e.g., 5.7 GB) CSV file to train your model! 🤔

This is a very common problem in real-world situations and in many Kaggle competitions!

How can we use Pandas 🐼 effectively to do that?

Let's dive in…

[2 effective min]

1/10🧵
We will use the New York City Taxi Fare Prediction dataset from Kaggle

The CSV file is 5.7 GB!!! 😱

Let's try the most obvious thing, just loading it:

import pandas as pd
df = pd.read_csv("./new-york-city-taxi-fare-prediction/train.csv")

This won't load on Kaggle Kernels: the kernel runs out of memory!
2/10🧵
That's a bummer… 😭

How do I even get to see which columns are in the file?

We can start by loading only a few rows (e.g., 5) to get some insights. 🔍

This already gives us useful information, like the column names and their types
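
A minimal sketch of that peek, using read_csv's nrows parameter. The tiny in-memory CSV is a made-up stand-in so the snippet runs anywhere, and apart from 'pickup_datetime' its column names are assumptions about the dataset; with the real file you would pass the train.csv path instead.

```python
import io

import pandas as pd

# Made-up stand-in for train.csv (8 data rows) so the sketch is self-contained.
sample_csv = io.StringIO(
    "key,fare_amount,pickup_datetime,passenger_count\n"
    + "".join(f"k{i},{4.5 + i},2012-01-0{i + 1} 10:00:00 UTC,1\n" for i in range(8))
)

# nrows caps how many data rows are parsed, so this stays cheap
# even when the file on disk is several GB.
preview = pd.read_csv(sample_csv, nrows=5)

print(preview.columns.tolist())  # which columns are in the file
print(preview.dtypes)            # the types Pandas inferred for them
```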

3/10🧵
But wait!

What if we load the whole dataset, but in batches?
Is that a thing?

Yes! read_csv has a chunksize parameter: instead of one big DataFrame, it returns an iterator over DataFrames with that many rows each
We can later concatenate all the chunks with pd.concat
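
Here is a small sketch of the chunksize + concat pattern. The 10-row CSV and the chunk size of 4 are toy values picked only to make the batching visible.

```python
import io

import pandas as pd

# Toy CSV with 10 data rows.
csv_text = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(10))

# With chunksize set, read_csv yields DataFrames of up to 4 rows each
# instead of parsing the whole file at once.
sizes = []
parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    sizes.append(len(chunk))
    parts.append(chunk)

# Stitch the chunks back into a single DataFrame with a fresh index.
df = pd.concat(parts, ignore_index=True)
print(sizes, len(df))
```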

But which chunk size should we use? How many rows does the dataset have?

4/10🧵
A fast way to find out how many lines are in the file is the Unix wc command!

It will take ~6 seconds!

And now we know that there are 55 million rows in the file!
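
From a shell this is probably something like `wc -l train.csv` (the exact invocation is an assumption on my side). If you'd rather stay in Python, here is a rough equivalent that counts newline characters without loading the file into memory. The count_lines helper and its block size are hypothetical, not from the original thread.

```python
import os
import tempfile


def count_lines(path, block_size=1024 * 1024):
    """Count newline characters by reading the file in fixed-size binary blocks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += block.count(b"\n")
    return total


# Demo on a small temporary file: 1 header line + 3 data rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n5,6\n")
    tmp_path = tmp.name

n_lines = count_lines(tmp_path)
n_rows = n_lines - 1  # subtract the header line
os.remove(tmp_path)
print(n_lines, n_rows)
```

Remember that the line count includes the header, so the number of data rows is one less.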

5/10🧵
With this information, we can make an educated guess. Let's try a chunk size of 5 million rows

To help with the memory pressure, let's convert the 'pickup_datetime' column to a datetime dtype (instead of the generic object dtype)

👉🏾 Without this, it still won't load! 👀
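
A sketch of how the chunked load with the datetime conversion could look. The load_in_chunks helper is hypothetical, and the timestamp layout ("YYYY-MM-DD HH:MM:SS UTC") is my assumption about the file, so adjust the format string if yours differs. The demo uses 7 fake rows and a chunk size of 3; for the real file you would pass the train.csv path and chunksize=5_000_000.

```python
import io

import pandas as pd


def load_in_chunks(source, chunksize, datetime_col="pickup_datetime"):
    """Read a CSV in chunks, convert the datetime column per chunk, then concat."""
    parts = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Parse the strings into real timestamps while the chunk is still small.
        # Assumed layout: "2009-06-15 17:26:21 UTC".
        chunk[datetime_col] = pd.to_datetime(
            chunk[datetime_col], format="%Y-%m-%d %H:%M:%S UTC", utc=True
        )
        parts.append(chunk)
    return pd.concat(parts, ignore_index=True)


# Fake data standing in for train.csv.
csv_text = "fare_amount,pickup_datetime\n" + "".join(
    f"{4.5 + i},2009-06-{10 + i} 17:26:21 UTC\n" for i in range(7)
)
df = load_in_chunks(io.StringIO(csv_text), chunksize=3)
print(df.dtypes)
```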

6/10🧵
We managed to load it!! 🥳🎉
But it's using more than 3 GB of memory and it took over 3 min!
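
If you want to check the memory number yourself, one way (a sketch, not necessarily how the figure above was measured) is DataFrame.memory_usage. The small DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: one float64 column and one string column, 1000 rows each.
df = pd.DataFrame(
    {
        "fare_amount": np.zeros(1000, dtype="float64"),
        "key": ["some-long-row-key"] * 1000,
    }
)

# Per-column bytes; float64 costs 8 bytes per value.
per_column = df.memory_usage(index=False)

# deep=True also counts the Python string objects behind object columns,
# which the default (shallow) number can leave out.
shallow_total = df.memory_usage().sum()
deep_total = df.memory_usage(deep=True).sum()

print(per_column)
print(f"shallow: {shallow_total} bytes, deep: {deep_total} bytes")
```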

Can we do better?

Yes!!!

Using what I posted yesterday:

7/10🧵
Let's convert all those float64 to float32, too

read_csv has parameters (such as dtype) that do the conversion while loading!
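
A small sketch of the dtype parameter in action. The column names other than 'pickup_datetime' are assumptions about the dataset and the three rows are made up; the point is only to compare the default float64 load against a float32 one.

```python
import io

import pandas as pd

# Made-up rows; the column names are assumed, not taken from the thread.
csv_text = (
    "fare_amount,pickup_longitude,pickup_latitude,passenger_count\n"
    "4.5,-73.844311,40.721319,1\n"
    "16.9,-74.016048,40.711303,1\n"
    "5.7,-73.982738,40.761270,2\n"
)

# Ask read_csv to parse the float columns straight into float32.
float32_cols = {
    "fare_amount": "float32",
    "pickup_longitude": "float32",
    "pickup_latitude": "float32",
}

default_df = pd.read_csv(io.StringIO(csv_text))
small_df = pd.read_csv(io.StringIO(csv_text), dtype=float32_cols)

default_bytes = default_df.memory_usage(index=False).sum()
small_bytes = small_df.memory_usage(index=False).sum()
print(default_bytes, small_bytes)
```

Keep in mind that float32 only holds about 7 significant decimal digits, so it's worth checking that this precision is enough for your columns.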

With this, the results are way better!!! 🥳🎉🍾

-> Memory: 3.3+ GB -> 1.5 GB 🗜️
-> Time: 3:10 min -> 2:50 min ⚡️

8/10🧵
I learned many of these tricks from this great Kaggle Kernel: kaggle.com/szelee/how-to-…

Go give it a +1, it really deserves one!

9/10🧵
Using Pandas 🐼 effectively is very important!

The tricks in this thread let you load huge amounts of data with just a few small changes!

Don't forget to share this and help your friends!

Also follow me (@gusthema) for daily ML, Python and Career tips!

10/10🧵
If you liked this thread, you don't want to miss the follow-up:
