Effective Pandas🐼 tip [4]:

When you start to work on a real dataset with more data (millions of records) and want to run a transformation on the data, what should you do?

Let me tell you how to make your execution more than 19000 times faster!!
🤯🤯🤯

[1 effective min]

1/7🧡
According to the documentation, the way to do that is with the apply method.

It takes a function that is applied to the data, row by row or column by column.

Let's try a basic operation: col2 - col1
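As a minimal sketch (with a tiny made-up frame standing in for the real dataset), the apply version looks like this:

```python
import pandas as pd

# Hypothetical small frame standing in for the 25-million-row dataset
df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [10.0, 20.0, 30.0]})

# apply with axis=1 calls the function once per row -- a Python-level loop
df["diff"] = df.apply(lambda row: row["col2"] - row["col1"], axis=1)
print(df["diff"].tolist())  # [9.0, 18.0, 27.0]
```

Each row triggers a separate Python function call, which is where the slowness comes from.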

2/7🧡
Using that on a dataset with 25 million rows, it took 11 minutes! 🐌🐌🐌

Additionally, it uses a lot of memory! On Kaggle Kernels, it used almost all of the 16 GB of memory available during processing!

Can we do it faster?πŸ€”

3/7🧡
YES!πŸ‘πŸΎπŸ€©

Instead of thinking loop-wise, let's think vector-wise!

When we apply operations as vector transformations, they can be heavily parallelized, and that has a HUGE impact on performance!

But how can we do that?

4/7🧡
Numpy is known for its vector execution style.

Pandas🐼 has a handy property on every column that gives us a NumPy array with its values:

-> df['column'].values

5/7🧡
With the NumPy arrays we can just call our function directly!

The execution on the same 25 million rows took: 34 ms
YES, 34 MILLISECONDS!
🤯🤯🤯

That's 19000 times faster!!!⚑️
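Here's a sketch of the vectorized version on the same toy frame (the function name is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [10.0, 20.0, 30.0]})

def subtract(a, b):
    return b - a

# .values hands us the underlying NumPy arrays, so the whole
# subtraction runs as one C-level operation instead of a Python loop
df["diff"] = subtract(df["col1"].values, df["col2"].values)
print(df["diff"].tolist())  # [9.0, 18.0, 27.0]
```

Same result, but NumPy does the element-wise work in compiled code, which is where the speedup comes from.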

6/7🧡
Knowing your daily tools better saves you time!
It also helps you keep your flow.

Use vectorized operations whenever you can for better performance!

Don't forget to:
-> follow me (@gusthema) for daily ML, Python🐍 tips!
-> share with your friends to help them save time!

7/7🧡

β€’ β€’ β€’

Missing some Tweet in this thread? You can try to force a refresh
γ€€

Keep Current with Luiz GUStavo πŸ’‰πŸ’‰πŸŽ‰

Luiz GUStavo πŸ’‰πŸ’‰πŸŽ‰ Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @gusthema

11 Nov
Did you know that you can apply styles to your Pandas visualization?

Let's take a brief look at it πŸ‘€

[1 min]
1/8🧡
Now that you have loaded the data, it's very important to understand it.

To help with that, it's good to be able to read it properly, and formatting the data definitely helps!

Let's come back to the New York Taxi fare dataset

2/8🧡
The fare amount is money.

To format a financial value in Python, we would use the string format "${:,.2f}"

Pandas has a style object and a very similar format method:

3/8🧡
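A minimal sketch of money formatting, assuming the standard "${:,.2f}" format spec and a hypothetical fare_amount column:

```python
import pandas as pd

# Hypothetical fares standing in for the NYC taxi data
df = pd.DataFrame({"fare_amount": [4.5, 16.9, 1234.5]})

# Plain-Python money formatting applied per value
formatted = df["fare_amount"].map("${:,.2f}".format)
print(formatted.tolist())  # ['$4.50', '$16.90', '$1,234.50']

# The Styler equivalent, which renders nicely in notebooks (needs jinja2):
# df.style.format({"fare_amount": "${:,.2f}"})
```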
6 Nov
How can we change a 3-minute load time into 1 second?
⚑️⚑️⚑️🀯

As a Pandas🐼 user, the read_csv method might be very dear πŸ’•to you.
But even with a lot of tuning, it will still be slow.

Let's make it faster!!!

[1 ⚑️ min]

1/7🧡
As an ML developer or Data Scientist, [re]loading data is something you do many, many times a day!

Having long loading times makes experimentation annoying, as every time you reload, you "pay" the time tax

2/7🧡
One trick to make loading faster is to use a faster file format!

Let's try the Feather file format.

It is a portable file format that uses the Arrow IPC format: arrow.apache.org/docs/python/ip…

3/7🧡
5 Nov
Imagine you need to load a very large (eg: 5.7GB) csv file to train your model!πŸ€”

This is a very common problem in real world situations and also in many Kaggle competitions!

How can we use Pandas 🐼 effectively to do that?

Let's dive in…

[2 effective min]

1/10🧡
We will use the New York City Taxi Fare Prediction dataset from Kaggle

The csv file is 5.7 GB!!! 😱

Let's try the most obvious thing, just loading it:

df = pd.read_csv("./new-york-city-taxi-fare-prediction/train.csv")

This won't load on Kaggle Kernels!
2/10🧡
That's a bummerβ€¦πŸ˜­

How do I even get to see which columns are in the file?

We can start by loading only some rows (eg: 5) and get some insights.πŸ”

This can give some good information already

3/10🧡
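The peek-at-a-few-rows trick uses read_csv's nrows parameter. A minimal sketch, with an in-memory stand-in for the huge file (column names are made up):

```python
import io

import pandas as pd

# Tiny stand-in for the 5.7 GB train.csv
csv_data = io.StringIO(
    "fare_amount,passenger_count\n"
    "4.5,1\n16.9,2\n7.0,1\n52.0,4\n6.3,1\n9.9,3\n"
)

# Load only the first 5 rows instead of the whole file
preview = pd.read_csv(csv_data, nrows=5)
print(preview.columns.tolist())  # ['fare_amount', 'passenger_count']
print(len(preview))  # 5
```

This is enough to see the column names, dtypes, and a sample of values without blowing up memory.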
4 Nov
Everyone who does some Data Analysis or Machine Learning knows the Pandas library 🐼

One thing that not everyone is aware of is how to use it efficiently!

Have you thought about how much memory your dataframe is using? πŸ€”

How to use less? πŸ—œοΈ

Let me show you…

[2 min]

1/8🧡
Let's start by loading a csv file.

The example I'll use is the train.csv file from the Kaggle 30 days of ML competition

kaggle.com/c/30-days-of-m…

It's a good example to start

2/8
Loading a csv file in Pandas is very simple:

import pandas as pd
some_data = pd.read_csv("./train.csv")

But how much memory is it using?

3/8
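One way to answer that is memory_usage. A minimal sketch with a made-up frame, also showing the usual fix of downcasting dtypes:

```python
import pandas as pd

# Hypothetical frame standing in for train.csv
df = pd.DataFrame({"id": range(1000), "target": [0.5] * 1000})

# Total bytes used; deep=True also counts object (string) columns properly
total = df.memory_usage(deep=True).sum()
print(total)

# Downcasting to smaller dtypes is one common way to shrink it:
# int64 -> int32 and float64 -> float32 halve those columns
small = df.astype({"id": "int32", "target": "float32"})
print(small.memory_usage(deep=True).sum())
```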
24 Oct
This week I posted about the coding interview!!!😱😭🤓

Here's a summary if you missed it

[30 sec]

1/🧡
To start, you'll need some tips on how to succeed!👏🏾👏🏾



2/🧡
To follow the tips and succeed you'll need good resources to study and practice!πŸ“šπŸ‘“

I've got you covered:

3/🧡
22 Oct
Following up on the mock interview we did earlier this week, let me summarize all the topics discussed in the answers

Before starting, thanks to everyone for participating, it was great!

[1.5 min]

🧡
1⃣- Many good answers were working solutions, but not the fastest ones.
That's ok, but of course an interviewer might follow up: can you make it faster?

Tip: would a better data structure help you?
2⃣- A working solution is better than no solution!

Sometimes we want to optimize the code, but usually you only have 1 hour to finish! Be smart: have a working solution and explain how you'd make it perfect

With practice, your good solutions will become the perfect solution by default!
