[Data Analysis] 🧵
Exploratory data analysis is a fundamental step in any analysis work. You don't have to be a data scientist proficient at modeling to be a useful asset to your client. Great EDA alone makes you valuable.
Here's a template of a basic yet powerful EDA workflow👇
EDA is incredibly useful. Proper modeling CANNOT happen without it.
The truth:
Stakeholders NEED it far more than modeling.
EDA empowers the analyst with knowledge about the data, which then informs the #machinelearning pipeline
While #pandas and #matplotlib are key to good EDA in #python, the real difference is the QUESTIONS you ask of your dataset.
As in all things, these tools are just tools. The real weapon is the analyst. You are in control, not the dataset.
This is where domain > technical knowledge. If you know the basics of pandas and plt, and you are a domain expert, YOU ARE DONE.
You become the Sherlock Holmes of the data. You have the power to query it and get your answers.
This is what clients pay for.
Here's the template I follow. This is the skeleton of it, and I always expand it depending on the data.
It revolves around pandas, matplotlib and seaborn. You rarely need anything else.
Here we go.
1. Understand your data
.describe()
.info()
.isna()
.dtypes
.shape
Get an idea of what's in this dataset. How many categorical variables? How many empty/missing values? And so on
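A minimal sketch of this first pass, assuming a toy DataFrame (in practice you'd load your own data with `pd.read_csv`):

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset; swap in your own pd.read_csv(...)
df = pd.DataFrame({
    "price": [10.5, 12.0, np.nan, 9.9, 11.2],
    "category": ["a", "b", "a", "c", "b"],
    "units": [3, 5, 2, 8, 1],
})

print(df.shape)          # (rows, columns)
print(df.dtypes)         # numeric vs categorical columns
print(df.isna().sum())   # missing values per column
print(df.describe())     # summary stats for numeric columns
df.info()                # non-null counts and memory usage
```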
2. Preparation and transformation
- drop useless columns
- rename columns
- handle missing values
- handle duplication
- create new features
This is the first step of the feature engineering process. You'll come back to this whenever you want to add information to your model.
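The prep steps above can be chained in pandas. A minimal sketch with made-up column names (`id`, `Unit Price`, `Qty`):

```python
import pandas as pd
import numpy as np

# Hypothetical raw data: a useless "id" column and messy names
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "Unit Price": [10.0, np.nan, np.nan, 8.0],
    "Qty": [2, 5, 5, 3],
})

df = (
    raw
    .drop(columns=["id"])                          # drop useless columns
    .rename(columns={"Unit Price": "unit_price",   # rename columns
                     "Qty": "qty"})
    .drop_duplicates()                             # handle duplication
)
# Handle missing values (median imputation is one common choice)
df["unit_price"] = df["unit_price"].fillna(df["unit_price"].median())
# Create a new feature
df["revenue"] = df["unit_price"] * df["qty"]
```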
3. Univariate analysis
Iterate through each and every relevant variable and get basic information such as
- .hist()
- .value_counts()
- .skew()
- .kurt()
This is the first step of outlier detection. Here is where you get up close and personal with each variable.
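A sketch of that univariate pass on a synthetic dataset (the `amount` and `channel` columns are made up for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.exponential(scale=100, size=500),   # right-skewed numeric
    "channel": rng.choice(["web", "store", "phone"], size=500),
})

# Categorical: frequency table
print(df["channel"].value_counts())

# Numeric: distribution shape
print("skew:", df["amount"].skew())   # > 0 means a long right tail
print("kurt:", df["amount"].kurt())   # heavy tails hint at outliers

df["amount"].hist(bins=30)
plt.title("amount distribution")
plt.show()
```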
4. Multivariate analysis
Now you use seaborn to do scatterplots and pairplots
The first lets you plot two variables against each other and understand how they move together.
Pairplots do this for all variables, all at the same time.
Play with the hue parameter in sns
This is it. So let me summarize my basic EDA process.
1. Understand the data
2. Preparation & transformation
3. Univariate analysis
4. Multivariate analysis
Expand this as you please. Drop a like and a RT and follow for more #datascience and #machinelearning threads like this 👊
Have you ever used sklearn's pipeline class to enhance your analysis?
While not mandatory, pipelines bring important benefits when implemented in your code base.
Here is a short thread of why you should use pipelines to improve your #machinelearning work 👇
In #datascience and machine learning, a pipeline is a set of sequential steps that allows us to control the flow of data.
They are very useful as they make our code cleaner, more scalable and readable. They are used to organize the various phases of a project.
Implementing pipelines is not mandatory but has significant advantages, such as
- cleaner code
- less room for error
- a familiar interface: the whole pipeline trains like a single model with .fit()
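A minimal sketch of that last point, using a synthetic dataset and a scaler + classifier chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Preprocessing and model chained into one object
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

pipe.fit(X, y)            # one .fit() runs every step in order
acc = pipe.score(X, y)    # training accuracy
print(acc)
```

Because the steps live in one object, there's no way to forget to scale new data before prediction: `pipe.predict` applies every step automatically.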