[Data Analysis] 🧵
Exploratory data analysis (EDA) is a fundamental step in any analysis work. You don't have to be a data scientist or be proficient at modeling to be a valuable asset to your client, as long as you can do great EDA.

Here's a template of a basic yet powerful EDA workflow👇

EDA is incredibly useful. Proper modeling CANNOT happen without it.

The truth:
Stakeholders NEED it far more than modeling.

EDA empowers the analyst with knowledge about the data, which then guides the #machinelearning pipeline.

While #pandas and #matplotlib are key to good EDA in #python, the real difference is in the QUESTIONS you ask of your dataset.

As in all things, these tools are just tools. The real weapon is the analyst. You are in control, not the dataset.

This is where domain knowledge > technical knowledge. If you know the basics of pandas and matplotlib, and you are a domain expert, YOU ARE DONE.
You are literally a Sherlock Holmes of data. You have the power to query it and get your answers.

This is what clients pay for.

Here's the template I follow. This is the skeleton of it, and I always expand it depending on the data.

It revolves around pandas, matplotlib and seaborn. You rarely need anything else.

Here we go.

1. Understand your data

.describe()
.info()
.isna()
.dtypes
.shape

Get an idea of what's in this dataset. How many categorical variables? How many empty/missing values? And so on.
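
A minimal sketch of this first step. The DataFrame below is a made-up toy dataset (the column names and values are assumptions for illustration, not from a real project), so swap in your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, invented for illustration only
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [34, 45, np.nan, 29, 45],
    "city": ["Rome", "Milan", "Rome", None, "Turin"],
    "spend": [120.5, 80.0, 230.1, 15.0, 80.0],
})

print(df.shape)       # (rows, columns)
print(df.dtypes)      # data type of each column
df.info()             # dtypes, non-null counts and memory usage in one view
print(df.describe())  # summary statistics for the numeric columns

# How many empty/missing values per column?
missing_per_column = df.isna().sum()
print(missing_per_column)

# How many categorical variables? Here: every non-numeric column
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

Treating every non-numeric column as categorical is a simplification. Dates, free text and numeric codes usually need a closer look.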

2. Preparation and transformation

- drop useless columns
- rename columns
- handle missing values
- handle duplication
- create new features

This is the first step of the feature engineering process. You'll get back to this if you want to add information to your model.
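
The bullets above could look roughly like this. The data, the column names and the choice of median imputation are all assumptions made for the example. The right way to handle missing values depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data, invented for illustration only
raw = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4],
    "Unnamed: 0": [0, 1, 1, 2, 3],
    "Total Price": [100.0, 50.0, 50.0, np.nan, 80.0],
    "Qty": [2, 1, 1, 4, 2],
})

clean = (
    raw
    .drop(columns=["Unnamed: 0"])   # drop useless columns
    .rename(columns={               # rename columns
        "ID": "order_id",
        "Total Price": "total_price",
        "Qty": "quantity",
    })
    .drop_duplicates()              # handle duplication
    .reset_index(drop=True)
)

# Handle missing values: median imputation is one option among many
clean["total_price"] = clean["total_price"].fillna(clean["total_price"].median())

# Create a new feature from existing columns
clean["unit_price"] = clean["total_price"] / clean["quantity"]

print(clean)
```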

3. Univariate analysis

Iterate through each and every relevant variable and get basic information such as

- .hist()
- .value_counts()
- .skew()
- .kurt()

This is the first step of outlier detection. Here is where you get up close and personal with each variable.
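
One way to sketch that loop, on a hypothetical toy DataFrame (the columns and values are assumptions). The `Agg` backend line is only there so the example runs without a display. Drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with one obvious outlier in "spend"
df = pd.DataFrame({
    "spend": [10, 12, 11, 13, 12, 11, 10, 95],
    "segment": ["a", "b", "a", "a", "c", "b", "a", "a"],
})

# Numeric variables: histogram, skewness and kurtosis
summary = {}
for col in df.select_dtypes(include="number").columns:
    summary[col] = {"skew": df[col].skew(), "kurt": df[col].kurt()}
    ax = df[col].hist(bins=10)
    ax.set_title(col)
    plt.close("all")  # in a notebook, use plt.show() instead

# Categorical variables: frequency of each category
segment_counts = df["segment"].value_counts()

print(summary)
print(segment_counts)
```

A strongly positive skew like the one `spend` gets here is a hint to go look for outliers on the right tail.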

4. Multivariate analysis

Now you use seaborn to do scatterplots and pairplots

The first allows you to plot two variables against each other and understand how they move together.

Pairplots do this for all variables, all at the same time.

Play with the hue parameter in seaborn (sns) to color the points by a categorical variable.

This is it. So let me summarize my basic EDA process.

1. Understand the data
2. Prep & transformation
3. Univariate analysis
4. Multivariate analysis

Expand this as you please. Drop a like and a RT and follow for more #datascience and #machinelearning threads like this 👊
