[Data Analysis] 🧵
Exploratory data analysis is a fundamental step in any analysis work. You don't have to be a data scientist proficient at modeling to be a useful asset to your client. Great EDA alone makes you valuable.
Here's a template of a basic yet powerful EDA workflow👇
EDA is incredibly useful. Proper modeling CANNOT happen without it.
The truth:
Stakeholders NEED it far more than modeling.
EDA empowers the analyst with knowledge about the data, which then informs the #machinelearning pipeline
While #pandas and #matplotlib are key to good EDA in #python, the real difference is the QUESTIONS you ask of your dataset.
As in all things, these tools are just tools. The real weapon is the analyst. You are in control, not the dataset.
This is where domain > technical knowledge. If you know the basics of pandas and plt, and you are a domain expert, YOU ARE DONE.
You become the Sherlock Holmes of the data. You have the power to query it and get your answers.
This is what clients pay for.
Here's the template I follow. This is the skeleton of it, and I always expand it depending on the data.
It revolves around pandas, matplotlib and seaborn. You rarely need anything else.
Here we go.
1. Understand your data
.describe()
.info()
.isna()
.dtypes
.shape
Get an idea of what's in this dataset. How many categorical variables? How many empty/missing values? And so on
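A minimal sketch of this first pass, assuming a toy DataFrame (in practice you'd load your own data with `pd.read_csv`):

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset; swap in your own pd.read_csv(...)
df = pd.DataFrame({
    "price": [10.5, 12.0, np.nan, 9.9, 11.2],
    "category": ["a", "b", "a", "c", "b"],
    "units": [3, 5, 2, 8, 1],
})

print(df.shape)          # (rows, columns)
print(df.dtypes)         # numeric vs categorical columns
print(df.isna().sum())   # missing values per column
print(df.describe())     # summary stats for numeric columns
df.info()                # non-null counts and memory usage
```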
2. Preparation and transformation
- drop useless columns
- rename columns
- handle missing values
- handle duplication
- create new features
This is the first step of the feature engineering process. You'll come back to this whenever you want to add information to your model.
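The prep steps above can be chained in pandas. A minimal sketch with made-up column names (`id`, `Unit Price`, `Qty`):

```python
import pandas as pd
import numpy as np

# Hypothetical raw data: a useless "id" column and messy names
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "Unit Price": [10.0, np.nan, np.nan, 8.0],
    "Qty": [2, 5, 5, 3],
})

df = (
    raw
    .drop(columns=["id"])                          # drop useless columns
    .rename(columns={"Unit Price": "unit_price",   # rename columns
                     "Qty": "qty"})
    .drop_duplicates()                             # handle duplication
)
# Handle missing values (median imputation is one common choice)
df["unit_price"] = df["unit_price"].fillna(df["unit_price"].median())
# Create a new feature
df["revenue"] = df["unit_price"] * df["qty"]
```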
3. Univariate analysis
Iterate through each and every relevant variable and get basic information such as
- .hist()
- .value_counts()
- .skew()
- .kurt()
This is the first step of outlier detection. Here is where you get up close and personal with each variable.
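A sketch of that univariate pass on a synthetic dataset (the `amount` and `channel` columns are made up for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.exponential(scale=100, size=500),   # right-skewed numeric
    "channel": rng.choice(["web", "store", "phone"], size=500),
})

# Categorical: frequency table
print(df["channel"].value_counts())

# Numeric: distribution shape
print("skew:", df["amount"].skew())   # > 0 means a long right tail
print("kurt:", df["amount"].kurt())   # heavy tails hint at outliers

df["amount"].hist(bins=30)
plt.title("amount distribution")
plt.show()
```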
4. Multivariate analysis
Now you use seaborn to do scatterplots and pairplots
The first lets you plot two variables against each other and understand how they move together.
Pairplots do this for all variables, all at the same time.
Play with the hue parameter in sns
This is it. So let me summarize my basic EDA process.
1. Understand the data
2. Preparation & transformation
3. Univariate analysis
4. Multivariate analysis
Expand this as you please. Drop a like and a RT and follow for more #datascience and #machinelearning threads like this 👊
Have you ever used sklearn's pipeline class to enhance your analysis?
While not mandatory, pipelines bring important benefits when implemented in your code base.
Here is a short thread of why you should use pipelines to improve your #machinelearning work 👇
In #datascience and machine learning, a pipeline is a set of sequential steps that allows us to control the flow of data.
They are very useful as they make our code cleaner, more scalable and readable. They are used to organize the various phases of a project.
Implementing pipelines is not mandatory but has significant advantages, such as
- cleaner code
- less room for error
- a familiar interface: the whole pipeline trains like a single model with .fit()
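A minimal sketch of that last point, using a synthetic dataset and a scaler + classifier chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Preprocessing and model chained into one object
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

pipe.fit(X, y)            # one .fit() runs every step in order
acc = pipe.score(X, y)    # training accuracy
print(acc)
```

Because the steps live in one object, there's no way to forget to scale new data before prediction: `pipe.predict` applies every step automatically.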