Have you ever used sklearn's pipeline class to enhance your analysis?
While not mandatory, pipelines bring important benefits when adopted in your code base.
Here is a short thread of why you should use pipelines to improve your #machinelearning work 👇
In #datascience and machine learning, a pipeline is a set of sequential steps that allows us to control the flow of data.
They are very useful as they make our code cleaner, more scalable, and more readable. They help organize the various phases of a project.
Implementing pipelines is not mandatory but has significant advantages, such as
- cleaner code
- less room for error
- implemented like a typical model with .fit()
Here's an example in #Python. Look how elegant and readable the code is.
This has noticeable effects on your team members and stakeholders, as clarity drives efficiency and progress.
The pipeline is built in the following way.
1. Take X_train and y_train and apply scaling
2. Feed the scaled data to the SVC
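The two steps above can be sketched like this (a minimal, runnable version — the dataset here is synthetic, just to make the example self-contained):

```python
# Minimal sketch of the pipeline described above: scale, then classify.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),  # step 1: scale the data
    ("svc", SVC()),                # step 2: feed the scaled data to the SVC
])

pipe.fit(X_train, y_train)  # used exactly like a typical model
print(pipe.score(X_test, y_test))
```

Note that .fit() on the pipeline fits the scaler AND the model in one call — that's the whole point.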
A pipeline moves data from the top to the bottom, like a pipe in the real world.
This is what I meant by "controlling the flow of data"
Here are just a few applications of what a pipeline can do for us:
1. apply transformations to specific columns
2. imputation / handle missing values
3. feature engineering
4. feed data to models
and more. You can read more in the official documentation.
You can build pipelines for basically all kinds of data science problems.
Sklearn's API revolves around specific methods (.fit(), .transform(), .predict()), so while we can use pipelines with almost all of sklearn's objects, we need to implement those methods carefully if we want to create our own.
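Here is a minimal custom transformer that satisfies that contract — it only needs .fit() and .transform(), with BaseEstimator/TransformerMixin filling in the rest. The log-transform step is just an illustrative choice:

```python
# A custom pipeline-compatible transformer: implement .fit() and .transform().
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class Log1pTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        return np.log1p(X)  # log(1 + x), safe at zero

pipe = Pipeline([("log", Log1pTransformer())])
X = np.array([[0.0], [np.e - 1]])
print(pipe.fit_transform(X))  # [[0.], [1.]]
```

Because it follows the fit/transform convention, it composes with any other sklearn step.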
As I said, pipelines are not mandatory, nor should you force yourself to use them if you don't think you need them.
They are just one of the many tools that sklearn provides to us analysts.
If you have never tried using pipelines, I suggest you try! You'll be amazed how tidy and clean your notebooks will become.
Also, you can export the whole pipeline to a pickle file, just like you would with any other model.
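A quick sketch of that: pickling a fitted pipeline serializes the preprocessing AND the model together (iris used here just to have data to fit on):

```python
# Export a fitted pipeline exactly like any other sklearn model.
import pickle
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
pipe.fit(X, y)

blob = pickle.dumps(pipe)      # in practice: pickle.dump(pipe, file)
restored = pickle.loads(blob)  # scaler and model come back as one object
print((restored.predict(X) == pipe.predict(X)).all())
```

This is what makes pipelines so handy for deployment: one artifact carries every step from raw input to prediction.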
On a final note, the famous Python auto-ml library #Pycaret uses pipelines for basically all of its tasks - this should give you an idea of the power of this approach!
Thank you for your attention!
Until next time 👋
[Data Analysis] 🧵
Exploratory data analysis is a fundamental step in any analysis work. You don't have to be a data scientist and be proficient at modeling to be a useful asset to your client if you can do great EDA.
Here's a template of a basic yet powerful EDA workflow👇
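A minimal sketch of that first EDA pass — the toy DataFrame and its column names are made up; the steps (shape, types, missing values, summary stats, category counts) are the template:

```python
# A basic EDA pass with pandas: each call answers one question about the data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [100, 120, np.nan, 95],
    "rooms": [3, 2, 4, 3],
    "city": ["Rome", "Milan", "Rome", "Turin"],
})

print(df.shape)                   # how much data do we have?
print(df.dtypes)                  # what type is each column?
print(df.isna().sum())            # where are values missing?
print(df.describe())              # how are numeric columns distributed?
print(df["city"].value_counts())  # how frequent is each category?
```

From here, plotting with matplotlib (histograms, scatter plots, boxplots) turns the same questions into pictures.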
EDA is incredibly useful. Proper modeling CANNOT happen without it.
The truth:
Stakeholders NEED it far more than modeling.
EDA empowers the analyst with knowledge about the data, which then informs every downstream choice in the #machinelearning pipeline
While #pandas and #matplotlib are key to good EDA in #python, the real difference is the QUESTIONS you ask of your dataset.
As in all things, these tools are just tools. The real weapon is the analyst. You are in control, not the dataset.