[ML tools & tips] 🧵

Have you ever used sklearn's Pipeline class to enhance your analysis?

While not mandatory, pipelines bring important benefits if implemented in our code base.

Here is a short thread on why you should use pipelines to improve your #machinelearning work 👇
In #datascience and machine learning, a pipeline is a set of sequential steps that allows us to control the flow of data.
They are very useful as they make our code cleaner, more scalable and readable. They are used to organize the various phases of a project.
Implementing pipelines is not mandatory but has significant advantages, such as

- cleaner code
- less room for error
- same interface as a typical model, with .fit()
Here's an example in #Python. Look how elegant and readable the code is.
This has noticeable effects on your team members and stakeholders, as clarity drives efficiency and progress.
[Image: Python code example]
The pipeline is built in the following way.

1. Take X_train and apply scaling (y_train is passed along unchanged)
2. Feed the scaled data, together with y_train, to the SVC

A pipeline moves data from the top to the bottom, like a pipe in the real world.

This is what I meant by "controlling the flow of data".
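The original screenshot is not preserved here, so what follows is only a minimal sketch of the two steps described above, not the author's exact code. The synthetic dataset and the step names "scaler" and "svc" are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data stands in for a real dataset (assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data flows top to bottom: first the scaler, then the classifier.
pipe = Pipeline([
    ("scaler", StandardScaler()),  # step 1: scale the features
    ("svc", SVC()),                # step 2: fit the SVC on the scaled features
])

# The whole pipeline is used like a single model.
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
```

Calling pipe.predict(X_test) applies the scaler fitted on the training data before handing the result to the SVC, so the test set never needs to be scaled by hand.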
Here are just a few applications of what a pipeline can do for us:

1. apply transformations to specific columns
2. impute / handle missing values
3. feature engineering
4. feed data to models

and more. You can read more in the official documentation.
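As a rough illustration of points 1, 2 and 4 in the list above, here is a sketch that combines ColumnTransformer and SimpleImputer with a classifier. The toy DataFrame, its column names and the choice of LogisticRegression are all assumptions made for the example, not something taken from the thread.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names and a few missing values.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0, 38.0],
    "income": [30.0, 52.0, 41.0, np.nan, 88.0, 60.0],
    "city": ["Rome", "Milan", "Rome", np.nan, "Milan", "Turin"],
})
y = [0, 0, 0, 1, 1, 1]

# Numeric columns: fill gaps with the median, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own transformation.
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

# Preprocessing and model live in one object.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])
model.fit(df, y)
predictions = model.predict(df)
```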
You can build pipelines for basically all kinds of data science problems.
Sklearn's API revolves around specific methods in its classes, such as .fit() and .transform(). So while almost all of sklearn's objects can be used in a pipeline, we need to be careful when we create our own steps: they must expose those same methods.
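To make the "create our own" case concrete, here is a small sketch of a custom step. LogTransformer is a hypothetical class invented for this example; it only shows the methods a pipeline expects from an intermediate step.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class LogTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical custom step: applies log(1 + x) to every feature."""

    def fit(self, X, y=None):
        # Nothing to learn, but fit must exist and return self.
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


X = np.array([[0.0, 9.0], [1.0, 99.0], [3.0, 999.0]])

pipe = Pipeline([
    ("log", LogTransformer()),    # custom step
    ("scale", StandardScaler()),  # built-in step
])
X_out = pipe.fit_transform(X)
```

Inheriting from TransformerMixin supplies fit_transform() for free, and BaseEstimator adds get_params() and set_params(), which tools like grid search rely on.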
As I said, pipelines are not mandatory, nor should you force yourself to use them if you don't think you need them.

They are just one of the many tools that sklearn provides to us analysts.
If you have never tried using pipelines, I suggest you try! You'll be amazed how tidy and clean your notebooks will become.
Also, you can export the whole pipeline to a pickle file, like you would with any other model.
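A minimal sketch of that export, assuming a small scaler + SVC pipeline on synthetic data and a temporary file named pipeline.pkl (both are placeholders for this example):

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Fit a small pipeline on synthetic data (assumption for this sketch).
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())]).fit(X, y)

# Save the fitted pipeline, preprocessing included, to a single file.
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(pipe, f)

# Load it back and use it like the original object.
with open(path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X) == pipe.predict(X)).all())
```

Only load pickle files from sources you trust, and ideally with the same sklearn version that produced them.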
On a final note, the famous Python AutoML library #PyCaret uses pipelines for basically all of its tasks - this should give you an idea of the power of this approach!

Thank you for your attention!
Until next time 👋

Thread by Andrea D'Agostino
