Data scientist. Technical writer @ https://t.co/W9D4Xxik9p and https://t.co/2riIqO5118
Oct 10, 2022 • 11 tweets • 5 min read
Data analysts should be able to efficiently build datasets using the content on the web.
Here is a use case for a small script I wrote that builds corpora of text from online blogs.
People working in #seo can definitely use this👇
It is a #Python script that organizes the collected data in a #pandas DataFrame for ease of use and readability.
It leverages Trafilatura, an open-source library that can discover a website's pages through its sitemap and extract the main content of each page.
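A minimal sketch of how such a script could be put together, not the script from the thread itself. The function names `build_corpus` and `corpus_from_sitemap`, the `url` and `article` column names, and the `limit` parameter are all made up for illustration. The only Trafilatura calls used are `sitemap_search`, `fetch_url` and `extract`.

```python
import pandas as pd


def build_corpus(urls, fetch=None, extract=None):
    """Download each URL and collect its main text in a DataFrame.

    `fetch` and `extract` default to Trafilatura's `fetch_url` and `extract`;
    they are parameters only so the function can be tried offline.
    """
    if fetch is None or extract is None:
        # Imported lazily so stand-in functions work without Trafilatura.
        import trafilatura

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    rows = []
    for url in urls:
        downloaded = fetch(url)
        if downloaded is None:  # download failed, skip this page
            continue
        text = extract(downloaded)
        if text:  # nothing is kept when no main content is found
            rows.append({"url": url, "article": text})
    # Column names are illustrative; an empty result still has both columns.
    return pd.DataFrame(rows, columns=["url", "article"])


def corpus_from_sitemap(homepage, limit=10):
    """Find pages through the site's sitemap, then build the corpus."""
    from trafilatura.sitemaps import sitemap_search

    links = sitemap_search(homepage)
    return build_corpus(links[:limit])
```

Calling `corpus_from_sitemap` with a blog's homepage URL should return one row per page whose main text could be extracted. Keeping `limit` small while experimenting avoids hammering the site with requests.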
May 4, 2022 • 10 tweets • 3 min read
[Data Analysis] 🧵
Exploratory data analysis is a fundamental step in any analysis work. If you can do great EDA, you can be a valuable asset to your client without being a data scientist or being proficient at modeling.
Here's a template of a basic yet powerful EDA workflow👇
EDA is incredibly useful. Proper modeling CANNOT happen without it.
The truth:
Stakeholders NEED it far more than modeling.
EDA gives the analyst knowledge about the data, which then informs the #machinelearning pipeline.
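As a rough illustration of the first-pass checks EDA usually covers (not necessarily the steps of the template mentioned above), here is a small pandas sketch. The helper name `quick_eda` and the toy data are invented for this example.

```python
import pandas as pd


def quick_eda(df):
    """Return a few basic summaries of a DataFrame as a dict."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,                         # rows, columns
        "dtypes": df.dtypes,                       # type of each column
        "missing": df.isna().sum(),                # missing values per column
        "duplicates": int(df.duplicated().sum()),  # fully repeated rows
        "summary": numeric.describe(),             # count, mean, std, quartiles
        "correlations": numeric.corr(),            # pairwise Pearson correlation
    }


# Toy data, made up for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, None, 32],
    "income": [30000, 52000, 81000, 45000, 52000],
    "city": ["Rome", "Milan", "Rome", "Turin", "Milan"],
})

report = quick_eda(df)
print(report["missing"])
print(report["summary"])
```

Each entry answers a question stakeholders tend to ask early: how much data there is, how much of it is missing or repeated, and how the numeric columns are distributed and related.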
May 2, 2022 • 10 tweets • 3 min read
[ML tools & tips] 🧵
Have you ever used sklearn's Pipeline class to enhance your analysis?
While not mandatory, pipelines bring important benefits if implemented in our code base.
Here is a short thread on why you should use pipelines to improve your #machinelearning work 👇
In #datascience and machine learning, a pipeline is a set of sequential steps that allows us to control the flow of data.
Pipelines are useful because they make our code cleaner, more readable and more scalable, and they help organize the various phases of a project.
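To make the idea of sequential steps concrete, here is a small sketch using scikit-learn's `Pipeline`. The synthetic dataset and the choice of a scaler followed by a logistic regression are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, generated only for this example
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) pair; data flows through them in order
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits the scaler on the training data, transforms it, then fits the model
pipe.fit(X_train, y_train)

# score() applies the already fitted scaler to the test data before predicting
accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Because the scaler is fitted only inside `fit`, statistics from the test set never leak into preprocessing, which is one of the practical benefits of keeping every step inside the pipeline.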