It is a #Python script that efficiently organizes data in a #pandas dataframe for ease of use and readability.
It leverages Trafilatura, an open-source library that can read and follow the links in a website's sitemap and identify the main content of each page.
This is particularly important because each website has its own structure and it is often difficult to grab the main content as opposed, for instance, to the sidebar.
Trafilatura uses a series of heuristics to find the main text of the page, and it works pretty damn well.
By leveraging pandas, we can dump Trafilatura's output into a spreadsheet-style object in Python: the DataFrame.
The function below iterates through a list of websites we pass to it and builds a clean, usable dataset for our analysis.
Conveniently, the columns are just the URL and the article's content.
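Here's a minimal sketch of what such a function could look like (the name build_dataset and the exact column names are my own placeholders, not the original script's):

```python
import pandas as pd
import trafilatura
from trafilatura.sitemaps import sitemap_search

def build_dataset(websites):
    """Collect the main content of every page listed in each site's sitemap."""
    rows = []
    for site in websites:
        # sitemap_search follows the site's sitemap and returns page URLs
        for url in sitemap_search(site):
            downloaded = trafilatura.fetch_url(url)
            if downloaded is None:
                continue  # skip pages that failed to download
            # extract() applies Trafilatura's heuristics to isolate the main text
            rows.append({"url": url, "content": trafilatura.extract(downloaded)})
    return pd.DataFrame(rows)

df = build_dataset(["https://example.com"])
```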
Once you launch the script, you'll also get to leverage TQDM, a super cool Python lib that adds progress bars to your applications.
No drawbacks, just pure clarity. It's a no-brainer in almost all cases, and it also integrates with pandas.
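A quick sketch of how little it takes (the URLs and DataFrame here are just toy examples):

```python
from tqdm import tqdm
import pandas as pd

urls = ["https://example.com/a", "https://example.com/b"]

# Wrap any iterable in tqdm() to get a live progress bar
for url in tqdm(urls, desc="Scraping"):
    pass  # fetching/extraction would go here

# The pandas integration: tqdm.pandas() registers .progress_apply()
tqdm.pandas()
df = pd.DataFrame({"url": urls})
df["length"] = df["url"].progress_apply(len)
```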
You might be asking... what software am I talking about? Well, I wrote a piece on #medium some time ago where I go through the steps of building this script for your own analysis.
While experienced Python developers and analysts know that scraping can be done in many different, efficient ways, this is an introductory script that can help new analysts take their first steps in the field of web #scraping and Python.
It goes without saying that you can and should expand this script as you see fit. See it as a template for starting your own project.
Also, Trafilatura does not work with JavaScript-rendered websites: in that case the output will simply be empty. That's by design, since it only parses the static HTML it downloads.
If you want to scrape content from JS-heavy websites, I suggest you use #playwright.
It's Selenium on steroids: fast to set up and a breeze to get going. Try it out.
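A minimal sketch of the idea, assuming Playwright is installed (pip install playwright, then playwright install); the URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    page.wait_for_load_state("networkidle")  # let the JS finish rendering
    html = page.content()  # the fully rendered HTML
    browser.close()

# You can then hand the rendered HTML to Trafilatura as usual:
# trafilatura.extract(html)
```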
Now go out there and build your own dataset!
If you want to reach out to know more, please do. I reply to DMs here and on LinkedIn.
If you want to follow my content creation, sign up for my Medium newsletter.
[Data Analysis] 🧵
Exploratory data analysis is a fundamental step in any analysis work. You don't have to be a data scientist proficient at modeling to be a useful asset to your client: great EDA alone can get you there.
Here's a template of a basic yet powerful EDA workflow👇
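Here's a minimal sketch of what that workflow could look like in code ("dataset.csv" is a placeholder for your own data):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset.csv")

# 1. First look: size, types, missing values
print(df.shape)
df.info()
print(df.isna().sum())

# 2. Summary statistics for every column
print(df.describe(include="all"))

# 3. Distributions of the numeric features
df.hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()

# 4. Pairwise correlations between numeric features
print(df.corr(numeric_only=True))
```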
EDA is incredibly useful. Proper modeling CANNOT happen without it.
The truth:
Stakeholders NEED it far more than modeling.
EDA empowers the analyst with knowledge about the data, which then informs the #machinelearning pipeline.
While #pandas and #matplotlib are key to good EDA in #python, the real difference is the QUESTIONS you ask of your dataset.
As in all things, these tools are just tools. The real weapon is the analyst. You are in control, not the dataset.
Have you ever used sklearn's Pipeline class to enhance your analysis?
While not mandatory, pipelines bring important benefits if implemented in our code base.
Here is a short thread of why you should use pipelines to improve your #machinelearning work 👇
In #datascience and machine learning, a pipeline is a set of sequential steps that allows us to control the flow of data.
They are very useful as they make our code cleaner, more scalable, and more readable. They help organize the various phases of a project.
Implementing pipelines is not mandatory, but it has significant advantages, such as:
- cleaner code
- less room for error
- a familiar interface: the whole pipeline trains like a typical model with .fit()
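Here's a minimal sketch on a toy dataset to show the idea:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each step is a (name, estimator) pair; data flows through them in order
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# The whole pipeline behaves like a single model
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

Note that the scaler is fit on the training data only and reapplied at prediction time, which is exactly the kind of leakage mistake pipelines help prevent.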