Data analysts should be able to efficiently build datasets from content on the web.

Here I'll show you a use case for a small script I created that lets you build corpora of textual information from online blogs.

People working in #seo can definitely use this👇
It is a #Python script that efficiently organizes data in a #pandas dataframe for ease of use and readability.

It leverages Trafilatura, an open-source library that can read and follow links from a website's sitemap and identify the main content of each page.
This is particularly important because every website has its own structure, and it is often difficult to grab the main content as opposed to, say, the sidebar.
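
To give you an idea, here's a minimal sketch of that part of Trafilatura's API (the blog URL is just a placeholder):

import trafilatura
from trafilatura.sitemaps import sitemap_search

# Discover article URLs listed in the site's sitemap
urls = sitemap_search("https://example-blog.com")

# Download one page and pull out just its main content
downloaded = trafilatura.fetch_url(urls[0])
main_text = trafilatura.extract(downloaded)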

Trafilatura uses a series of heuristics to find the main text of the page, and it works pretty damn well.
By leveraging pandas, we can dump Trafilatura's output into a spreadsheet-style object in Python.

The function below iterates through a list of websites we pass to it and builds a clean, usable dataset for our analysis.
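
The full code is in the Medium piece linked below; this is only a minimal sketch of what such a function can look like (the build_corpus name and the "article" column are my placeholders):

import pandas as pd
import trafilatura
from tqdm import tqdm
from trafilatura.sitemaps import sitemap_search

def build_corpus(websites):
    """Collect the main text of every article found in each site's sitemap."""
    rows = []
    for site in websites:
        # tqdm wraps the loop with a progress bar, one per website
        for url in tqdm(sitemap_search(site), desc=site):
            downloaded = trafilatura.fetch_url(url)
            text = trafilatura.extract(downloaded) if downloaded else None
            if text:  # skip pages where no main content could be found
                rows.append({"url": url, "article": text})
    return pd.DataFrame(rows)

corpus = build_corpus(["https://example-blog.com"])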

Conveniently, the columns are just the URL and the content of the article.
Once you launch the script, you'll also be able to leverage TQDM, a super cool Python lib that adds progress bars to our applications.

No drawbacks, just pure clarity. It's a no-brainer in almost all cases, and it also integrates with pandas.
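
For example, here's a small sketch of that integration, reusing the corpus dataframe from the sketch above:

from tqdm import tqdm

tqdm.pandas()  # registers progress_apply on pandas objects

# progress_apply behaves like apply, but shows a progress bar
corpus["n_words"] = corpus["article"].progress_apply(lambda text: len(text.split()))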
You might be asking... what software are you talking about? Well, I wrote a piece on #medium some time ago where I go through the steps of building this software for your own analyses.

You can find it at this link here 👇

medium.com/mlearning-ai/h…

While experienced Python developers and analysts know well that scraping can be done in many different and efficient ways, this is an introductory script that could help new analysts take their first steps in the field of web #scraping and Python.
It goes without saying that you can and should expand this script as you see fit. See it as a template for starting your own project.

Also, Trafilatura does not execute JavaScript, so on sites that render their content with JS the output will be empty. That's by design.
If you want to scrape content from JS-heavy websites, I suggest you use #playwright.

It's Selenium on steroids - fast to set up and a breeze to get going. Try it out.
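
Here's a minimal sketch of how it could be paired with Trafilatura (assuming you've run pip install playwright and playwright install chromium first; the URL is a placeholder):

import trafilatura
from playwright.sync_api import sync_playwright

def fetch_rendered(url):
    # Let a headless browser execute the page's JavaScript before extraction
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # the fully rendered HTML
        browser.close()
    return html

text = trafilatura.extract(fetch_rendered("https://example.com/js-heavy-page"))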

Now go out there and build your own dataset!
If you want to reach out to know more, please do. I reply to DMs here and on LinkedIn.

If you want to follow my content creation, sign up for my Medium newsletter.

If you are Italian, visit diariodiunanalista.it and sign up there too.
If you want to share your project with me and want my opinion on it, drop me a message. I'll be happy to follow up with you.

Stay tuned 💪
