Discover and read the best of Twitter Threads about #scraping

Most recent (7)

Data analysts should be able to efficiently build datasets using the content on the web.

Here I will show you a use case for a small script I created that allows the user to build corpora of textual information from online blogs.

People working in #seo can definitely use this👇
It is a #Python script that organizes the data in a #pandas DataFrame for ease of use and readability.

It leverages Trafilatura, an open-source library that can read and follow the links in a website's sitemap and identify the main content of each page.
This is particularly important because each website has its own structure and it is often difficult to grab the main content as opposed, for instance, to the sidebar.

Trafilatura uses a series of heuristics to find the main text of the page, and it works remarkably well.
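
To make the workflow above concrete, here is a minimal sketch rather than the author's actual script. It assumes Trafilatura's `sitemap_search`, `fetch_url` and `extract` functions; the helper names `build_corpus` and `corpus_from_sitemap`, the column names, and the `limit` parameter are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Optional


def build_corpus(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
) -> list:
    """Download each URL and keep only pages where main text was found."""
    rows = []
    for url in urls:
        html = fetch(url)
        if html is None:  # download failed, skip the page
            continue
        text = extract(html)
        if text:  # extraction can return None or an empty string
            rows.append({"url": url, "text": text})
    return rows


def corpus_from_sitemap(homepage: str, limit: int = 20):
    """Collect links from the sitemap and return a DataFrame of url/text rows."""
    # Third-party imports are kept local so build_corpus stays dependency-free.
    import pandas as pd
    import trafilatura
    from trafilatura.sitemaps import sitemap_search

    urls = sitemap_search(homepage)[:limit]  # links discovered via the sitemap
    rows = build_corpus(urls, trafilatura.fetch_url, trafilatura.extract)
    return pd.DataFrame(rows, columns=["url", "text"])
```

Calling `corpus_from_sitemap("https://example.org")` (a placeholder address) would return one row per page whose main text could be extracted. Passing the fetch and extract steps in as functions is a design choice that makes it easy to swap in a cache or a different extractor.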
Read 11 tweets
Twitter Mining & Web Scraping Projects using Python 🐍

Thread: 🧵

#Python #pythonprojects #Scraping #Mining
Mining Twitter Data with Python

1: Collecting Data (this article)
2: Text Pre-processing
3: Term Frequencies
4: Rugby and Term Co-Occurrences
5: Data Visualisation Basics
6: Sentiment Analysis Basics
7: Geolocation and Interactive Maps

Web Scraping with Scrapy and MongoDB

Python program to scrape data from Stack Overflow to grab new questions (question title and URL).
Scraped data should then be stored in MongoDB.
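
As a rough sketch of the storage half of that project (not the code from the linked tutorial), a Scrapy item pipeline could upsert each scraped question into MongoDB. The class name `MongoQuestionsPipeline`, the connection URI, and the database and collection names are all assumptions; the spider that yields `{"title": ..., "url": ...}` items is not shown.

```python
class MongoQuestionsPipeline:
    """Scrapy item pipeline that stores question items in MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="stackoverflow", collection=None):
        # URI and database name are placeholder assumptions.
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection = collection  # can be injected, e.g. for testing
        self.client = None

    def open_spider(self, spider):
        # Connect lazily so pymongo is only needed when a crawl really runs.
        if self.collection is None:
            from pymongo import MongoClient
            self.client = MongoClient(self.mongo_uri)
            self.collection = self.client[self.db_name]["questions"]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Upsert on the URL so re-running the spider does not duplicate rows.
        self.collection.update_one(
            {"url": item["url"]},
            {"$set": {"title": item["title"], "url": item["url"]}},
            upsert=True,
        )
        return item
```

To use it, the class would be listed under `ITEM_PIPELINES` in the Scrapy project settings. Keying the upsert on the question URL is one possible choice; a question ID parsed from the URL would work as well.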

Read 5 tweets
Three days left in the year, so I decided to do a countdown of the TOP 3 blog posts I've written in 2021, with some context. 🧵

Direct to #3: DOs and DON'Ts of Web #Scraping…
Published December 21, the last one of the year, and straight to #3! No way we could have seen it coming.

SEO wasn't a factor here; there was no time for it to work. #GoogleDiscover drove the traffic on our official blog, and the post also has great numbers on the other sites where we publish.
Coming in at #2 is Web Scraping with #Javascript and #NodeJS

Probably the longest post, and the one with the most code in the whole lot. Published on September 1, its primary traffic source is Google through SEO, with many interesting keywords in high positions.…
Read 7 tweets
New post! DOs and DON'Ts of Web #Scraping

Learn how to create better web scrapers by following best practices and avoiding common mistakes. Choose the right approach for the job thanks to these tips.…
1. DO Rotate IPs

The most common anti-scraping solution is to ban by IP. By using proxies, you'll avoid showing the same IP in every request and thus increase your chances of success. You can code your own rotation or use a rotating proxy service.
2. DO Use Custom User-Agent

Overwrite your client's User-Agent, or you'll risk sending something like "curl/7.74.0".

But always sending the same UA might look suspicious, too, so you need a large, up-to-date list.
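
The two tips above can be sketched together with the standard library alone. This is an illustration under stated assumptions, not the code from the post: the proxy addresses are placeholders from a documentation-only IP range, the User-Agent strings are just sample values, and the helper names `prepare` and `fetch` are hypothetical.

```python
import itertools
import random
import urllib.request

# Placeholder proxies (203.0.113.0/24 is reserved for documentation).
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]

# Sample browser-like User-Agent strings; a real list should be kept current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def prepare(url, proxy_cycle, user_agents):
    """Pick the next proxy and a random User-Agent for one request."""
    proxy = next(proxy_cycle)  # round-robin over the proxy pool
    request = urllib.request.Request(
        url, headers={"User-Agent": random.choice(user_agents)}
    )
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    return opener, request, proxy


def fetch(url, proxy_cycle, user_agents, timeout=10):
    """Send the request through the chosen proxy and return the raw body."""
    opener, request, _ = prepare(url, proxy_cycle, user_agents)
    with opener.open(request, timeout=timeout) as response:
        return response.read()
```

A caller would create the cycle once, for example `proxy_cycle = itertools.cycle(PROXIES)`, and pass it to every `fetch` call so that consecutive requests leave through different proxies.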
Read 10 tweets
New post! Web #Scraping: Intercepting XHR Requests.

Take advantage of XHR requests and scrape website content with very little effort. No need for fickle HTML or CSS selectors; API endpoints tend to remain stable.…
@playwrightweb and other headless browsers allow response/request interception. We can take advantage of that and inspect the traffic easily.

Those responses usually come already formatted and structured.
We covered several sites as examples, but the opportunities are endless.

And not just for the first load, but from any subsequent browsing. The same rules apply.
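
A minimal sketch of that interception idea, assuming Playwright's Python sync API: register a `response` listener, keep only XHR/fetch responses that declare JSON, and read their bodies once the page has settled. The helper names `is_json_xhr` and `capture_json_responses` are hypothetical, and waiting for `networkidle` is just one possible stopping rule.

```python
def is_json_xhr(resource_type, content_type):
    """True for XHR/fetch responses whose Content-Type declares JSON."""
    return resource_type in ("xhr", "fetch") and (
        "application/json" in (content_type or "")
    )


def capture_json_responses(url):
    """Load a page headlessly and return the JSON payloads it requested."""
    # Imported here so the filter above can be used without Playwright.
    from playwright.sync_api import sync_playwright

    matched = []

    def on_response(response):
        content_type = response.headers.get("content-type")
        if is_json_xhr(response.request.resource_type, content_type):
            matched.append(response)

    captured = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", on_response)  # fires for every response
        page.goto(url, wait_until="networkidle")
        for response in matched:
            try:
                captured.append({"url": response.url, "data": response.json()})
            except Exception:
                # Body missing or not valid JSON; skip it in this sketch.
                continue
        browser.close()
    return captured
```

Because the listener stays attached to the page, responses triggered by later clicks or scrolling would be collected the same way, as long as they happen before the bodies are read.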
Read 4 tweets
It is hard to understand the #Foglio revelations without access to the documents. From what I can gather, the data collected were simply extracted from social networks through #scraping and perhaps from the #deepweb: collecting public data is simply doxing
If, however, it were to emerge that the data collected by the Chinese company on figures from politics, business, etc. include confidential information, then it is a different matter. Hard to tell with so little information
I am glad to see voices being raised in politics asking for clarity about the data collection by the Chinese company. Let us remember the shameful silence of politicians on our revelations about the #NSA wiretaps. And the disturbing cover-up of the #Roma prosecutor's office investigation
Read 3 tweets
