Latest Twitter Threads by @AnderRV_ on Thread Reader App

Jun 8, 2022 • 7 tweets • 2 min read

Avoid getting blocked when web scraping by rotating proxies. Learn how to build a simple but effective proxy rotator using Python.

In short: pick at random from an IP pool in which every proxy is available and health-checked.

Start by getting a list of proxies. For the demo, we'll use free proxies from an online list.

Then, check which ones are working. We'll call a simple page that returns the caller IP in plain text. Enough to validate that the proxy is up and running.
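A minimal sketch of that check-then-rotate idea, not the code from the post. It uses only the standard library; the echo URL (`IP_ECHO_URL`) and the helper names `fetch_ip_via`, `healthy_proxies`, and `pick_proxy` are assumptions made for illustration.

```python
import random
from typing import Callable, Iterable, List, Optional
from urllib.request import ProxyHandler, build_opener

# Assumption: any page that returns the caller IP as plain text would do here.
IP_ECHO_URL = "https://ident.me"


def fetch_ip_via(proxy: str, timeout: float = 5.0) -> Optional[str]:
    """Return the IP the echo page sees through `proxy` ("host:port"), or None."""
    handler = ProxyHandler({"http": f"http://{proxy}", "https": f"http://{proxy}"})
    opener = build_opener(handler)
    try:
        with opener.open(IP_ECHO_URL, timeout=timeout) as response:
            return response.read().decode().strip()
    except Exception:
        # Free proxies fail in many ways (timeouts, resets, bad replies);
        # for a health check, any failure simply means "not usable".
        return None


def healthy_proxies(
    candidates: Iterable[str],
    check: Callable[[str], Optional[str]] = fetch_ip_via,
) -> List[str]:
    """Keep only the proxies for which the health check returned something."""
    return [proxy for proxy in candidates if check(proxy) is not None]


def pick_proxy(pool: List[str]) -> str:
    """Rotate by picking a random proxy from the already-validated pool."""
    if not pool:
        raise ValueError("no healthy proxies available")
    return random.choice(pool)
```

In use, `healthy_proxies` would run once (or periodically) over the free list, and `pick_proxy` would be called before each request. The `check` parameter is there so the health check can be swapped out, for example with a stub while testing.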

zenrows.com/blog/how-to-ro…

May 17, 2022 • 5 tweets • 2 min read

How to speed up web scraping? By using concurrency!

Send multiple requests simultaneously and take advantage of async programming.

Learn the basics in Python and step up your scraping.

We'll start by writing a demo scraper, then move towards concurrency step by step.

First, build a simple script using the asyncio library. The lib offers the functionality we need, allowing us to start several tasks concurrently and wait for them to finish.
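A small sketch of that first step, assuming a fake `fetch_page` coroutine that sleeps instead of calling a real site (the function names and the 0.1 s delay are made up for the demo). `asyncio.gather` starts all the coroutines and waits until every one has finished.

```python
import asyncio
import time


async def fetch_page(page: int) -> str:
    # Stand-in for a real HTTP request: pretend the network takes 0.1 s.
    await asyncio.sleep(0.1)
    return f"page-{page}"


async def scrape_all(pages):
    # gather() runs the coroutines concurrently and returns their results
    # in the same order as the inputs.
    return await asyncio.gather(*(fetch_page(page) for page in pages))


def run_demo(pages=range(1, 6)):
    start = time.perf_counter()
    results = asyncio.run(scrape_all(pages))
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Because the waits overlap, five simulated requests should take roughly as long as one, rather than five times as long.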

zenrows.com/blog/speed-up-…

Jan 19, 2022 • 8 tweets • 4 min read

New post! Web Scraping with #Python 101

Learn how to build a web scraper with Python using Requests and BeautifulSoup libraries. We will cover, step-by-step, a scraping process on a job board.

zenrows.com/blog/web-scrap…

#WebScraping can be divided into four main steps:

1. Explore the target site before coding
2. Retrieve the content (HTML)
3. Extract the data you need with selectors
4. Transform and store it for later use
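Steps 2 to 4 can be sketched as follows. This is not the post's code: to stay runnable offline and without extra packages, it swaps in the standard library's `html.parser` for Requests and BeautifulSoup, and the HTML snippet, class names, and job data are invented for the example.

```python
import csv
import io
from html.parser import HTMLParser

# Step 2 stand-in: in a real scraper this HTML would come from an HTTP request.
SAMPLE_HTML = """
<ul>
  <li class="job"><h2 class="title">Python Developer</h2><span class="company">Acme</span></li>
  <li class="job"><h2 class="title">Data Engineer</h2><span class="company">Globex</span></li>
</ul>
"""


class JobParser(HTMLParser):
    """Step 3: collect the text of elements whose class is 'title' or 'company'."""

    FIELDS = {"title", "company"}

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._field = None  # field whose text we are waiting for, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "job":
            self.jobs.append({})  # a new job card starts here
        elif css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.jobs:
            self.jobs[-1][self._field] = data.strip()
            self._field = None


def extract_jobs(html: str):
    parser = JobParser()
    parser.feed(html)
    return parser.jobs


def to_csv(jobs) -> str:
    """Step 4: transform the extracted rows into CSV text ready to be stored."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "company"])
    writer.writeheader()
    writer.writerows(jobs)
    return buffer.getvalue()
```

Step 1, exploring the target site, is what tells you which elements and classes to look for; here they are simply assumed.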

Dec 29, 2021 • 7 tweets • 3 min read

3 days to finish the year, and I decided to do a countdown with the TOP 3 blog posts I've written in 2021. And some context. 🧵

Direct to #3: DOs and DON'Ts of Web #Scraping

zenrows.com/blog/dos-and-d…

Published December 21, the last one of the year, and straight to #3! No way we could have seen it coming.

SEO wasn't a factor here; there was no time for it to work. #GoogleDiscover launched it there on our official blog. But even on the other sites where we publish, it has great numbers.

Dec 21, 2021 • 10 tweets • 3 min read

New post! DOs and DON'Ts of Web #Scraping

Learn how to create better web scrapers by following best practices and avoiding common mistakes. Choose the right approach for the job thanks to these tips.

zenrows.com/blog/dos-and-d…

1. DO Rotate IPs

The most common anti-scraping solution is to ban by IP. By using proxies, you'll avoid showing the same IP in every request and thus increase your chances of success. You can code your own or use a Rotating Proxy like zenrows.com.

Nov 30, 2021 • 8 tweets • 2 min read

New post! Web Scraping with Selenium in Python.

Learn how to navigate and scrape websites using @SeleniumHQ in #Python, even dynamic content, thanks to Javascript Rendering and other available features.

zenrows.com/blog/web-scrap…

After installing and launching the basics, we will start selecting elements in several ways. That will allow us to interact with them, clicking or filling in forms.

For the particular case of infinite scroll or lazy loaded content, we'll take advantage of sending keyboard inputs.

Oct 27, 2021 • 4 tweets • 2 min read

New post! Web #Scraping: Intercepting XHR Requests.

Take advantage of XHR requests and scrape website content without any effort. No need for fickle HTML or CSS selectors; API endpoints tend to remain stable.

zenrows.com/blog/web-scrap…

@playwrightweb and other headless browsers allow response/request interception. We can take advantage of that and inspect them easily.

Those responses usually come already formatted and structured.

Sep 29, 2021 • 7 tweets • 3 min read

New post! Blocking Resources in Playwright 🚧

Save time and money by downloading only the essential resources while web scraping or testing.

zenrows.com/blog/blocking-…

Learn how to use @playwrightweb in #python to prevent CSS files, images, or JavaScript from loading and executing.

You can use glob ("**/*.svg") or #regex (r"\.(jpg|png|svg)$").
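To see what the regex from the tweet would and wouldn't catch, here is a tiny standalone sketch. `should_block` is a made-up helper name and the URLs are examples; in Playwright for Python, a compiled pattern like this can be passed to `page.route` together with a handler that calls `route.abort()`.

```python
import re

# The pattern quoted above: URLs that end in .jpg, .png, or .svg.
BLOCKED = re.compile(r"\.(jpg|png|svg)$")


def should_block(url: str) -> bool:
    """Return True when the URL matches the blocked-extension pattern."""
    return BLOCKED.search(url) is not None


examples = [
    "https://example.com/logo.svg",         # ends in .svg: blocked
    "https://example.com/app.js",           # not an image: allowed
    "https://example.com/photo.jpg?w=200",  # query string after .jpg: not matched
]
decisions = {url: should_block(url) for url in examples}
```

Note the last case: because the pattern is anchored with `$`, an image URL followed by a query string slips through, so the pattern may need loosening depending on the target site.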

Sep 1, 2021 • 7 tweets • 2 min read

New post! Web Scraping with #Javascript and #NodeJs.

Learn how to build a web scraper, add anti-blocking techniques, a headless browser, and parallelize requests with a queue.

zenrows.com/blog/web-scrap…

You'll first build a web scraper using Axios and Cheerio, then a headless browser - Playwright.

Start by getting the HTML and parsing the content for new links to follow, applying Cheerio and CSS Selectors. Extract the content you need in the same way.

Jul 29, 2021 • 8 tweets • 2 min read

New post! Stealth Web Scraping in #Python: Avoid Blocking Like a Ninja. We share the best techniques for massive scale scraping.

From the basics, such as avoiding rate limits or adding proxies, to more complex ones, such as a full set of headers or behavioral patterns.

zenrows.com/blog/stealth-w…

For the basic defensive protections, rotating proxies with the correct headers should be enough.

For slightly more complex ones, residential IPs may be necessary.

Captchas can be solved nowadays, but it is best to bypass them. The same applies to login or paywalls.

Jun 8, 2021 • 4 tweets • 2 min read

I've been busy these past months with a new project and here is one of the results: I published my first ever public blog post.

zenrows.com/blog/collectin…

We collected data from almost 3000 houses in Bilbao and used a heatmap to show the density by price per m2.

The data comes directly from a well-known real estate website, and we obtained it using ZenRows Tasks, which is the project I've been working on: app.zenrows.com/register?task=…
