More from @AnderRV_

Ander

@AnderRV_

21 Dec

New post! DOs and DON'Ts of Web #Scraping

Learn how to create better web scrapers by following best practices and avoiding common mistakes. Choose the right approach for the job thanks to these tips.

zenrows.com/blog/dos-and-d…

1. DO Rotate IPs

The most common anti-scraping measure is banning by IP. Using proxies avoids showing the same IP in every request and thus increases your chances of success. You can code your own rotation or use a rotating proxy service like zenrows.com.
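
A minimal sketch of the "code your own" route, using only Python's standard library. The proxy addresses are hypothetical placeholders (documentation-range IPs), and round-robin is just one possible rotation policy, not necessarily the one the post uses.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses (documentation-range IPs); replace with real ones.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# cycle() yields the proxies in order and starts over after the last one.
_proxy_pool = itertools.cycle(PROXIES)


def next_proxy() -> str:
    """Return the next proxy in round-robin order."""
    return next(_proxy_pool)


def build_opener_for(proxy: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch `url` through the next proxy in the pool."""
    opener = build_opener_for(next_proxy())
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to `fetch` leaves from a different proxy, so consecutive requests do not share an IP.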

2. DO Use Custom User-Agent

Overwrite your client's User-Agent, or you'll risk sending something like "curl/7.74.0".

But always sending the same UA might look suspicious, too, so you need a large, up-to-date list.
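
As an illustration, here is a small sketch that picks a random User-Agent per request with the standard library. The two UA strings are only samples and will go stale; a real list would be much longer and refreshed regularly.

```python
import random
import urllib.request

# Sample User-Agent strings for illustration only; keep a real list current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
]


def build_request(url: str) -> urllib.request.Request:
    """Create a request whose User-Agent is chosen at random from the list."""
    user_agent = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Passing the result to `urllib.request.urlopen` sends the chosen UA instead of the client's default one.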

Ander

@AnderRV_

30 Nov

New post! Web Scraping with Selenium in Python.

Learn how to navigate and scrape websites, including dynamic content, using @SeleniumHQ in #Python, thanks to JavaScript rendering and other available features.

zenrows.com/blog/web-scrap…

After installing and launching the basics, we will start selecting elements in several ways. That will allow us to interact with them, for example to click or fill in forms.

For the particular case of infinite scroll or lazy loaded content, we'll take advantage of sending keyboard inputs.
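
A rough sketch of the keyboard approach with Selenium 4: press End on the page body until the document height stops growing. The function name, the press limit, and the pause are assumptions made for illustration, and the stop condition is a simple heuristic.

```python
import time


def scroll_to_bottom(driver, max_presses: int = 20, pause: float = 1.0) -> int:
    """Press End until the page height stops growing; return presses used."""
    # Imported lazily so this module loads even where Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    body = driver.find_element(By.TAG_NAME, "body")
    last_height = driver.execute_script("return document.body.scrollHeight")
    for presses in range(1, max_presses + 1):
        body.send_keys(Keys.END)  # jump to the bottom, triggering lazy loading
        time.sleep(pause)         # give the page time to append new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return presses        # nothing new was loaded, so stop
        last_height = new_height
    return max_presses
```

Here `driver` is any Selenium WebDriver instance that has already loaded the target page.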

Another typical use case is waiting for an element to be present. This is needed on dynamic pages that load their data via XHR, where the initial HTML is just a skeleton with no actual content.

We can wait for the content to be present before starting the data extraction.
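
One way to express that wait in Selenium 4 is an explicit wait on a CSS selector. This is a sketch: the helper name and the default timeout are assumptions, and the selector depends entirely on the target page.

```python
def wait_for_items(driver, css_selector: str, timeout: float = 10.0):
    """Block until at least one element matches, then return all matches."""
    # Imported lazily so this module loads even where Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Raises selenium.common.exceptions.TimeoutException if nothing appears.
    return wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
    )
```

Data extraction can start safely on the returned elements, since they exist in the DOM by the time the call returns.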

Ander

@AnderRV_

27 Oct

New post! Web #Scraping: Intercepting XHR Requests.

Take advantage of XHR requests and scrape website content with little effort. There is no need for fickle HTML or CSS selectors, since API endpoints tend to remain stable.

zenrows.com/blog/web-scrap…

@playwrightweb and other headless browsers allow request and response interception. We can take advantage of that and inspect them easily.

Those responses usually come already formatted and structured.

We covered auction.com, twitter.com, and nseindia.com as examples, but the opportunities are infinite.

This works not just for the first load but also for any subsequent browsing. The same rules apply.
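
A sketch of response interception with Playwright's sync API in Python. It assumes the interesting responses are XHR or fetch calls that return JSON; the helper names are made up for this example, and real sites may need a narrower filter (for instance on the URL).

```python
def looks_like_api_response(resource_type: str, content_type: str) -> bool:
    """True for XHR/fetch responses that declare a JSON body."""
    return (
        resource_type in {"xhr", "fetch"}
        and "application/json" in content_type.lower()
    )


def collect_json_responses(url: str) -> list:
    """Load `url` in headless Chromium and return the JSON API payloads seen."""
    # Imported lazily so this module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    captured = []

    def handle_response(response):
        content_type = response.headers.get("content-type", "")
        if looks_like_api_response(response.request.resource_type, content_type):
            captured.append(response.json())

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", handle_response)  # fires for every response
        page.goto(url, wait_until="networkidle")
        browser.close()
    return captured
```

The handler stays attached for the lifetime of the page, so responses triggered by later clicks or scrolling would be captured the same way.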

Ander

@AnderRV_

29 Sep

New post! Blocking Resources in Playwright 🚧

Save time and money by downloading only the essential resources while web scraping or testing.

zenrows.com/blog/blocking-…

Learn how to use @playwrightweb in #python to prevent CSS files, images, or JavaScript from loading and executing.

You can use glob ("**/*.svg") or #regex (r"\.(jpg|png|svg)$").

For more general use, you can check the request's resource type, for example image:
route.request.resource_type == "image"

To be even more aggressive, allow only documents by blocking every request that matches:
route.request.resource_type != "document"

playwright.dev/python/docs/ne…
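
Putting those pieces together, a sketch with Playwright's sync API could look like this. The set of blocked resource types is an assumption chosen for the example; the regex is the one quoted above.

```python
import re

# Same pattern as in the thread: block by file extension.
BLOCKED_EXTENSIONS = re.compile(r"\.(jpg|png|svg)$")
# Assumed selection of Playwright resource types to drop.
BLOCKED_TYPES = {"image", "stylesheet", "font", "media"}


def should_block(url: str, resource_type: str) -> bool:
    """Decide whether a request is non-essential and can be aborted."""
    return resource_type in BLOCKED_TYPES or bool(BLOCKED_EXTENSIONS.search(url))


def handle_route(route) -> None:
    """Abort blocked requests and let everything else through."""
    if should_block(route.request.url, route.request.resource_type):
        route.abort()
    else:
        route.continue_()


def fetch_html_without_extras(url: str) -> str:
    """Load `url` with heavy resources blocked and return the page HTML."""
    # Imported lazily so this module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.route("**/*", handle_route)  # intercept every request
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```

For the "documents only" variant, `should_block` would simply return `resource_type != "document"`.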

Ander

@AnderRV_

1 Sep

New post! Web Scraping with #Javascript and #NodeJs.
Learn how to build a web scraper, add anti-blocking techniques, a headless browser, and parallelize requests with a queue.
zenrows.com/blog/web-scrap…

You'll first build a web scraper using Axios and Cheerio, then switch to a headless browser, Playwright.
Start by getting the HTML and parsing the content for new links to follow, applying Cheerio and CSS selectors. Extract the content in the same way.

Follow the gathered links and start a loop that will iterate over all links we find.
To avoid problems, set a maximum limit and store a list with the already visited URLs to prevent duplicates.
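
The loop can be sketched independently of the HTTP and parsing layers. The post itself uses Axios and Cheerio in Node.js; the version below is a Python illustration in which `get_links` is a hypothetical callback standing in for "download the page and extract its links".

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_visits: int = 10,
) -> List[str]:
    """Visit pages breadth-first, at most `max_visits`, each URL only once."""
    visited = set()              # already visited URLs, to prevent duplicates
    order: List[str] = []        # visit order, kept for inspection
    queue = deque([start_url])   # links waiting to be followed

    while queue and len(visited) < max_visits:
        url = queue.popleft()
        if url in visited:
            continue             # queued more than once before being visited
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order
```

With a small in-memory link graph, `crawl("a", lambda u: graph.get(u, []))` walks every reachable page once, and lowering `max_visits` cuts the crawl short.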

Ander

@AnderRV_

29 Jul

New post! Stealth Web Scraping in #Python: Avoid Blocking Like a Ninja. We share the best techniques for massive scale scraping.
From the basics, such as avoiding rate limits or adding proxies, to more complex ones, such as a full set of headers or behavioral patterns.
zenrows.com/blog/stealth-w…

For basic defensive protections, rotating proxies with the correct headers should be enough.

For slightly more complex ones, residential IPs may be necessary.

Captchas can be solved nowadays, but it is best to bypass them. The same applies to login or paywalls.

Sending real-world User-Agents is important, but not enough, since other headers are involved, e.g. sec-ch-ua or sec-fetch-dest.

They should all be used together to avoid suspicion.
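
To make "used together" concrete, here is a sketch of one consistent header set, modeled on what a Chrome 96 browser on Windows might send. The exact values are illustrative assumptions; copy them from a real browser session rather than from here. The point is that the User-Agent and the sec-ch-ua values describe the same browser.

```python
import urllib.request

# Illustrative header set; every value should describe the same browser.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
}


def build_browser_like_request(url: str) -> urllib.request.Request:
    """Create a request carrying the full, mutually consistent header set."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)
```

A Firefox User-Agent combined with these Chromium-only sec-ch-ua headers would be exactly the kind of mismatch that raises suspicion.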
