More from @AnderRV_

Ander

@AnderRV_

29 Sep

New post! Blocking Resources in Playwright 🚧

Save time and money by downloading only the essential resources while web scraping or testing.

zenrows.com/blog/blocking-…

@playwrightweb

Learn how to use @playwrightweb in #python to avoid CSS files, images, or Javascript from loading and executing.

You can use glob ("**/*.svg") or #regex (r"\.(jpg|png|svg)$").

For more general use, you can access requests resource type, for example image:
route.request.resource_type == "image"

And being even more aggressive, allow only documents:
route.request.resource_type != "document"

playwright.dev/python/docs/ne…

Read 7 tweets

Ander

@AnderRV_

1 Sep

New post! Web Scraping with #Javascript and #NodeJs.
Learn how to build a web scraper, add anti-blocking techniques, a headless browser, and parallelize requests with a queue.
zenrows.com/blog/web-scrap…

You'll first build a web scraper using Axios and Cheerio, then a headless browser - Playwright.
Start by getting the HTML and parsing the content for new links to follow, applying Cheerio and CSS Selectors. Extract also content similarly.

Follow the gathered links and start a loop that will iterate over all links we find.
To avoid problems, set a maximum limit and store a list with the already visited URLs to prevent duplicates.

Read 7 tweets

Ander

@AnderRV_

29 Jul

New post! Stealth Web Scraping in #Python: Avoid Blocking Like a Ninja. We share the best techniques for massive scale scraping.
From the basic, such as avoiding rate limits or adding proxies, to more complex as full set of headers or behavioral patterns.
zenrows.com/blog/stealth-w…

For the basic defensive protections, rotating proxies with the correct headers should be enough.

For a bit more complex ones, maybe residential IPs are necessary.

Captchas can be solved nowadays, but it is best to bypass them. The same applies to login or paywalls.

Sending real-world User-Agents is important. But not enough, since there are other headers involved, i.e. sec-ch-ua or sec-fetch-dest.

They all should be used together, to avoid suspicion.

Read 8 tweets

Ander

@AnderRV_

8 Jun

I've been busy this past months with a new project and here is one of the results: I published my first ever public blog post.
zenrows.com/blog/collectin…

We collected data from almost 3000 houses in Bilbao and used a heatmap to show the density by price per m2.
The data proceeds directly from a well-known real estate website, and we obtained it using ZenRows Tasks. Which is the project I've been working on app.zenrows.com/register?task=…

The data in the demo is incomplete to reduce its size, so we will published an example dataset here github.com/ZenRows/house-…

Read 4 tweets

Share this page!

Ander

Try unrolling a thread yourself!

More from @AnderRV_

Ander

Ander

Ander

Ander

Did Thread Reader help you today?

Like this author's thread?