Ander

1 Sep, 7 tweets, 2 min read

New post! Web Scraping with #Javascript and #NodeJs.
Learn how to build a web scraper, add anti-blocking techniques, a headless browser, and parallelize requests with a queue.
zenrows.com/blog/web-scrap…

You'll first build a web scraper using Axios and Cheerio, then add a headless browser, Playwright.
Start by getting the HTML and parsing it for new links to follow, using Cheerio and CSS selectors. Extract the content you need the same way.

Follow the gathered links and start a loop that will iterate over all links we find.
To avoid problems, set a maximum limit and store a list with the already visited URLs to prevent duplicates.
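
A minimal sketch of that loop, assuming a hypothetical `fetchLinks(url)` function (not from the post) that resolves to the links found on a page. The `maxVisits` default is arbitrary.

```javascript
// Sequential crawl loop with a visit cap and de-duplication.
// `fetchLinks` is a hypothetical async function: url -> array of URLs found on that page.
async function crawl(startUrl, fetchLinks, maxVisits = 20) {
  const visited = new Set();  // URLs already processed
  const toVisit = [startUrl]; // URLs discovered but not yet processed

  while (toVisit.length > 0 && visited.size < maxVisits) {
    const url = toVisit.shift();
    if (visited.has(url)) continue; // skip duplicates
    visited.add(url);

    const links = await fetchLinks(url);
    for (const link of links) {
      if (!visited.has(link)) toVisit.push(link);
    }
  }
  return [...visited];
}
```

The `Set` keeps the duplicate check cheap, and the cap guarantees the loop ends even on sites that keep producing new links.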

Next, a couple of techniques to avoid blocks, captchas, and so on: adding proxies and full-set headers.
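
As a rough illustration, the two techniques can be combined in one Axios-style request config. The header values and proxy addresses below are placeholders, not values from the post, and `buildRequestConfig` is a hypothetical helper name.

```javascript
// Illustrative (partial) browser-like header set; copy real values from your own browser.
const browserHeaders = {
  "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
  "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "accept-language": "en-US,en;q=0.9",
  "upgrade-insecure-requests": "1",
};

// Placeholder proxies (documentation-only IP ranges); replace with real ones.
const proxies = [
  { protocol: "http", host: "203.0.113.10", port: 8080 },
  { protocol: "http", host: "198.51.100.20", port: 3128 },
];

// Build a config object with the headers and one proxy picked at random.
// `pick` is injectable so the choice can be made deterministic.
function buildRequestConfig(proxyList = proxies, pick = Math.random) {
  const proxy = proxyList[Math.floor(pick() * proxyList.length)];
  return { headers: { ...browserHeaders }, proxy };
}
```

The result is meant to be passed as the second argument of `axios.get(url, config)`, which accepts `headers` and `proxy` options in this shape.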

As said earlier, we can move from Axios to Playwright to load Javascript or async content.
We can do that easily with a function for each and a boolean to switch them.
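
One way that switch might look, assuming the `axios` and `playwright` packages are installed and a CommonJS setup. The function names are made up for this sketch.

```javascript
// Plain HTTP fetch; needs the `axios` package.
async function getHtmlAxios(url) {
  const axios = require("axios");
  const { data } = await axios.get(url);
  return data;
}

// Headless browser fetch, for pages that render content with JavaScript;
// needs the `playwright` package.
async function getHtmlPlaywright(url) {
  const playwright = require("playwright");
  const browser = await playwright.chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.content();
  } finally {
    await browser.close(); // always release the browser
  }
}

// The boolean decides which implementation the crawler will call.
function selectGetHtml(useHeadless) {
  return useHeadless ? getHtmlPlaywright : getHtmlAxios;
}
```

Since both functions take a URL and resolve to an HTML string, the rest of the crawler does not need to know which one is in use.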

And lastly, we'll tap the full potential of JavaScript: parallel calls.
By creating a queue, we can start crawling every URL we find without needing the previous one to finish.

Parallelization has its own problems, the main one being too many calls running at once. To avoid that, we set a concurrency limit of four:
no more than four calls will be running at any given time.
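
A hand-rolled sketch of such a queue, not necessarily the implementation used in the post. `createQueue`, `enqueue` and `onIdle` are names chosen for this example.

```javascript
// Minimal task queue: runs at most `concurrency` async tasks at the same time.
function createQueue(concurrency = 4) {
  const pending = [];     // tasks waiting for a free slot
  const idleWaiters = []; // resolvers for onIdle() promises
  let running = 0;        // tasks currently in flight

  const next = () => {
    // Start tasks while there is spare capacity.
    while (running < concurrency && pending.length > 0) {
      const task = pending.shift();
      running++;
      Promise.resolve()
        .then(task)
        .catch((err) => console.error("task failed:", err))
        .finally(() => {
          running--;
          next(); // a slot is free, try to start another task
        });
    }
    // Nothing running and nothing waiting: wake up anyone awaiting onIdle().
    if (running === 0 && pending.length === 0) {
      idleWaiters.splice(0).forEach((resolve) => resolve());
    }
  };

  return {
    enqueue(task) {
      pending.push(task);
      next();
    },
    onIdle() {
      return new Promise((resolve) => {
        idleWaiters.push(resolve);
        next();
      });
    },
  };
}
```

A crawler would call `enqueue` for each URL it discovers, including from inside a running task, and await `onIdle()` to know when the whole crawl has finished.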

More from @AnderRV_

Ander

@AnderRV_

29 Sep

New post! Blocking Resources in Playwright 🚧

Save time and money by downloading only the essential resources while web scraping or testing.

zenrows.com/blog/blocking-…

@playwrightweb

Learn how to use @playwrightweb in #python to prevent CSS files, images, or Javascript from loading and executing.

You can use glob ("**/*.svg") or #regex (r"\.(jpg|png|svg)$").

For more general use, you can check the request's resource type, for example image:
route.request.resource_type == "image"

To be even more aggressive, allow only documents by blocking every request where:
route.request.resource_type != "document"
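
The thread above uses Python. For readers of the Node.js post, here is a rough equivalent with Playwright's JavaScript API, where `request()` and `resourceType()` are methods. The set of blocked types and the handler names are assumptions made for this sketch.

```javascript
// Resource types to drop; an illustrative selection.
const blockedTypes = new Set(["image", "stylesheet", "font", "media"]);

// Route handler: abort requests for blocked resource types, let the rest through.
function blockByType(route) {
  const type = route.request().resourceType();
  return blockedTypes.has(type) ? route.abort() : route.continue();
}

// More aggressive handler: abort everything that is not the document itself.
function allowOnlyDocuments(route) {
  return route.request().resourceType() !== "document"
    ? route.abort()
    : route.continue();
}
```

Either handler would be registered with `await page.route("**/*", handler)` before navigating.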

playwright.dev/python/docs/ne…

Ander

@AnderRV_

29 Jul

New post! Stealth Web Scraping in #Python: Avoid Blocking Like a Ninja. We share the best techniques for massive scale scraping.
From the basics, such as avoiding rate limits or adding proxies, to more complex ones, such as a full set of headers or behavioral patterns.
zenrows.com/blog/stealth-w…

For the basic defensive protections, rotating proxies with the correct headers should be enough.

For slightly more complex ones, residential IPs may be necessary.

Captchas can be solved nowadays, but it is best to bypass them. The same applies to login or paywalls.

Sending real-world User-Agents is important, but not enough, since there are other headers involved, e.g. sec-ch-ua or sec-fetch-dest.

They all should be used together, to avoid suspicion.

Ander

@AnderRV_

8 Jun

I've been busy these past months with a new project, and here is one of the results: I published my first ever public blog post.
zenrows.com/blog/collectin…

We collected data from almost 3000 houses in Bilbao and used a heatmap to show the density by price per m2.
The data comes directly from a well-known real estate website, and we obtained it using ZenRows Tasks, which is the project I've been working on: app.zenrows.com/register?task=…

The data in the demo is incomplete to reduce its size, so we published an example dataset here: github.com/ZenRows/house-…
