Ander
29 Jul, 8 tweets, 2 min read
New post! Stealth Web Scraping in #Python: Avoid Blocking Like a Ninja. We share the best techniques for scraping at massive scale.
From the basics, such as avoiding rate limits or adding proxies, to more complex ones, such as sending a full set of headers or mimicking behavioral patterns.
zenrows.com/blog/stealth-w…
For basic defensive protections, rotating proxies with the correct headers should be enough.
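A minimal sketch of proxy rotation with a fixed header set, assuming the third-party `requests` library is used for the actual call. The proxy URLs and header values are hypothetical placeholders, not taken from the post.

```python
import itertools

# Hypothetical proxy endpoints; replace with proxies you actually control.
PROXIES = [
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
]

# Illustrative header values (assumption); copy real ones from your browser.
BASE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

# Round-robin iterator: each call to next() yields the following proxy.
_proxy_cycle = itertools.cycle(PROXIES)


def next_request_options():
    """Return keyword arguments for the next request, rotating the proxy."""
    proxy = next(_proxy_cycle)
    return {
        "headers": dict(BASE_HEADERS),
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }
```

With `requests` installed, a call could look like `requests.get(url, **next_request_options())`.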

For slightly more complex ones, residential IPs may be necessary.

Captchas can be solved nowadays, but it is best to bypass them. The same applies to login walls and paywalls.
Sending real-world User-Agents is important, but it is not enough, since other headers are involved, e.g. sec-ch-ua or sec-fetch-dest.

They should all be used together to avoid suspicion.
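As a rough illustration of "used together", the sketch below keeps a header set in one place and checks that the Chrome major version in the User-Agent agrees with the one in sec-ch-ua. The values are illustrative, modeled on a desktop Chrome 92 request, and `headers_are_consistent` is a hypothetical helper.

```python
import re

# Illustrative values (assumption); copy the real set from your own
# browser's network tab rather than trusting these.
CHROME_HEADERS = {
    "user-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
    ),
    "sec-ch-ua": '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    "sec-ch-ua-mobile": "?0",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "accept-language": "en-US,en;q=0.9",
}


def headers_are_consistent(headers):
    """Check that User-Agent and sec-ch-ua report the same Chrome major version."""
    ua_match = re.search(r"Chrome/(\d+)\.", headers.get("user-agent", ""))
    hint_match = re.search(r'"Google Chrome";v="(\d+)"', headers.get("sec-ch-ua", ""))
    if not ua_match or not hint_match:
        return False
    return ua_match.group(1) == hint_match.group(1)
```

A mismatch, such as a Chrome 91 User-Agent next to a version 92 sec-ch-ua, is the kind of inconsistency that can raise suspicion.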
When using headless browsers, headers are important, but be careful with JavaScript checks. If your User-Agent header does not match `navigator.userAgent`, you can be tagged as a potential bot, or banned outright, depending on the site's criteria.
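The check can be pictured from the defender's side. This is a hypothetical sketch, not any vendor's actual logic: it compares the User-Agent header with the value read from `navigator.userAgent` and also looks for the `HeadlessChrome` token that headless Chrome includes in its default User-Agent.

```python
def user_agent_red_flags(header_ua, navigator_ua):
    """Return a list of reasons a (hypothetical) defender might flag a visit."""
    flags = []
    # The HTTP header and the JavaScript-visible value should be identical.
    if header_ua != navigator_ua:
        flags.append("header/navigator mismatch")
    # Headless Chrome advertises itself unless the User-Agent is overridden.
    if "HeadlessChrome" in navigator_ua:
        flags.append("headless token in navigator.userAgent")
    return flags
```

Running the same comparison against your own scraper before going live (for example, reading `navigator.userAgent` through the browser automation tool) is a cheap way to catch the mismatch early.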
Against state-of-the-art defenses, you should even take behavioral patterns into account. Real users do not browse through an entire page in less than a second, or fill in a form in an instant.
Moving the mouse, scrolling at a human pace, clicking links, and several other actions can confuse defensive systems.
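One way to approximate a human scrolling pace is to plan irregular scroll steps with pauses between them. The sketch below only builds the plan. The step sizes, pause range, and the `scroll_plan` name are assumptions chosen for illustration.

```python
import random


def scroll_plan(page_height, viewport_height=800, rng=None):
    """Return (pixels, pause_seconds) steps that scroll to the page bottom."""
    rng = rng or random.Random()
    plan = []
    position = 0
    # Nothing to scroll if the page fits in the viewport.
    max_position = max(page_height - viewport_height, 0)
    while position < max_position:
        # Uneven step sizes, capped so we never scroll past the bottom.
        step = min(rng.randint(200, 600), max_position - position)
        # Uneven pauses, so steps do not fire at a fixed rhythm.
        pause = rng.uniform(0.3, 1.5)
        plan.append((step, pause))
        position += step
    return plan
```

Each step would then be replayed in the headless browser, scrolling by `pixels` and sleeping for `pause_seconds` before the next one.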
But repetitive behavior can also be tracked. Visiting the same pages in the same order, with exactly the same actions, can also be flagged.

Shuffle some URLs, change the order, modify scrolling patterns, or add some delays.
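A small sketch of the shuffling and delay ideas, using only the standard library. The delay bounds and the `randomized_schedule` name are assumptions; tune them to the target site.

```python
import random


def randomized_schedule(urls, min_delay=1.0, max_delay=5.0, rng=None):
    """Return (url, delay_seconds) pairs in a shuffled order."""
    rng = rng or random.Random()
    shuffled = list(urls)  # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    # Draw an independent delay for each visit instead of a fixed interval.
    return [(url, rng.uniform(min_delay, max_delay)) for url in shuffled]
```

The scraper would then sleep for `delay_seconds` before requesting each `url`, so that neither the visiting order nor the timing repeats between runs.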
These actions are not enough on their own, but a combination of them can be next to impossible to detect.
Sometimes implementing all of these is overkill; it all depends on the defenses on the other side.
And above all, be a good internet citizen. Do not use this to perform malicious actions.

More from @AnderRV_

29 Sep
New post! Blocking Resources in Playwright 🚧

Save time and money by downloading only the essential resources while web scraping or testing.

zenrows.com/blog/blocking-…
Learn how to use @playwrightweb in #python to prevent CSS files, images, or JavaScript from loading and executing.

You can use glob ("**/*.svg") or #regex (r"\.(jpg|png|svg)$").
For more general use, you can check the request's resource type, for example to block images:
route.request.resource_type == "image"

Or, being even more aggressive, allow only documents by blocking everything that matches:
route.request.resource_type != "document"
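The two conditions above can be combined into a route handler. This is a sketch assuming Playwright's sync API, where the handler receives a `route` object exposing `request`, `abort()`, and `continue_()`. The `should_abort` and `handle_route` names and the `documents_only` flag are hypothetical.

```python
import re

# Same regex as in the tweet: URLs ending in common image extensions.
IMAGE_URL = re.compile(r"\.(jpg|png|svg)$")


def should_abort(resource_type, url, documents_only=False):
    """Decide whether a request should be blocked."""
    if documents_only:
        # Aggressive mode: block everything that is not the HTML document.
        return resource_type != "document"
    # Default mode: block images by resource type or by URL extension.
    return resource_type == "image" or bool(IMAGE_URL.search(url))


def handle_route(route, documents_only=False):
    """Route handler: abort blocked requests, let the rest through."""
    request = route.request
    if should_abort(request.resource_type, request.url, documents_only):
        route.abort()
    else:
        route.continue_()
```

With Playwright installed, it could be registered with `page.route("**/*", handle_route)` before navigating.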

playwright.dev/python/docs/ne…
1 Sep
New post! Web Scraping with #Javascript and #NodeJs.
Learn how to build a web scraper, add anti-blocking techniques, a headless browser, and parallelize requests with a queue.
zenrows.com/blog/web-scrap…
You'll first build a web scraper using Axios and Cheerio, then add a headless browser, Playwright.
Start by getting the HTML and parsing the content for new links to follow, applying Cheerio and CSS selectors. Extract the content in a similar way.
Follow the gathered links and start a loop that will iterate over all links we find.
To avoid problems, set a maximum limit and store a list with the already visited URLs to prevent duplicates.
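The loop described above, with a visit limit and a set of already visited URLs, can be sketched as follows. The post itself uses Node.js with Axios and Cheerio; this version is in Python and takes a hypothetical `get_links` callback in place of the fetch-and-parse step.

```python
from collections import deque


def crawl(start_url, get_links, max_visits=100):
    """Visit pages breadth-first, skipping duplicates, up to max_visits."""
    visited = set()
    queue = deque([start_url])
    order = []
    while queue and len(visited) < max_visits:
        url = queue.popleft()
        if url in visited:
            continue  # already handled; prevents duplicate visits
        visited.add(url)
        order.append(url)
        # get_links stands in for "download the page and extract its links".
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order
```

In a real scraper, `get_links` would fetch the page and return the URLs found by the CSS selectors.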
8 Jun
I've been busy these past months with a new project, and here is one of the results: I published my first-ever public blog post.
zenrows.com/blog/collectin…
We collected data from almost 3000 houses in Bilbao and used a heatmap to show the density by price per m2.
The data comes directly from a well-known real estate website, and we obtained it using ZenRows Tasks, which is the project I've been working on app.zenrows.com/register?task=…
The data in the demo is incomplete to reduce its size, so we will publish an example dataset here github.com/ZenRows/house-…
