Learn how to create better web scrapers by following best practices and avoiding common mistakes. Choose the right approach for the job thanks to these tips.
1. DO Rotate IPs
The most common anti-scraping solution is to ban by IP. By using proxies, you'll avoid showing the same IP in every request and thus increase your chances of success. You can code your own or use a Rotating Proxy like zenrows.com.
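A minimal sketch of picking a different proxy per request, assuming a small hand-maintained pool. The addresses below are placeholders from a documentation IP range, and `pick_proxy` is a hypothetical helper name, not part of any library.

```python
import random

# Hypothetical proxy pool (placeholder addresses); replace with proxies you run or rent.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def pick_proxy(pool=PROXIES):
    """Choose one proxy at random and return it for both HTTP and HTTPS traffic."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}


proxies = pick_proxy()
print(proxies)
```

A mapping in this shape can be handed to an HTTP client that accepts per-scheme proxies, for example the `proxies` argument of the third-party requests library.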
2. DO Use Custom User-Agent
Overwrite your client's User-Agent, or you'll risk sending something like "curl/7.74.0".
But always sending the same UA might be suspicious, too, so you need a large, up-to-date list to rotate through.
3. DO Research Target Content
Look at the source before coding. Many sites expose data through metadata, attributes, or hidden inputs. These options go beyond CSS selectors and are usually more maintainable. Also look for XHR requests.
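As an illustration, here is a small sketch that reads meta tags and hidden inputs with Python's standard `html.parser`. The sample HTML and the `HiddenDataParser` class are made up for the example; a real page will need its own inspection first.

```python
from html.parser import HTMLParser


class HiddenDataParser(HTMLParser):
    """Collect <meta> content and hidden <input> values while parsing."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.hidden_inputs = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph tags use "property"; classic meta tags use "name".
            key = attrs.get("property") or attrs.get("name")
            if key and "content" in attrs:
                self.meta[key] = attrs["content"]
        elif tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.hidden_inputs[attrs["name"]] = attrs.get("value", "")


# Made-up page used only for this example.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Blue Widget">
<meta name="description" content="A sturdy widget">
</head><body>
<form><input type="hidden" name="product_id" value="12345"></form>
</body></html>
"""

parser = HiddenDataParser()
parser.feed(SAMPLE_HTML)
print(parser.meta)
print(parser.hidden_inputs)
```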
4. DO Parallelize Requests
To scale up, sequential requests will not be enough.
The next step is to fetch several URLs simultaneously and keep a queue where new items can be added to be processed later.
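One possible shape for this, sketched with `asyncio`: a queue of URLs and a few workers consuming it concurrently. The `fetch` function is a stand-in that sleeps instead of calling the network, and the URLs and the concurrency of 3 are assumptions for the example.

```python
import asyncio


async def fetch(url):
    """Stand-in for a real HTTP request (hypothetical); no network access."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def worker(queue, results):
    """Take URLs from the queue until cancelled, storing each response."""
    while True:
        url = await queue.get()
        try:
            results[url] = await fetch(url)
        finally:
            queue.task_done()


async def crawl(urls, concurrency=3):
    queue = asyncio.Queue()
    results = {}
    for url in urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


urls = [f"https://example.com/page/{i}" for i in range(10)]
results = asyncio.run(crawl(urls))
print(len(results))
```

Because workers read from a queue, a worker could also call `queue.put_nowait` with newly discovered links, which is what makes the queue useful beyond a fixed list.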
5. DON'T Use Headless Browsers for Everything
They are great but not a silver bullet. They bring a resource overhead and slow down the scraping process.
Try plain HTML first and only move to headless browsers if necessary.
6. DON'T Couple Code to Target
Apply good software engineering practices.
Separate general crawling code (get HTML, parse it, enqueue links) from the target-specific (selectors, scraped data, URL structure).
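A rough sketch of that separation, assuming a hypothetical `Target` dataclass that holds everything site-specific while `process_page` stays generic. The regexes and the sample HTML are illustrative only; a real project would likely use a proper HTML parser.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Target:
    """Everything that is specific to one site lives here."""
    name: str
    start_url: str
    extract_links: Callable[[str], list]
    extract_data: Callable[[str], dict]


def process_page(html, target):
    """Generic step: knows nothing about selectors or URL structure."""
    return target.extract_data(html), target.extract_links(html)


# Target-specific code for a made-up shop.
def shop_links(html):
    return re.findall(r'href="(/product/[^"]+)"', html)


def shop_data(html):
    match = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": match.group(1) if match else None}


shop = Target("example-shop", "https://example.com/", shop_links, shop_data)
data, links = process_page('<h1>Blue Widget</h1><a href="/product/42">next</a>', shop)
print(data, links)
```

Supporting a second site then means writing another `Target`, not touching the crawling code.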
7. DON'T Take Down your Target Site
Be mindful of the scale of your scraping and the size of your targets. Scraping Amazon is not the same as scraping a small store.
Two main rules: do not scrape pages disallowed in robots.txt, and obey Crawl-Delay.
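Both rules can be checked with Python's standard `urllib.robotparser`. This sketch parses an inline, made-up robots.txt so it runs without network access; the user agent name `my-scraper` is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-scraper"  # hypothetical name for this example

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

allowed = robots.can_fetch(USER_AGENT, "https://example.com/products/1")
blocked = robots.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = robots.crawl_delay(USER_AGENT)  # seconds to wait between requests, or None

print(allowed, blocked, delay)
```

In a crawl loop, the idea would be to skip any URL where `can_fetch` is false and to sleep for `delay` seconds between requests to the same host.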
8. DON'T Mix Headers from Different Browsers
Browsers send several headers with a set format that varies from version to version, and modern anti-bot solutions check them.
The only solution is to keep large lists of complete header sets, not just the UA.
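A minimal sketch of rotating whole header sets instead of a lone User-Agent. The values below are illustrative examples of what a Chrome and a Firefox build might send and are not guaranteed to be exact or current; in practice they should be captured from real browsers. `pick_headers` is a hypothetical helper.

```python
import random

# Illustrative header sets; each one must come from a single real browser build.
HEADER_SETS = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"',
    },
    {
        # Firefox does not send sec-ch-ua headers, so they are absent here.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) "
                      "Gecko/20100101 Firefox/95.0",
        "Accept-Language": "en-US,en;q=0.5",
    },
]


def pick_headers(pool=HEADER_SETS):
    """Return a copy of one complete set, never a mix of several."""
    return dict(random.choice(pool))


headers = pick_headers()
print(headers["User-Agent"])
```

The point of choosing a whole dictionary is that a Chrome User-Agent never ends up next to Firefox-style headers, or the other way around.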
Rotating IPs and good headers will allow you to scrape most websites. Use headless browsers only when necessary and apply good software engineering practices.
You probably don't need 100% accuracy; settle for the level that works for you. Overengineering can backfire.
Published December 21, the last one of the year, and straight to #3! No way we could have seen it coming.
SEO wasn't relevant here; there was no time for it to work. #GoogleDiscover launched us there on our official blog. Even on the other sites where we publish, it has great numbers.
Probably the longest post, and the one with the most code in the whole lot. Published on September 1, with Google (through SEO) as its primary traffic source. Many interesting keywords in high positions.
Learn how to navigate and scrape websites using @SeleniumHQ in #Python, even dynamic content, thanks to JavaScript rendering and other available features.
After installing and launching the basics, we will start selecting elements in several ways. That will allow us to interact with them to click or fill in forms.
For the particular case of infinite scroll or lazy loaded content, we'll take advantage of sending keyboard inputs.
Another typical use case is to wait for an element to be present. That might happen in dynamic pages loaded via XHR. The initial HTML is just a skeleton with no actual content.
We can wait for the content to be present before starting the data extraction.
New post! Web #Scraping: Intercepting XHR Requests.
Take advantage of XHR requests and scrape website content with little effort. No need for brittle HTML or CSS selectors; API endpoints tend to remain stable.
New post! Web Scraping with #Javascript and #NodeJs.
Learn how to build a web scraper, add anti-blocking techniques, a headless browser, and parallelize requests with a queue. zenrows.com/blog/web-scrap…
You'll first build a web scraper using Axios and Cheerio, then add a headless browser, Playwright.
Start by getting the HTML and parsing the content for new links to follow, applying Cheerio and CSS Selectors. Extract content in the same way.
Follow the gathered links and start a loop that will iterate over all links we find.
To avoid problems, set a maximum limit and store a list with the already visited URLs to prevent duplicates.
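The loop with a limit and a visited set can be sketched as follows. It is written in Python here rather than Node.js, and it walks a made-up in-memory link graph (`FAKE_SITE`) instead of real pages; `get_links` stands in for fetching a page and extracting its links.

```python
from collections import deque

# Made-up link graph standing in for a real website.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/", "/d"],
    "/c": [],
    "/d": ["/e"],
    "/e": [],
}


def get_links(url):
    """Stand-in for fetching a page and extracting its links (hypothetical)."""
    return FAKE_SITE.get(url, [])


def crawl(start, max_visits=4):
    visited = set()
    to_visit = deque([start])
    # Stop when the queue is empty or the maximum number of pages is reached.
    while to_visit and len(visited) < max_visits:
        url = to_visit.popleft()
        if url in visited:
            continue  # already processed, skip the duplicate
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                to_visit.append(link)
    return visited


visited = crawl("/")
print(sorted(visited))
```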
New post! Stealth Web Scraping in #Python: Avoid Blocking Like a Ninja. We share the best techniques for massive scale scraping.
From the basics, such as avoiding rate limits or adding proxies, to more complex ones, such as full sets of headers or behavioral patterns. zenrows.com/blog/stealth-w…
For the basic defensive protections, rotating proxies with the correct headers should be enough.
For slightly more complex ones, residential IPs may be necessary.
Captchas can be solved nowadays, but it is best to bypass them. The same applies to logins or paywalls.
Sending real-world User-Agents is important but not enough, since other headers are involved, e.g. sec-ch-ua or sec-fetch-dest.
They should all be sent together to avoid suspicion.