Software Engineer & Co-Founder @ZenRowsHQ | Building https://t.co/IZ3mO2OKRR
I tweet about data collection, scraping, and how we are building our project 🔥
Jun 8, 2022 • 7 tweets • 2 min read
Avoid getting blocked when web scraping by rotating proxies. Learn how to build a simple but effective proxy rotator using Python.
In short: pick at random from a pool of IPs where every proxy is available and health-checked.
Start by getting a list of proxies. For the demo, we'll use free proxies from an online list.
Then, check which ones are working. We'll call a simple page that returns the caller IP in plain text. Enough to validate that the proxy is up and running.
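A minimal sketch of those two steps (health-check the list, then pick at random), using only the standard library. The proxy addresses and `CHECK_URL` are hypothetical placeholders, since the thread doesn't name the proxy list or the IP-echo page, and the `fetch` parameter is an assumption added so the check can be exercised without a network.

```python
import random
import urllib.request

# Hypothetical placeholder proxies (documentation-range IPs), not real servers.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128", "203.0.113.12:80"]

# Hypothetical endpoint: any page that returns the caller IP in plain text.
CHECK_URL = "https://example.com/ip"


def fetch_via_proxy(url, proxy, timeout=5):
    """Fetch url through an HTTP proxy and return the body as text."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode().strip()


def is_alive(proxy, fetch=fetch_via_proxy):
    """A proxy counts as healthy if the check page answers with any content."""
    try:
        return bool(fetch(CHECK_URL, proxy))
    except Exception:
        # Timeouts, refused connections and bad responses all mean "skip it".
        return False


def healthy_proxies(proxies, fetch=fetch_via_proxy):
    """Keep only the proxies that pass the health check."""
    return [proxy for proxy in proxies if is_alive(proxy, fetch)]


def pick_proxy(pool):
    """Rotate by choosing a random proxy from the healthy pool."""
    return random.choice(pool)
```

Free proxies tend to die quickly, so in practice you would probably re-run `healthy_proxies` periodically rather than once at startup.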
How to speed up web scraping? By using concurrency!
Send multiple requests simultaneously and take advantage of async programming.
Learn the basics in Python and step up your scraping.
We'll start by writing a demo scraper. Next, we'll move towards concurrency step by step.
First, build a simple script using the asyncio library. The lib offers the functionality we need, allowing us to start several tasks and wait for them to finish.
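A small sketch of that first asyncio script. To keep it self-contained, `fetch_page` is a hypothetical stand-in that sleeps instead of making a real HTTP request; the URLs and the `delay` parameter are assumptions for illustration.

```python
import asyncio
import time


async def fetch_page(url, delay):
    # Stand-in for a real request: asyncio.sleep simulates network latency
    # without blocking the other tasks.
    await asyncio.sleep(delay)
    return f"content of {url}"


async def scrape_all(urls, delay):
    # Schedule every coroutine at once and wait until all of them finish.
    # gather returns the results in the same order as the inputs.
    return await asyncio.gather(*(fetch_page(url, delay) for url in urls))


def run(urls, delay=0.2):
    """Run the scraper and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = asyncio.run(scrape_all(urls, delay))
    return results, time.perf_counter() - start
```

Because the simulated requests overlap, the elapsed time should stay close to a single `delay` rather than growing with the number of URLs.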
Learn how to build a web scraper with Python using Requests and BeautifulSoup libraries. We will cover, step-by-step, a scraping process on a job board.
1. Explore the target site before coding
2. Retrieve the content (HTML)
3. Extract the data you need with selectors
4. Transform and store it for later use
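Steps 2 to 4 above could look roughly like this with Requests and BeautifulSoup. The markup in `SAMPLE_HTML`, the CSS selectors, and the field names are hypothetical; a real job board will need selectors found during step 1.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a job board listing page.
SAMPLE_HTML = """
<div class="job-card"><h2 class="title">Python Developer</h2>
  <span class="company">Acme</span></div>
<div class="job-card"><h2 class="title">Data Engineer</h2>
  <span class="company">Globex</span></div>
"""


def fetch_html(url):
    """Step 2: retrieve the page content."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_jobs(html):
    """Step 3: extract the fields we need with CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("div.job-card"):
        jobs.append({
            "title": card.select_one("h2.title").get_text(strip=True),
            "company": card.select_one("span.company").get_text(strip=True),
        })
    return jobs


def to_csv(jobs):
    """Step 4: transform the records into CSV text, ready to be stored."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "company"])
    writer.writeheader()
    writer.writerows(jobs)
    return buffer.getvalue()
```

For a live page you would call `extract_jobs(fetch_html(url))`; parsing the inline sample first makes it easier to get the selectors right before hitting the site.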
Dec 29, 2021 • 7 tweets • 3 min read
3 days to finish the year, and I decided to do a countdown with the TOP 3 blog posts I've written in 2021. And some context. 🧵
zenrows.com/blog/dos-and-d…
Published December 21, the last one of the year, and straight to #3! No way we could have seen it coming.
SEO wasn't relevant here, and there was no time for it to work either. #GoogleDiscover launched us there on our official blog. But even on the other sites where we publish, it has great numbers.
Learn how to create better web scrapers by following best practices and avoiding common mistakes. Choose the right approach for the job thanks to these tips.
The most common anti-scraping solution is to ban by IP. By using proxies, you'll avoid showing the same IP in every request and thus increase your chances of success. You can code your own or use a Rotating Proxy like zenrows.com.
Nov 30, 2021 • 8 tweets • 2 min read
New post! Web Scraping with Selenium in Python.
Learn how to navigate and scrape websites using @SeleniumHQ in #Python, even dynamic content, thanks to JavaScript rendering and other available features.
zenrows.com/blog/web-scrap…
After installing and launching the basics, we will start selecting elements in several ways. That will allow us to interact with them to click or fill in forms.
For the particular case of infinite scroll or lazy loaded content, we'll take advantage of sending keyboard inputs.
Oct 27, 2021 • 4 tweets • 2 min read
New post! Web #Scraping: Intercepting XHR Requests.
Take advantage of XHR requests and scrape website content with far less effort. No need for fickle HTML or CSS selectors: API endpoints tend to remain stable.
You can use glob ("**/*.svg") or #regex (r"\.(jpg|png|svg)$").
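The regex above can be tried on its own, assuming it is meant to match request URLs (Playwright's `page.route`, for instance, accepts either a glob string or a compiled pattern). The function names and sample URLs below are made up for illustration.

```python
import re

# Same pattern as in the tweet: URLs ending in .jpg, .png or .svg.
IMAGE_PATTERN = re.compile(r"\.(jpg|png|svg)$")

# Hypothetical request URLs a page might trigger.
URLS = [
    "https://example.com/img/logo.svg",
    "https://example.com/photo.jpg",
    "https://example.com/api/products",
    "https://example.com/logo.png?v=2",
]


def is_image_request(url):
    """True when the URL ends with one of the image extensions."""
    return IMAGE_PATTERN.search(url) is not None


def keep_non_images(urls):
    """Filter out the requests the pattern would match."""
    return [url for url in urls if not is_image_request(url)]
```

Note that the `$` anchor means a URL with a query string, such as `logo.png?v=2`, is not matched; loosen the pattern if those should be caught too.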
Sep 1, 2021 • 7 tweets • 2 min read
New post! Web Scraping with #Javascript and #NodeJs.
Learn how to build a web scraper, add anti-blocking techniques, a headless browser, and parallelize requests with a queue. zenrows.com/blog/web-scrap…
You'll first build a web scraper using Axios and Cheerio, then a headless browser - Playwright.
Start by getting the HTML and parsing the content for new links to follow, applying Cheerio and CSS Selectors. Extract the content in a similar way.
Jul 29, 2021 • 8 tweets • 2 min read
New post! Stealth Web Scraping in #Python: Avoid Blocking Like a Ninja. We share the best techniques for massive scale scraping.
From the basics, such as avoiding rate limits or adding proxies, to more complex ones, such as a full set of headers or behavioral patterns. zenrows.com/blog/stealth-w…
For the basic defensive protections, rotating proxies with the correct headers should be enough.
For slightly more complex ones, residential IPs may be necessary.
Captchas can be solved nowadays, but it is best to bypass them. The same applies to login or paywalls.
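For the basic case mentioned above (rotating proxies with the correct headers), one possible sketch is to pick a proxy and a complete, coherent header set per request instead of randomizing headers one by one. The proxy addresses are hypothetical placeholders and the header values are illustrative samples, not a vetted list.

```python
import random

# Hypothetical placeholder proxies (documentation-range IPs).
PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128"]

# Each entry is a full set that belongs together, so the User-Agent and the
# other headers describe the same browser. Values are illustrative samples.
HEADER_SETS = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/101.0.4951.54 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) "
                      "Gecko/20100101 Firefox/100.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    },
]


def request_settings():
    """Pick one proxy and one whole header set for the next request."""
    return {
        "proxy": random.choice(PROXIES),
        "headers": dict(random.choice(HEADER_SETS)),
    }
```

With Requests, the result could be passed along as `requests.get(url, headers=settings["headers"], proxies={"http": ..., "https": ...})`, building the proxy URLs from `settings["proxy"]`.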
Jun 8, 2021 • 4 tweets • 2 min read
I've been busy these past months with a new project, and here is one of the results: I published my first ever public blog post. zenrows.com/blog/collectin…
We collected data from almost 3000 houses in Bilbao and used a heatmap to show the density by price per m2.
The data comes directly from a well-known real estate website, and we obtained it using ZenRows Tasks, which is the project I've been working on: app.zenrows.com/register?task=…