Ander
Jan 19 8 tweets 4 min read
New post! Web Scraping with #Python 101

Learn how to build a web scraper with Python using the Requests and BeautifulSoup libraries. We will walk, step by step, through scraping a job board.

zenrows.com/blog/web-scrap…
#WebScraping might be divided into four main steps:

1. Explore the target site before coding
2. Retrieve the content (HTML)
3. Extract the data you need with selectors
4. Transform and store it for later use
1/ Understand the page you're trying to scrape, not just the content but the structure.

Use DevTools to inspect its content: what can be scraped, and how is it displayed on the page?
2/ Obtain the HTML, in our example with Requests. We will focus on static content for simplicity.

Blocking problems might first appear in this step. The first thing to try is a smart rotating proxy such as zenrows.com.
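A minimal sketch of this step, assuming the Requests library is installed. The helper name fetch_html, the timeout value, and the optional session parameter are illustrative assumptions, not taken from the blog post.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Download a static page and return its HTML as text."""
    # Use the given session if there is one, else the module-level API.
    http = session or requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx so blocks show up early.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://remotive.io") would return the page's HTML as a string, provided the site answers with a successful status.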
3/ Extract the data you are looking for.

Using BeautifulSoup, a library "for pulling data out of HTML," you can select what pieces of information you want to get.

You can use CSS selectors or other solutions like metadata or rich snippets.
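As a rough illustration of CSS selectors in BeautifulSoup, here is a sketch that runs on a made-up HTML snippet. The markup, the class names (job, title, company), and the function name extract_jobs are assumptions for the example; the real job board will need its own selectors, found with DevTools.

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for a job board listing.
SAMPLE_HTML = """
<ul>
  <li class="job"><a class="title" href="/jobs/1">Backend Developer</a>
      <span class="company">Acme</span></li>
  <li class="job"><a class="title" href="/jobs/2">Data Engineer</a>
      <span class="company">Globex</span></li>
</ul>
"""


def extract_jobs(html):
    """Return one dict per job entry found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    # select() takes a CSS selector and returns every matching element.
    for item in soup.select("li.job"):
        link = item.select_one("a.title")
        company = item.select_one("span.company")
        jobs.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "company": company.get_text(strip=True),
        })
    return jobs
```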
4/ Store extracted data in a practical format for your use case, for example, CSV files.

We will use the pandas library to easily convert an array of results into a file that anyone can open in Excel.
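A small sketch of the storage step, assuming pandas is installed and that the results are a list of dictionaries. The function name save_jobs and the column names in the usage note are hypothetical.

```python
import pandas as pd


def save_jobs(jobs, path):
    """Write a list of dicts to a CSV file, one row per dict."""
    df = pd.DataFrame(jobs)
    # index=False leaves out the numeric row index column.
    df.to_csv(path, index=False)
    return df
```

For example, save_jobs([{"title": "Backend Developer", "company": "Acme"}], "jobs.csv") would write a header line and one data row.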
This is just the tip of the iceberg, and there are other things to take into account when building a robust solution:

- Error handling
- Scaling up (more pages, different domains)
- Anti-bot software
Visit the blog post for more details with code examples.

We scrape remotive.io as an example and store some basic info in a CSV file.

zenrows.com/blog/web-scrap…

More from @AnderRV_

Dec 29, 2021
3 days to finish the year, and I decided to do a countdown with the TOP 3 blog posts I've written in 2021. And some context. 🧵

Direct to #3: DOs and DON'Ts of Web #Scraping

zenrows.com/blog/dos-and-d…
Published December 21, the last one of the year, and straight to #3! No way we could have seen it coming.

SEO wasn't a factor here; there was no time for it to work. #GoogleDiscover pushed it up there on our official blog, and it has great numbers on the other sites where we publish, too.
Coming in at #2 is Web Scraping with #Javascript and #NodeJS

Probably the longest post, and the one with the most code of the whole lot. Published on September 1, its primary traffic source is Google through SEO, with many interesting keywords in high positions.

zenrows.com/blog/web-scrap…
Dec 21, 2021
New post! DOs and DON'Ts of Web #Scraping

Learn how to create better web scrapers by following best practices and avoiding common mistakes. Choose the right approach for the job thanks to these tips.

zenrows.com/blog/dos-and-d…
1. DO Rotate IPs

The most common anti-scraping measure is banning by IP. By using proxies, you'll avoid showing the same IP in every request and thus increase your chances of success. You can code your own or use a Rotating Proxy like zenrows.com.
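A hand-rolled version of IP rotation might look like the sketch below, assuming the Requests library. The proxy addresses are placeholders from a documentation-only IP range, and the names PROXIES, pick_proxy, and fetch_via_proxy are made up for the example.

```python
import random

import requests

# Placeholder proxies (assumption): swap in proxies you actually control or rent.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]


def pick_proxy(proxies=PROXIES):
    """Choose one proxy at random, in the mapping format Requests expects."""
    proxy = random.choice(proxies)
    return {"http": proxy, "https": proxy}


def fetch_via_proxy(url, timeout=10.0):
    """Fetch a URL through a randomly chosen proxy."""
    response = requests.get(url, proxies=pick_proxy(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

Random choice is the simplest policy; a real rotation would likely also drop proxies that fail or get banned.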
2. DO Use Custom User-Agent

Overwrite your client's User-Agent, or you'll risk sending something like "curl/7.74.0".

But always sending the same UA might be suspicious, too, so you need a vast and updated list.
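One way to vary the User-Agent is sketched below, assuming the Requests library. The two UA strings are samples that will go stale, and the names USER_AGENTS, build_headers, and fetch_with_random_ua are hypothetical; a real list should be longer and kept up to date.

```python
import random

import requests

# Sample browser User-Agent strings (assumption): keep a longer, updated list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
]


def build_headers(user_agents=USER_AGENTS):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(user_agents)}


def fetch_with_random_ua(url, timeout=10.0):
    """Fetch a URL, overriding the client's default User-Agent."""
    response = requests.get(url, headers=build_headers(), timeout=timeout)
    response.raise_for_status()
    return response.text
```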
Nov 30, 2021
New post! Web Scraping with Selenium in Python.

Learn how to navigate and scrape websites using @SeleniumHQ in #Python, even dynamic content, thanks to Javascript Rendering and other available features.

zenrows.com/blog/web-scrap…
After installing and launching the basics, we will select elements in several ways. That will allow us to interact with them to click or fill in forms.

For the particular case of infinite scroll or lazy loaded content, we'll take advantage of sending keyboard inputs.
Another typical use case is to wait for an element to be present. That might happen in dynamic pages loaded via XHR. The initial HTML is just a skeleton with no actual content.

We can wait for the content to be present before starting the data extraction.
Oct 27, 2021
New post! Web #Scraping: Intercepting XHR Requests.

Take advantage of XHR requests and scrape website content with little effort. No need for fickle HTML or CSS selectors: API endpoints tend to remain stable.

zenrows.com/blog/web-scrap…
@playwrightweb and other headless browsers allow response/request interception. We can take advantage and inspect them easily.

Those responses usually come already formatted and structured.
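A minimal sketch of response interception, assuming Playwright for Python and its browsers are installed. The function names make_xhr_collector and collect_xhr_urls are made up for the example, and it only records URLs rather than parsing the payloads.

```python
def make_xhr_collector(store):
    """Build a response handler that appends XHR/fetch URLs to `store`."""
    def handle_response(response):
        # resource_type tells page assets apart from API-style calls.
        if response.request.resource_type in ("xhr", "fetch"):
            store.append(response.url)
    return handle_response


def collect_xhr_urls(url):
    """Open a page in headless Chromium and list the XHR/fetch URLs it calls."""
    # Imported here so the handler above can be used without a browser.
    from playwright.sync_api import sync_playwright

    captured = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", make_xhr_collector(captured))
        page.goto(url, wait_until="networkidle")
        browser.close()
    return captured
```

Once you know which endpoint carries the data, the handler could call response.json() on the matching responses to get the structured content directly.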
We covered auction.com, twitter.com, and nseindia.com as examples, but the opportunities are infinite.

And not just for the first load, but from any subsequent browsing. The same rules apply.
Sep 29, 2021
New post! Blocking Resources in Playwright 🚧

Save time and money by downloading only the essential resources while web scraping or testing.

zenrows.com/blog/blocking-…
Learn how to use @playwrightweb in #python to prevent CSS files, images, or Javascript from loading and executing.

You can use glob ("**/*.svg") or #regex (r"\.(jpg|png|svg)$").
For more general use, you can check the request's resource type, for example image:
route.request.resource_type == "image"

And, being even more aggressive, block everything that is not a document:
route.request.resource_type != "document"

playwright.dev/python/docs/ne…
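Put together, blocking by resource type might look like the sketch below, assuming Playwright for Python and its browsers are installed. The set of blocked types and the names BLOCKED_TYPES, block_unneeded, and load_without_assets are illustrative choices, not the post's exact code.

```python
# Resource types to skip (assumption): tune this set to your use case.
BLOCKED_TYPES = {"image", "stylesheet", "font", "media"}


def block_unneeded(route):
    """Abort requests for blocked resource types, let the rest through."""
    if route.request.resource_type in BLOCKED_TYPES:
        route.abort()
    else:
        route.continue_()


def load_without_assets(url):
    """Load a page in headless Chromium while skipping heavy assets."""
    # Imported here so the handler above can be used without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # "**/*" matches every request, so the handler sees them all.
        page.route("**/*", block_unneeded)
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```

For the more aggressive variant, the condition in block_unneeded would become route.request.resource_type != "document".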
Sep 1, 2021
New post! Web Scraping with #Javascript and #NodeJs.
Learn how to build a web scraper, add anti-blocking techniques, a headless browser, and parallelize requests with a queue.
zenrows.com/blog/web-scrap…
You'll first build a web scraper using Axios and Cheerio, then a headless browser - Playwright.
Start by getting the HTML and parsing the content for new links to follow, applying Cheerio and CSS Selectors. Extract content in a similar way.
Follow the gathered links and start a loop that will iterate over all links we find.
To avoid problems, set a maximum limit and store a list with the already visited URLs to prevent duplicates.
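The post itself uses Node.js; the loop below is a Python sketch of the same idea (a queue of links, a visited set, and a maximum limit). The function name crawl and the get_links callback are assumptions: get_links stands in for "fetch the page and extract its links", so the loop can be tried without any network access.

```python
from collections import deque


def crawl(start_url, get_links, max_visits=10):
    """Visit pages breadth-first, skipping duplicates, up to max_visits."""
    visited = set()
    queue = deque([start_url])
    order = []
    while queue and len(visited) < max_visits:
        url = queue.popleft()
        if url in visited:
            # Already handled: the same link can be queued more than once.
            continue
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order
```

With an in-memory link graph such as {"a": ["b", "c"], "b": ["a", "c"], "c": ["d"], "d": []}, calling crawl("a", lambda u: graph.get(u, [])) visits each page once, in breadth-first order.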