What is WEB SCRAPING? 🤷‍♂️

To answer this question, I created a small web scraper for Amazon items.

This is a thread that explains step by step how it works 🧵👇

(find the complete code at the end)
1. What is web scraping?

A web scraper is a program that loads a website and reads information directly from its HTML, rather than going through a public API.

2. Why would you use web scraping?

It can be used to retrieve data from a website when no public API is available.

(1/13)
3. Finding out more about the website

Let's assume we want to create a web scraper that retrieves an item's availability.

(2/13)
We need to inspect the element with the browser debugger.

Here we find out that there is an element with the id "availability", and inside it a <span> containing the availability text (Temporarily out of order).

(3/13)
I checked this on different items, and even on different Amazon websites (.com, .de, .co.uk, etc.), and the structure was the same for every item I looked at, on every website.

So that's information we can leverage to build our web scraper.

(4/13)
4. Let's start coding!

We need 2 dependencies to make this work:
- node-fetch: Retrieve the HTML from a given URL
- cheerio: Parse HTML code and make it navigable

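Let's create an empty folder, and add these dependencies (node-fetch is pinned to version 2, because version 3 is ESM-only and can't be loaded with require):
$ yarn add node-fetch@2 cheerio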

(5/13)
Then we create an index.js file and import those 2 dependencies:

const fetch = require('node-fetch')
const cheerio = require('cheerio')

(6/13)
Then, we are going to write 3 functions:

- getPage: Retrieve the HTML from a website
- getAvailability: Get the node we want to read information from
- getAvailabilityText: Get the actual text from the node and sanitize it
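
The "sanitize" part of the last function can be sketched in isolation. The thread's own code only trims the text; the whitespace collapsing below is an extra assumption for cases where the text contains line breaks or repeated spaces, and sanitizeText is a hypothetical helper name.

```javascript
// Hypothetical helper: collapses runs of whitespace (including newlines)
// into single spaces and trims both ends.
const sanitizeText = raw => raw.replace(/\s+/g, ' ').trim()

console.log(sanitizeText('\n   In stock.\n   Ships   soon.  '))
```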

(7/13)
Retrieve the HTML from a website:

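const getPage = async url => {
  // Request the page and give up after 3 seconds
  const res = await fetch(url, { timeout: 3000 })
  // Resolve with the response body as a string (the HTML)
  return res.text()
}

We use node-fetch to make a request to the given URL, then return the result of text() (the HTML code).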
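
One thing to keep in mind: fetch resolves even when the server answers with an HTTP error status, so an error page would be handed on as if it were the item page. A possible variant that checks the status first is sketched below. The names getPageChecked and fetchImpl are hypothetical; the fetch function is passed in as a parameter only so the sketch can be tried without network access.

```javascript
// Hypothetical variant of getPage that rejects on HTTP error statuses.
// fetchImpl is any fetch-compatible function (for example node-fetch).
const getPageChecked = async (url, fetchImpl) => {
  const res = await fetchImpl(url, { timeout: 3000 })
  // res.ok is true only for 2xx status codes
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`)
  }
  return res.text()
}

// Usage with node-fetch, assuming it is installed as described in the thread:
// const fetch = require('node-fetch')
// getPageChecked('https://...', fetch).then(html => console.log(html.length))
```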

(8/13)
Get the node to read information from:

const getAvailability = async url => {
  const html = await getPage(url)
  const $ = cheerio.load(html)
  // Select the <span> inside the element with the id "availability"
  return $('#availability span')
}

Using "#availability span", we access the <span> inside the #availability element.

(9/13)
Get the text from the node and sanitize it:

const getAvailabilityText = async url => {
  const availability = await getAvailability(url)
  // Extract the text content and strip surrounding whitespace
  return availability.text().trim()
}

$('#availability span') doesn't return just the text, but a cheerio object wrapping the matched node. We only want the text, so we call text() and trim() on it.

(10/13)
To run this, we can do:

const url = 'https://...'
const itemAvailability = await getAvailabilityText(url)
console.log(`Availability status: ${itemAvailability}`)

Note that this needs to be inside an async function. You can't use await at the root level of a CommonJS file like this one.
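
One way to wrap the call in an async function is sketched below. To keep the sketch runnable without network access, getText is a made-up stand-in for the text-returning function from the thread, and both the URL and the returned string are placeholders.

```javascript
// Stand-in for the thread's text-returning function (hypothetical);
// the returned string is made up.
const getText = async url => 'In stock'

const main = async () => {
  const url = 'https://example.invalid/item' // hypothetical URL
  const itemAvailability = await getText(url)
  console.log(`Availability status: ${itemAvailability}`)
  return itemAvailability
}

// Start the async function and report any error instead of leaving
// the rejected promise unhandled.
const done = main().catch(err => {
  console.error(err)
  process.exitCode = 1
})
```

In a real script you would replace the stand-in with the actual function and pass a real item URL.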

(11/13)
This is the final code. You can check it out in my GitHub repository:
🔗 github.com/themarcba/amaz…

(12/13)
I made an NPM package, so you can also install it in your project with NPM or Yarn and try it out:

$ npm install @themarcba/amazon-web-scraper

$ yarn add @themarcba/amazon-web-scraper

(13/13)
If you liked this thread, make sure to give me a follow (@themarcba) to be notified about more like this 😃
