Simon Willison
Mar 14
Released shot-scraper 0.9 with a very fun new feature: you can now use it to execute JavaScript against a web page and return the result to the terminal as JSON!
github.com/simonw/shot-sc…
Full documentation here: github.com/simonw/shot-sc…

You can scrape pages and return a JSON object with the extracted data:

shot-scraper javascript datasette.io "({
title: document.title,
tagline: document.querySelector('.tagline').innerText
})"
If a JavaScript exception occurs the exit status for the shot-scraper invocation will be 1, which means you can also now use shot-scraper to run basic tests as part of a CI workflow:

- name: Test page title
  run: |-
    shot-scraper javascrip
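
The exit-status behaviour described above can also be checked from a script rather than a CI step. Here is a minimal Python sketch, assuming shot-scraper is installed and on the PATH. The helper names `command_succeeded` and `title_check_command` are hypothetical, and so is the expected title "Datasette". It also assumes that a JavaScript `throw` counts as the exception that produces exit status 1.

```python
import json
import subprocess
import sys


def command_succeeded(command):
    """Run a command and return True if it exited with status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


def title_check_command(url, expected_title):
    """Build a shot-scraper invocation whose JavaScript throws on a mismatch.

    json.dumps() turns the expected title into a quoted JavaScript string.
    """
    script = (
        "if (document.title !== " + json.dumps(expected_title) + ") "
        "{ throw new Error('Unexpected title: ' + document.title); }"
    )
    return ["shot-scraper", "javascript", url, script]


def main():
    # "Datasette" as the expected title is an assumption for illustration.
    command = title_check_command("datasette.io", "Datasette")
    # Mirror the exit status so callers (or CI) can act on it.
    sys.exit(0 if command_succeeded(command) else 1)
```

Calling `main()` would run the check against datasette.io. The function is defined but not invoked here, so the sketch can be imported without launching a browser.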
I expect this to be a really powerful technique for writing scrapers, especially when combined with GitHub Actions: fire up a headless browser, extract data with JavaScript, then write the resulting JSON back to the same repository. Classic git scraping! simonwillison.net/2020/Oct/9/git…
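
One detail that helps git scraping produce clean history is writing the JSON in a stable format, so a commit only happens when the data itself changes. The following is a rough Python sketch of that idea, not the workflow used in this thread. The function name `save_snapshot` is hypothetical, and it assumes the scraped JSON has already been parsed into a Python object.

```python
import json
from pathlib import Path


def save_snapshot(path, data):
    """Write data as stable, pretty-printed JSON.

    Returns True if the file content changed, False if it was already
    identical, so a caller can decide whether a commit is needed.
    """
    path = Path(path)
    # sort_keys and a fixed indent keep diffs small and deterministic.
    new_text = json.dumps(data, indent=2, sort_keys=True) + "\n"
    old_text = path.read_text(encoding="utf-8") if path.exists() else None
    if new_text == old_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True
```

Because the keys are sorted, two scrapes that return the same data in a different key order produce byte-identical files, and git sees no change.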
Previous shot-scraper releases are described in this thread
Just built my first git scraper with this! I wanted to keep track of how often stuff from my simonwillison.net site ends up on Hacker News

Here's their page: news.ycombinator.com/from?site=simo…

And here's my new GitHub Actions workflow that scrapes it: github.com/simonw/scrape-…

- name: Install dependencies
  run: |
    pip in
... it just spotted a new comment! github.com/simonw/scrape-…
Wrote this up on my blog: Scraping web pages from the command-line with shot-scraper
simonwillison.net/2022/Mar/14/sc…

More from @simonw

Mar 14
Instantly create a GitHub repository to take screenshots of a web page - a write-up of my new shot-scraper-template GitHub repository template simonwillison.net/2022/Mar/14/sh…
Want to take and store a screenshot of a web page?

Go to github.com/simonw/shot-sc…, enter the URL of the page you want in the "description" field, pick a name for your new repo and click the button

There is no step two
I describe how this works in the blog post: your new repo will run a GitHub Action that creates a "shots.yml" file with the URL from the description, then installs and runs "shot-scraper multi shots.yml" to take the screenshot and write it back to the repo simonwillison.net/2022/Mar/14/sh…
Mar 10
shot-scraper is my new tool for automating screenshots, primarily for documentation but with some devious scraping applications too
simonwillison.net/2022/Mar/10/sh…
It's built on top of @playwrightweb - shot-scraper provides a CLI tool for taking a screenshot of a page (or a portion of a page):

shot-scraper datasette.io -o datasette.png
Or of a portion of a page, specified using a CSS selector

shot-scraper simonwillison.net \
-s '#bighead' -o bighead.png
Feb 2
Whoa. webvm.io runs a full Debian VM entirely in your browser via WebAssembly... and it ships with working Perl, Python, Ruby and Node.js!
It has gcc too! This works:

gcc -o helloworld examples/c/helloworld.c

And it looks like there's a virtual filesystem that stores state in your browser
Combine this with:

- JupyterLite jupyterlite.readthedocs.io/en/latest/_sta…
- SQLime sqlime.org

And it looks like one of the killer apps of WebAssembly is providing 100% safe and reliable teaching environments for people who are just getting started learning complex technologies
Jan 12
Want to know the secret to blogging more often?

Lower your standards!

A post which you don't think is ready yet is a LOT better than a giant folder full of drafts that no-one ever gets to see

(Your readers won't ever know how good the thing you wanted to write would have been)
I like to apply this classic Reid Hoffman startup/product advice to my writing, because the alternative is basically never publishing anything at all
One of the biggest productivity improvements I ever made to my blogging was when I gave up on my desire to finish everything with a sparkling conclusion that ties together the whole post

Now I embrace abruptly ending when I've run out of things to say instead
Jan 11
I've been solving so many documentation problems with @nedbat's cog tool recently - it's fantastic for keeping documentation automatically up-to-date (in Markdown or rST)

Here's a new page of sqlite-utils docs showing --help for every CLI command! sqlite-utils.datasette.io/en/latest/cli-…
And here's how it works - I have a cog code block embedded in the .rst file which iterates through the commands and calls --help on each one, then writes the output to the page:
github.com/simonw/sqlite-…

.. contents:: :local:

.. [...
Final trick: my GitHub Actions test.yml file calls "cog --check docs/*.rst" to confirm that the cog scripts have been run

If the test fails, I can run "cog -r docs/*.rst" to execute them, then commit the result. github.com/simonw/sqlite-…
Jan 11
What’s new in sqlite-utils - annotated release notes for my SQLite Python utility library and CLI tool, v3.20 and v3.21 simonwillison.net/2022/Jan/11/sq…
A bunch of powerful new features in these releases.

The new --convert option to "sqlite-utils insert" lets you run a Python conversion function against data you are importing from JSON or CSV - and --lines lets you import raw lines of text (e.g. from log files) too
Combining the new --text option with --convert lets you load in a full unstructured/semi-structured file in one go and use a Python fragment to parse it into a list of dictionaries which then get inserted into a table
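
To illustrate the kind of Python fragment this describes, here is a standalone sketch of a conversion function that turns a semi-structured file into a list of dictionaries. The "key: value" file format, the function names `convert` and `insert_rows`, and the `pets` table are all hypothetical. The standard library `sqlite3` module stands in for sqlite-utils here, which handles table creation itself. It assumes the fragment receives the whole file as a single string.

```python
import sqlite3


def convert(text):
    """Parse 'key: value' blocks separated by blank lines into a list of dicts."""
    rows = []
    for block in text.strip().split("\n\n"):
        row = {}
        for line in block.splitlines():
            if not line.strip():
                continue  # tolerate extra blank lines between blocks
            key, _, value = line.partition(":")
            row[key.strip()] = value.strip()
        if row:
            rows.append(row)
    return rows


def insert_rows(conn, table, rows):
    """Create a TEXT-column table from the union of keys and insert every row."""
    if not rows:
        return
    columns = sorted({key for row in rows for key in row})
    column_sql = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(row.get(c) for c in columns) for row in rows],
    )
```

In this sketch every column is stored as text and missing keys become NULL. A real import would probably want type detection too.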