I built a new tool: s3-ocr, a utility for running OCR (with Amazon Textract) against every PDF file in an S3 bucket and getting the results back as a searchable SQLite database simonwillison.net/2022/Jun/30/s3…
Here's a demo of the database it generated running against 3 PDF files from the Library of Congress Harry Houdini Collection on the Internet Archive s3-ocr-demo.datasette.io/pages/pages?_f…
AWS Textract is /amazing/. I've run it against PDFs containing scanned handwritten notes from the 1880s and it can read them significantly more effectively than I can!
As with many of the tools I build, s3-ocr is a pretty small Python Click app - it's around 400 lines of code that wraps boto3: github.com/simonw/s3-ocr/…
I can't say enough good things about Click by @PalletsTeam - click.palletsprojects.com - I often find myself upgrading ad-hoc scripts from Jupyter notebooks to full CLI tools just because Click makes it so fast and productive to do so
For anyone too lazy to click on that link, I'm going to share a few of those highlights in a short Twitter thread
getElementsBySelector() was a JavaScript function I released in March 2003 that let you, well, get elements by selector! It ended up being part of the initial inspiration for jQuery
Found myself wanting to programmatically generate a map image with a marker on it using @openstreetmap
So I made a thing! url-map is a tiny website which accepts URL parameters and renders a Leaflet map... so you can generate screenshots with shot-scraper github.com/simonw/url-map
Here's an example image taken using this command line incantation:
The implementation is tiny - just one 36-line HTML file. It's a very thin adaptation of the first example in the Leaflet documentation github.com/simonw/url-map…
An interesting thing about working with GPT-3 is that you have to fact check /everything/, because it excels at generating convincing text but makes no promises that the facts in that text will be accurate in any way
In writing a tutorial about Datasette it invented several features that don't exist at all, some of which were actually pretty good ideas!
I really hope this holds. It's been quite a few years now for me that SPAs have felt like a very expensive trip in the wrong direction for most of the things they are being used for
I'm not sure how I missed Paint Holding being added to Chrome a few years back! Chrome waits 500ms to see if it gets a "first contentful paint" when you navigate between pages - if you respond that quickly it won't show a flash of white in between pages developer.chrome.com/blog/paint-hol…
Anyone know if Firefox has shipped Paint Holding yet? I found this relevant issue from a year ago but it wasn't clear if the feature had been shipped in a release yet bugzilla.mozilla.org/show_bug.cgi?i…
Released shot-scraper 0.14 - the latest version of my CLI tool for automating screenshots of web pages
The main feature in this release is a new, expanded documentation website (instead of cramming everything in the README) shot-scraper.datasette.io/en/stable/
The other new feature in this release was suggested by @psychemedia - you can now use --wait-for EXPRESSION to tell shot-scraper to wait until a specific JavaScript expression returns true before taking the shot shot-scraper.datasette.io/en/stable/scre…
Wrote a comment explaining how to achieve read-after-write consistency in a web application with a replicated database: any time a user makes a write, set a cookie that expires in 5s and causes their subsequent requests to go to the lead database news.ycombinator.com/item?id=314340…
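Here's a rough Python sketch of that cookie trick, using only the standard library. The cookie name and both function names are made up for illustration, and how you actually route a request to the lead database depends on your stack, so treat this as one possible shape rather than the approach from the linked comment.

```python
from http.cookies import SimpleCookie

PIN_COOKIE = "pin_to_primary"  # hypothetical cookie name
PIN_SECONDS = 5                # the 5 second expiry described above


def pin_cookie_header() -> str:
    """Build a Set-Cookie value to send back after handling a write."""
    cookie = SimpleCookie()
    cookie[PIN_COOKIE] = "1"
    cookie[PIN_COOKIE]["max-age"] = PIN_SECONDS
    cookie[PIN_COOKIE]["path"] = "/"
    return cookie[PIN_COOKIE].OutputString()


def choose_database(cookie_header: str) -> str:
    """Return 'primary' while the pin cookie is present, else 'replica'."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    # The browser stops sending the cookie once it expires, so the user
    # drifts back to the replica a few seconds after their last write.
    return "primary" if PIN_COOKIE in cookies else "replica"
```

A write handler would add `pin_cookie_header()` to its response as a `Set-Cookie` header, and every request would call `choose_database()` with the incoming `Cookie` header before picking a connection.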
The @flydotio fly-replay: HTTP header continues to be my favourite architectural hack of the last few years fly.io/blog/globally-…
Another neat trick for this is to store some kind of global transaction ID in the cookie such that the web app can tell if the user's write has been applied to the replica yet - Chris McCord of Elixir/Phoenix explains that in this comment: news.ycombinator.com/item?id=314340…
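A minimal sketch of the transaction ID idea in Python, assuming the database exposes a monotonically increasing integer position that both the lead and the replica can report. The function and parameter names are hypothetical, and reading the replica's applied position is database-specific, so it is passed in here rather than queried.

```python
from typing import Optional


def route_read(last_write_txid: Optional[int], replica_applied_txid: int) -> str:
    """Decide where a read should go.

    last_write_txid: the transaction ID stored in the user's cookie after
        their most recent write, or None if they have not written anything.
    replica_applied_txid: the latest transaction ID the local replica has
        applied (assumed to be monotonically increasing).
    """
    if last_write_txid is None:
        # No recent write by this user, so any replica is safe to read from.
        return "replica"
    if replica_applied_txid >= last_write_txid:
        # The replica has caught up with the user's own write.
        return "replica"
    # The replica is still behind, so send this read to the lead database.
    return "primary"
```

Compared with a fixed 5 second cookie, this only sends reads to the lead database for as long as the replica is actually behind.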