Simon Willison
Jun 30 7 tweets 3 min read
I built a new tool: s3-ocr, a utility for running OCR (with Amazon Textract) against every PDF file in an S3 bucket and getting the results back as a searchable SQLite database simonwillison.net/2022/Jun/30/s3…
Here's a demo of the database it generated running against 3 PDF files from the Library of Congress Harry Houdini Collection on the Internet Archive s3-ocr-demo.datasette.io/pages/pages?_f…
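As a rough illustration of what "searchable SQLite database" can mean in practice, here is a minimal sketch using Python's built-in sqlite3 module. The `pages` table layout, the column names, and the sample rows are assumptions made up for this example; they are not taken from s3-ocr's actual schema or from the demo data.

```python
import sqlite3

# Assumed, hypothetical schema: one row of OCR text per PDF page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (path TEXT, page INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        # Made-up sample rows, for illustration only.
        ("houdini/letters.pdf", 1, "A letter about a stage illusion"),
        ("houdini/letters.pdf", 2, "Notes on escaping from handcuffs"),
        ("houdini/scrapbook.pdf", 1, "Newspaper clipping about a tour"),
    ],
)


def search_pages(conn, term):
    """Return (path, page) pairs whose OCR text contains the search term."""
    rows = conn.execute(
        "SELECT path, page FROM pages WHERE text LIKE ? ORDER BY path, page",
        (f"%{term}%",),
    )
    return rows.fetchall()


matches = search_pages(conn, "handcuffs")
```

A plain `LIKE` query keeps the sketch dependency-free; a real deployment would more likely lean on SQLite full-text search.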
AWS Textract is /amazing/. I've run it against PDFs containing scanned handwritten notes from the 1880s and it can read them significantly more effectively than I can!
Here's an earlier TIL I wrote when I was still figuring out how to use Textract from Python code: til.simonwillison.net/aws/ocr-pdf-te…
If you're interested in helping out with the @SFMicroSociety project I built this for, get in touch with Ariel
As with many of the tools I build, s3-ocr is a pretty small Python Click app - it's around 400 lines of code that wraps boto3: github.com/simonw/s3-ocr/…
I can't say enough good things about Click by @PalletsTeam - click.palletsprojects.com - I often find myself upgrading ad-hoc scripts from Jupyter notebooks to full CLI tools just because Click makes it so fast and productive to do so

More from @simonw

Jun 12
It's the 20th anniversary of the launch of my blog, simonwillison.net!

I decided to celebrate by trying to pull together some highlighted content from the past 20 years simonwillison.net/2022/Jun/12/tw…
For anyone too lazy to click on that link, I'm going to highlight a few of those highlights in a short Twitter thread
getElementsBySelector() was a JavaScript function I released in March 2003 that let you, well, get elements by selector! It ended up being part of the initial inspiration for jQuery

Think of it as the original polyfill for document.querySelectorAll() simonwillison.net/2003/Mar/25/ge…
Read 22 tweets
Jun 12
Found myself wanting to programmatically generate a map image with a marker on it using @openstreetmap

So I made a thing! url-map is a tiny website which accepts URL parameters and renders a Leaflet map... so you can generate screenshots with shot-scraper github.com/simonw/url-map
Here's an example image taken using this command line incantation:

shot-scraper 'simonw.github.io/url-map/?cente…' \
--retina --width 600 --height 400 --wait 3000
The implementation is tiny - just one 36-line HTML file. It's a very thin adaptation of the first example in the Leaflet documentation github.com/simonw/url-map…
Read 8 tweets
May 31
I asked GPT-3 to write a tutorial for getting started with Datasette, and a marketing landing page for my forthcoming Datasette Cloud hosting service.

The results really were disconcertingly good! simonwillison.net/2022/May/31/a-…
An interesting thing about working with GPT-3 is that you have to fact check /everything/, because it excels at generating convincing text but makes no promises that the facts in that text will be accurate in any way
In writing a tutorial about Datasette it invented several features that don't exist at all, some of which were actually pretty good ideas!
Read 8 tweets
May 22
The balance has shifted away from SPAs - nolanlawson.com/2022/05/21/the…

I really hope this holds. It's been quite a few years now for me that SPAs have felt like a very expensive trip in the wrong direction for most of the things they are being used for
I'm not sure how I missed Paint Holding being added to Chrome a few years back! Chrome waits 500ms to see if it gets a "first contentful paint" when you navigate between pages - if you respond that quickly it won't show a flash of white in between pages
developer.chrome.com/blog/paint-hol…
Anyone know if Firefox has shipped Paint Holding yet? I found this relevant issue from a year ago but it wasn't clear if the feature had been shipped in a release yet bugzilla.mozilla.org/show_bug.cgi?i…
Read 5 tweets
May 19
Released shot-scraper 0.14 - the latest version of my CLI tool for automating screenshots of web pages

The main feature in this release is a new, expanded documentation website (instead of cramming everything in the README) shot-scraper.datasette.io/en/stable/
The other new feature in this release was suggested by @psychemedia - you can now use --wait-for EXPRESSION to tell shot-scraper to wait until a specific JavaScript expression returns true before taking the shot shot-scraper.datasette.io/en/stable/scre…
Plus I added a whole new page of documentation about using shot-scraper with GitHub Actions shot-scraper.datasette.io/en/stable/gith…
Read 6 tweets
May 19
Wrote a comment explaining how to achieve read-after-write consistency in a DB replicated web application: any time a user makes a write, set a cookie that expires in 5 seconds and causes their subsequent requests to go to the lead database news.ycombinator.com/item?id=314340…
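A minimal sketch of that cookie trick, assuming a hypothetical cookie name (`recent_write`) and a simplified request model in which writes are any non-GET/HEAD method. In a real browser an expired cookie is simply never sent, so the server would only check for its presence; here the expiry is modelled explicitly so the example is self-contained.

```python
import time

COOKIE_NAME = "recent_write"  # hypothetical cookie name
STICKY_SECONDS = 5            # the 5 second expiry described in the tweet


def cookie_after_write(now=None):
    """Build the cookie to attach to the response for any write request."""
    now = time.time() if now is None else now
    return {"name": COOKIE_NAME, "value": "1", "expires_at": now + STICKY_SECONDS}


def choose_database(method, cookies, now=None):
    """Pick "lead" or "replica" for a request.

    `cookies` maps cookie names to cookie dicts as built by cookie_after_write().
    """
    now = time.time() if now is None else now
    if method not in ("GET", "HEAD"):
        return "lead"  # writes always go to the lead database
    cookie = cookies.get(COOKIE_NAME)
    if cookie is not None and cookie["expires_at"] > now:
        return "lead"  # user wrote recently, so read from the lead too
    return "replica"
```

For example, a user who writes at t=100 has reads at t=102 routed to the lead and reads at t=106 routed back to a replica.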
The @flydotio fly-replay: HTTP header continues to be my favourite architectural hack of the last few years fly.io/blog/globally-…
Another neat trick for this is to store some kind of global transaction ID in the cookie such that the web app can tell if the user's write has been applied to the replica yet - Chris McCord of Elixir/Phoenix explains that in this comment: news.ycombinator.com/item?id=314340…
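The transaction ID variant might look something like the sketch below. It assumes transaction IDs are monotonically increasing integers and that the replica can report the latest one it has applied; both the function names and that model are illustrative assumptions, not details taken from the linked comment.

```python
def replica_is_caught_up(cookie_txid, replica_txid):
    """True when the replica has applied the user's most recent write.

    cookie_txid: transaction ID stored in the user's cookie, or None if
                 the user has no recorded write.
    replica_txid: latest transaction ID the replica has applied.
    """
    return cookie_txid is None or replica_txid >= cookie_txid


def route_read(cookie_txid, replica_txid):
    """Send a read to the replica only if it already reflects the user's write."""
    if replica_is_caught_up(cookie_txid, replica_txid):
        return "replica"
    return "lead"
```

Compared with a fixed expiry, this sends the user back to the replica as soon as replication catches up, rather than after an arbitrary delay.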
Read 4 tweets
