Joe Barrow
Jun 19, 11 tweets, 5 min read
New paper: every law in America is technically public. But not really, until now!

With @DenisPeskoff at UC Berkeley, we built a corpus of ~every publicly accessible city and county law, and released a huge chunk of it!

2.2 million laws, you're (probably) covered in it!

🧵
First, the released dataset has pretty good geographic coverage, and accounts for a majority of the US population!

It covers everything, including zoning, noise, housing codes, and whether or not you can ride an ATV without a license.
How do you build such a dataset when all the laws come in heterogeneous formats? To me the obvious answer is: OCR it all!

In this case, using the smallest/best OCR model we could, @LightOnIO's LightOnOCR-2-1b.

That gives us a high-quality, consistent format we can parse.
First approach was to try and make my local GPUs go brrrr.

That worked quite well for the first million pages, but we had millions more to go, and needed to scale.
Thankfully @modal (and @charles_irl) very kindly provided a compute grant to make this work happen!

You wanna *really* see GPUs go brrr? (And CPUs, for rendering)

We were able to process all 7M pages at ~$0.30/1k pages, much cheaper than options like AWS Textract ($1.50/1k pages).
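To make the cost gap concrete, here is a back-of-the-envelope calculation using only the figures quoted above (7M pages, ~$0.30 vs. $1.50 per 1k pages). The function and variable names are just for illustration, and the per-1k rate for our run is approximate, so treat the totals as rough.

```python
# Rough cost comparison using the numbers quoted in the thread.
PAGES = 7_000_000          # "all 7M pages"
OURS_PER_1K = 0.30         # approximate rate quoted for the OCR run
TEXTRACT_PER_1K = 1.50     # AWS Textract rate quoted in the thread


def total_cost(pages: int, usd_per_1k: float) -> float:
    """Total cost in USD for a page count billed per 1,000 pages."""
    return pages / 1000 * usd_per_1k


ours = total_cost(PAGES, OURS_PER_1K)
textract = total_cost(PAGES, TEXTRACT_PER_1K)

print(f"our OCR run: ${ours:,.0f}")
print(f"Textract:    ${textract:,.0f}")
print(f"ratio:       {textract / ours:.1f}x")
```

That works out to roughly $2,100 versus $10,500 for the same 7M pages, about a 5x difference.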
Once we had processed the laws, the question was: how to analyze them.

One idea I really like is @TheStalwart's Havelock.ai, where he distills pairwise comparisons from an LLM + TrueSkill into a BERT model to score orality.
In our case, we do the same for the laws across 4 axes: paternalism, opacity, enforcement discretion (how much leeway does the law provide), and salience.
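
A minimal sketch of the "pairwise comparisons to scalar score" step, for one axis. Big caveats: this uses a plain Elo update as a dependency-free stand-in for TrueSkill, and `toy_judge` plus the `hidden_opacity` values are hypothetical stand-ins for the LLM judge and real laws. It is not the paper's pipeline, just the general shape of the idea.

```python
from itertools import combinations


def expected_win(r_a: float, r_b: float) -> float:
    """Probability that A wins a comparison under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def rate_from_pairs(items, comparisons, k=16.0, start=1000.0):
    """Turn (winner, loser) pairs into one scalar rating per item."""
    ratings = {item: start for item in items}
    for winner, loser in comparisons:
        # Larger update when the outcome was less expected.
        surprise = 1.0 - expected_win(ratings[winner], ratings[loser])
        ratings[winner] += k * surprise
        ratings[loser] -= k * surprise
    return ratings


# Hypothetical toy data: a made-up "true" opacity per law, unknown to the rater.
hidden_opacity = {"noise ordinance": 0.2, "zoning overlay": 0.9, "ATV licensing": 0.5}


def toy_judge(a: str, b: str):
    """Hypothetical stand-in for an LLM call; returns (more opaque, less opaque)."""
    return (a, b) if hidden_opacity[a] > hidden_opacity[b] else (b, a)


# Judge every pair 20 times, then fit ratings from the outcomes alone.
pairs = [toy_judge(a, b) for _ in range(20) for a, b in combinations(hidden_opacity, 2)]
ratings = rate_from_pairs(hidden_opacity, pairs)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

In a distillation setup like the one described, per-item ratings of this kind would then serve as regression targets for a smaller model, so scoring a new law no longer needs the LLM judge.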

We analyzed the 2.2 million laws and some interesting patterns emerged.
California and Florida, you need to get your shit together so people can actually understand your laws! And Ohio and West Virginia, wtf is going on with how you run people's lives??
Our released dataset scores all 2.2 million laws along these axes, and we've released the distilled models for scoring/understanding the laws!

All distilled into @benjamin_warner/@bclavie/@antoine_chaffin/@orionweller/@answerdotai's ModernBERT.
Paper Link: arxiv.org/abs/2606.19334
Dataset: huggingface.co/datasets/Local…
Models: huggingface.co/LocalLaws

And obviously a huge thank you to @modal for enabling this work.
N.B. I wonder if we're the first academic citation for @TheStalwart's @Havelock_AI.

If there's a better way to cite, let me know and I'll update the preprint!
