Yes, it's a thread. Because I'm doing something different.
1/23
I wrote a tool to do OCR (optical character recognition)—converting the PDF to searchable text—w/ 4 libraries. Using natural language processing (NLP), I identified sections where each OCR engine (e.g. ABBYY, Tesseract) did a good job or a crappy job.
2/23
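The thread doesn't show the comparison code, so here's a minimal sketch (my illustration, not the author's pipeline) of one way to score competing OCR outputs: use a crude garbage-character ratio as the quality signal and keep whichever engine scores best per section.

```python
import re

def ocr_quality_score(text: str) -> float:
    """Crude quality signal: fraction of tokens that look like clean words.

    A real pipeline would use dictionaries or a language model; this is a toy heuristic.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    clean = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z]+[.,;:!?]?", t))
    return clean / len(tokens)

def pick_best_engine(outputs: dict[str, str]) -> str:
    """Given {engine_name: ocr_text} for one section, return the best-scoring engine."""
    return max(outputs, key=lambda name: ocr_quality_score(outputs[name]))

# Hypothetical engine outputs for the same section of the scanned PDF.
outputs = {
    "engine_a": "The quick brown fox jumps over the lazy dog.",
    "engine_b": "Th3 qu!ck br0wn f@x jumps 0ver the 1azy d0g.",
}
best = pick_best_engine(outputs)  # → "engine_a"
```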
This part consists of a few steps. The first is to identify "entities" in the text, a step called named-entity recognition (NER). That means, among other things: dates, names, legal attributes (e.g. references to the US Code), and locations. This model was custom-tuned and retrained several times.
5/23
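The thread doesn't include the model itself, so here's a toy rule-based sketch of the idea (a real system would use a trained NER model, e.g. spaCy) that pulls out two of the entity types mentioned: dates and US Code references.

```python
import re

# Toy patterns for illustration only; a production system would use a trained NER model.
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)
USC_RE = re.compile(r"\b\d+\s+U\.S\.C\.\s+§+\s*\d+[a-z]?\b")

def extract_entities(text: str) -> dict[str, list[str]]:
    """Return the dates and US Code citations found in `text`."""
    return {
        "DATE": DATE_RE.findall(text),
        "US_CODE": USC_RE.findall(text),
    }

sample = "On June 14, 2017, the defendant was charged under 18 U.S.C. § 371."
ents = extract_entities(sample)
# → {'DATE': ['June 14, 2017'], 'US_CODE': ['18 U.S.C. § 371']}
```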
Why is that helpful?
Well, I've done the same thing with previous filings…
10/23
"I like apple pie" has four words; 3 "bi-grams" (I like; like apple; apple pie); 2 "tri-grams" (I like apple; like apple pie).
12/23
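That n-gram computation is a few lines of code (my sketch, not the thread's):

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return the list of n-grams (joined as strings) over whitespace-split tokens."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I like apple pie"
bigrams = ngrams(sentence, 2)   # → ['I like', 'like apple', 'apple pie']
trigrams = ngrams(sentence, 3)  # → ['I like apple', 'like apple pie']
```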
Section 508 requires your PDF to be accessible to users of assistive technology—like screen readers or Braille displays.
You literally violated federal law with the scanned report.
Then, I’ll re-run the NER and other steps, and proceed with the timeline.
Extra caffeine-infused thanks to Sherri, Nate, and Andrianna for the Venmo love!
Okay, time to get back to the grunt work part of this endeavor. I'll update with a progress report soon.
This would've been so much faster if they cared about Section 508.
Downside: OCR frequently chokes on NUMBERS, of all things (machine learning nerds will see the irony in that, since most will have written code to OCR numbers).
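One common band-aid for that (my sketch, not necessarily what this pipeline does): in spans that are expected to be numeric, map the classic letter-for-digit look-alikes back to digits.

```python
# Classic OCR confusions when a letter lands where a digit belongs.
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

def fix_numeric_span(span: str) -> str:
    """Apply look-alike fixes; only safe on spans already known to be numeric."""
    return span.translate(CONFUSIONS)

fixed = fix_numeric_span("2OI7")  # a mangled year → "2017"
```

Restricting the substitution to known-numeric spans matters: applied to ordinary prose, these swaps would mangle real words.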
I’ll write some more tech details tomorrow, along with samples of the progress. In the meantime, I’mma take a break from coding. 🤓
There are plenty of bad reasons, though. Grr