51 tweets, 10 min read
I know a lot of media personalities, Twitter "influencers", or whatever have shared their treatment of the Mueller Report today. I'm a big ol' nerd, so here's what I'm doing with all this delicious, delicious data.

Yes, it's a thread. Because I'm doing something different.

Part A: Data Cleansing

I wrote a tool to do OCR (optical character recognition)—converting the PDF to searchable text—w/ 4 libraries. Using "natural language processing" (NLP) tools, I identified sections where tools (e.g. Abbyy, Tesseract) did a good job or a crappy job.

Two of the OCR libraries I used allowed for their recognition networks to be "trained" with new words or symbols, so part of that data cleansing allowed me to "teach" them to recognize some words (names) or symbols that those libraries might otherwise misread.

I know that many others have posted OCR'd versions of the PDF; all that I found used one OCR engine to OCR the PDF. By using NLP to "grade" each tool's output, I was able to combine them to extract a more accurate plain-text of the PDF. I'll share that final text later.
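For the curious, here's a toy sketch of that grading idea. I'm standing in a tiny word list for the real NLP scoring, so everything here—the word list, the sample page text—is invented for illustration:

```python
# Toy sketch: "grade" each OCR engine's output by the fraction of tokens
# that look like real words, then keep the best-scoring candidate.
# A real pipeline would score with a language model, not a word list.
import re

WORDS = {"the", "special", "counsel", "interviewed", "witnesses", "in", "june"}

def score(text):
    """Fraction of tokens that appear in the word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in WORDS for t in tokens) / len(tokens)

def best_candidate(candidates):
    """Keep the engine output that 'reads' most like real English."""
    return max(candidates, key=score)

page = best_candidate([
    "The Sp3cial C0unsel interv1ewed witnesses in June",  # noisy engine
    "The Special Counsel interviewed witnesses in June",  # cleaner engine
])
```

In the real pipeline this runs per-section rather than per-page, so the merged text can mix engines within a single page.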

Part B: NLP Analysis

This part consists of a few steps. First is to identify "entities" in the text. That means, among other things: dates, names, legal attributes (e.g. references to the US Code), and locations. This model was custom tuned and trained several times.

This is called "named entity recognition" (NER), and it allows me to do stuff like: print a list of sentences in the chronological order of the date referenced in each sentence, or create a network graph showing the connections between people mentioned in the report.
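A crude illustration of what NER output looks like—this is just a name list plus a date regex, not the trained statistical model I actually used, and the name list is purely illustrative:

```python
# Minimal stand-in for NER: a gazetteer for people plus a regex for dates.
# Shows only the *shape* of the output: (entity text, label, character span).
import re

PEOPLE = {"Cohen", "Flynn", "McGahn", "Sessions"}
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

def toy_ner(sentence):
    """Return (text, label, span) triples, in order of appearance."""
    entities = []
    for m in re.finditer(r"[A-Z][a-z]+", sentence):
        if m.group() in PEOPLE:
            entities.append((m.group(), "PERSON", m.span()))
    for m in re.finditer(rf"(?:{MONTHS}) \d{{1,2}}, \d{{4}}", sentence):
        entities.append((m.group(), "DATE", m.span()))
    entities.sort(key=lambda e: e[2])
    return entities

ents = toy_ner("Flynn met Sessions on June 9, 2016.")
```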

(I know I'm giving away a lot of secret sauce here, but all this is the messy, dirty truth of data science. 95% of the work is cleaning up crappy data. It's thankless, tedious, and frustrating. But it's the only way to enable the other steps.)

After NER, the text is tagged for parts-of-speech. It's frankly too tedious (and often not worth it) to confirm 100% accuracy here—most approaches are about 90% accurate—but even 90% is worth it. Why? Because that information can "train" a program to do some cool things…

…Like being able to search for "<person> <verb, past tense> <person>" and identify every sentence where one person [said/texted/messaged/emailed/signaled/called/tickled/whatever] another. That's one example. It also gets used to establish something called "co-references".
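Here's roughly what that pattern search looks like once tokens carry POS tags and entity labels (hand-tagged below for the demo; in the real pipeline they come from the tagger and NER steps):

```python
# Scan a tagged sentence for the pattern <PERSON> <past-tense verb> <PERSON>.
# Each token is a (text, pos_tag, entity_label) triple; "VBD" is the
# Penn Treebank tag for a past-tense verb.
def person_verbed_person(tagged):
    hits = []
    for a, b, c in zip(tagged, tagged[1:], tagged[2:]):
        if a[2] == "PERSON" and b[1] == "VBD" and c[2] == "PERSON":
            hits.append((a[0], b[0], c[0]))
    return hits

sentence = [
    ("Cohen", "NNP", "PERSON"),
    ("called", "VBD", ""),
    ("Flynn", "NNP", "PERSON"),
]
hits = person_verbed_person(sentence)
```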

Co-references are when two people are mentioned together in the same sentence clause, full sentence, or even paragraph (depending on what you're hoping to accomplish).
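A toy version of building those edges—given the people found in each sentence, count how often each pair shows up together:

```python
# Build weighted co-occurrence edges: for every sentence, every pair of
# people mentioned in it gets its edge count bumped. The result feeds
# directly into a network graph.
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences_people):
    edges = Counter()
    for people in sentences_people:
        for a, b in combinations(sorted(set(people)), 2):
            edges[(a, b)] += 1
    return edges

edges = cooccurrence_edges([
    ["Cohen", "Flynn"],
    ["Cohen", "Flynn", "Sessions"],
])
```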

Why is that helpful?

Well, I've done the same thing with previous filings…

Which means that I can write a simple (relative term, I know) machine learning algorithm that can make predictions of redacted names based on prior co-references. That involves predicting, rendering them in the same font, and determining if they fit the redaction boxes.
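A very rough sketch of the "does it fit the redaction box" test. The per-character widths below are invented; a real version measures glyph widths in the report's actual font:

```python
# Estimate a candidate name's rendered width from a per-character width
# table (made-up values!), and keep only candidates whose width falls
# within a tolerance of the measured redaction box.
CHAR_WIDTH = {c: 7.0 for c in "abcdefghijklmnopqrstuvwxyz"}
CHAR_WIDTH.update({c: 9.0 for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"})
CHAR_WIDTH[" "] = 4.0

def text_width(text):
    return sum(CHAR_WIDTH.get(c, 7.0) for c in text)

def fits(candidate, box_width, tolerance=8.0):
    return abs(text_width(candidate) - box_width) <= tolerance

candidates = [n for n in ("Cohen", "Papadopoulos") if fits(n, 44.0)]
```

The co-reference history is what shortlists the candidates in the first place; the width check just prunes ones that can't physically fit.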

Next step in my NLP is to break down the whole document into individual sentences, words, and multiple-word sets (called n-grams). For example:

"I like apple pie" has four words; 3 "bi-grams" (I like; like apple; apple pie); 2 "tri-grams" (I like apple; like apple pie).

N-grams (where "N" is usually a number between 1–3) are used in combination with some really cool machine learning tools called "embeddings", which convert them to numbers (more technically: vectors) to make it easier to process them and extract interesting information.
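Here's the payoff of turning text into vectors: "similar" becomes a number you can compute. The three-dimensional "embeddings" below are invented for the demo (real ones have hundreds of dimensions):

```python
# Cosine similarity: the standard way to compare embedding vectors.
# Vectors pointing the same direction score near 1; unrelated ones score low.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

meeting_a = [0.9, 0.1, 0.2]   # toy embedding of "met with X in June"
meeting_b = [0.8, 0.2, 0.25]  # toy embedding of "sat down with X that month"
unrelated = [0.0, 0.9, 0.1]   # toy embedding of an unrelated phrase

sim_close = cosine(meeting_a, meeting_b)
sim_far = cosine(meeting_a, unrelated)
```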

So what's the point of all that? It's easier to identify dates, people and locations that are important; and more interestingly, identify phrases that are similar but for the people/places/locations. This can also help with "predicting" the text behind redactions.

Using NER to identify dates in the text of the report has a nice side effect: correlating statements and events in the report with public news stories, tweets/FB posts/IG posts made by IRA-operated accounts, and those of campaign officials.

My end product there will (knock wood) be a complete timeline of every event that was referenced in the report, major related news stories (which may "unmask" more redactions), and the social media activity of both foreign actors and campaign officials.
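The merge itself is the easy part once everything carries a date—here's the shape of it, with a few illustrative records:

```python
# Each source (report text, news stories, social posts) yields
# (date, source, text) records; sorting merges them into one timeline.
from datetime import date

events = [
    (date(2016, 6, 15), "social", "Guccifer 2.0 account appears"),
    (date(2016, 6, 9), "report", "Trump Tower meeting described"),
    (date(2016, 6, 14), "news", "DNC hack first reported"),
]

timeline = sorted(events)  # tuples sort by their first element: the date
```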

But wait, there's more. I have a multi-gigabyte data dump of aircraft flight tracking data—including flights hidden from FlightAware and similar sites—and a list of the codes used on aircraft owned by many of the oligarchs mentioned in the report.

Some of you may remember me tweeting about those flights a few years back. It got the attention of a few…characters…here on Twitter, but it was definitely interesting information. Some hand-checking already found (let's say) coincidental flights aligned with meetings.

That's why the NER step is so important. It enables correlation of information in the report with a multitude of other data from many disparate sources. It's that greater context that I hope will contribute something new and valuable to the conversation.

My hope is that the final tagged, correlated data sets will prove useful to congressional investigators, as well as journalists (including other "citizen journalists"). I will make everything freely available as it is completed. It will likely be a few days.

I hope you found this thread to be an interesting (in a nerdy way) view into what I think will be a major component of newsrooms and investigative agencies in the future: using data science and machine learning to extract even more value from rich textual content.

Stay tuned for more updates on my progress tomorrow. When finished, I'll share my final raw data sets, as well as my trained models for OCR, NER, and word/phrase embedding (which will require specific tools for other data science folks to utilize—details to be included).

If you want to support my efforts, I do enjoy caffeine, and would always appreciate a little Venmo love. 😊

I'm "spdustin" on Venmo, or you can scan my QR Code here:

I also appreciate job offers. My DMs are open.

Thanks, and stay tuned!

(P.S. I did take a lot of poetic license in describing some of this stuff; if you're a data science nerd, I know you can recognize the actual terms of art that I've simplified in this post. I welcome DMs with any questions about the technical details)
(P.P.S. Non-technical folks who'd like to know more, just reply to any message on the thread and I'll be happy to teach you more. Don't DM for those questions—everyone can benefit from the exchange. I love teaching, so don't hesitate to post your questions)
Holy crap, y'all.

(screenshot from a custom web app that manages my Twitter feed)
Update: here's a visual example of what "named entity recognition" (NER) can do. This was before I trained the recognizer with additional people, like "President" and the last-name-only references (the term for matching up last names to the correct full name is "dereferencing")
“results will not be ideal or identical since the source images are of relatively low quality. In particular, OCR errors will be more common adjacent to underlines and redactions.”

Ain’t that the truth. My automated OCR “merging” missed a lot of these.

Data cleansing is messy.

Analysis: We assess that the document was most likely scanned twice, with redactions being added to the first scanned document using software.
It isn’t hard, @TheJusticeDept, to release a properly redacted •and accessible• PDF file.

Section 508 requires your PDF to be accessible to users of assistive technology—like screen readers or Braille displays.

You literally violated federal law with the scanned report.
The sort of analysis that I’m working on would have been easier if I enlisted volunteers early on to correct OCR errors. It’s too late now, though—I’m nearly finished correcting them all. Yuck.

Then, I’ll re-run the NER and other steps, and proceed with the timeline.
This does illuminate the realities of “data science”, and it’s an important message: most of what data scientists do is boring, tedious grunt work. Wrangling data into a usable format, fixing errors, finding outliers due to messy data…it’s not glamorous, but it’s 100% necessary.
That would be a really fun thing to do when I'm finally done with the cleaning and analysis!

Thanks for the great questions and the supportive comments, tweeps!

Extra caffeine-infused thanks to Sherri, Nate, and Andrianna for the Venmo love!

Okay, time to get back to the grunt work part of this endeavor. I'll update with a progress report soon.

Progress: I've gotten the text 50% annotated with good "named entities", and maybe 25% of the footnotes separated and linked—about 90% of all the text has correct OCR.

This would've been so much faster if they cared about section 508.

Soon, I'll get back to fixing the OCR, and separating the rest of the footnotes (with links to content). For my final timeline to really work the way I want, the whole report basically has to be converted to HTML (a web page), with paragraphs and citation links, etc. So tiring!
(Any other geeks use kraken or ocropus? If you have found good models for them, would you let me know?)
Oh, I forgot the teaser: the "named entity recognition" step did a pretty good job with the names. I haven't updated that step with the "dereferenced" names (where the last name is seen as the same person as first and last name), but look: shiny! Word cloud showing the relative frequencies with which names were mentioned in the Mueller Report. Cohen, Flynn, McGahn, and Sessions stand out as frequent mentions.
Frequently mentioned dates in the Mueller Report: a word cloud of the dates referenced most often.
One last teaser: a screen snip from my notebook showing sentences that contain a "date" entity as a full date.
You’ll notice some of the “person” named entities a few tweets ago aren’t people. That’s because I’m using a smaller “model” (trained less than the bigger ones), just so I can get the code right for creating the actual dataset of relevant events. The larger model takes longer, but works better.
Thanks to Sherri (again!) and Valerie for the boost! I appreciate the support!

Alright, I think I’ve got a good format for the dataset itself. I alternated between cleaning data, tuning and re-running my OCR Automated Proof (OCRAP, for the lulz) to improve OCR by selecting passages with fewer spelling errors, and exploring uses for those “named entities”.
Dates are tough, and a good example of why NLP tools can be a boon. It’s easy to search for “<Month> <1–2 digits>, <4 digits>”, but extracting date entities needs to catch stuff like “the following month” or “three days later”, and be able to work out what that references.
Yes, this is the sort of thing that interns and grad students would likely get assigned to do by hand (and a big group would be done by now), but I’m aiming to make this process automated and reusable on similar narratives to extract a chronological timeline of relevant info.
That’s where “natural language processing” (NLP) tools can shine. They can be programmed to de-reference relative “date” entities to find their context. “Two days later” can become <Date:2018-06-21> (or whatever) with a well-reasoned processing pipeline.
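A minimal sketch of that de-referencing, handling just the "N days later" pattern—real pipelines need far richer rules and smarter tracking of the anchor date:

```python
# Resolve a relative "date" entity against an anchor date pulled from
# earlier in the narrative. Anything unrecognized resolves to None.
import re
from datetime import date, timedelta

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def resolve(phrase, anchor):
    m = re.match(r"(\w+) days? later", phrase.lower())
    if m and m.group(1) in NUMBER_WORDS:
        return anchor + timedelta(days=NUMBER_WORDS[m.group(1)])
    return None

resolved = resolve("Two days later", date(2018, 6, 19))
```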
That screenshot of sentences earlier? That has evolved enough now to change relative references into absolute dates.

Downside: OCR frequently chokes on NUMBERS of all things (machine learning nerds will see the irony in that, since most will have written code to OCR numbers)
Because of those types of errors, my data extraction can miss out on relevant info. Again, I do recognize a team of people could manually do the same work, but I think there’s great value in an automated pipeline to extract a timeline of events from a narrative.
Not to mention all the other vague date entities, like “May 2016”. How should that be visualized? How should a datestamped narrative be stored in a standardized way that works for other types of data? I haven’t found satisfactory answers to those questions, so I have to write it!
It’s a rewarding project, and one that I believe will have a growing (and lasting) value. I’m most thankful for the support from you, dear reader.

I’ll write some more tech details tomorrow, along with samples of the progress. In the meantime, I’mma take a break from coding. 🤓
I will add: I’m still not over the Section 508 violation from our asshat AG. There is no good reason to take content that was assuredly “born digital” (with secure redactions even), PRINT IT, and then fucking SCAN IT on the DOJ Ricoh.

There are plenty of bad reasons, though. Grr