Today, as I get back to the analysis exercise referenced in the old thread below, I’ll be updating here, in a new one.

First step: “fish or cut bait” on the full extraction and linking of footnotes. I may punt it to later…but I have a few ideas I’ll try

Then: the ethics of using machine learning approaches to guess at what will fit under the redaction boxes.

Short answer: technically feasible (for some definition) with language models that are trained with every indictment, sentencing memo, etc. from Mueller. Ethically wrong.
I’ve been reading a number of posts/pages on the topic, and now that I’m well past the honeymoon portion of this endeavor, the novelty of having a model make even one successful prediction has lost its appeal.

Ethics—now, more than ever—matter.

I’ll share some links later. 👍
So: it’s breakfast time, then back to cleaning and finishing the training of the named entity recognition (NER) model so I can get the time-stamped data out. Footnotes will still be parsed, so a timeline of witness interviews will be part of the results.

More details to come!
Progress: Every sentence (+footnotes) that contains anything identified as an absolute date, extracted.
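
A minimal sketch of this extraction step, assuming a plain regular expression in place of the trained NER model mentioned earlier in the thread. The names `find_absolute_dates` and `sentences_with_dates` are hypothetical, and only the "Month D, YYYY" format is handled.

```python
import re
from datetime import date

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Assumption: absolute dates look like "May 17, 2017". Other formats are ignored.
ABSOLUTE_DATE = re.compile(
    r"\b(" + "|".join(MONTH_NAMES) + r")\s+(\d{1,2}),\s*(\d{4})\b"
)


def find_absolute_dates(sentence):
    """Return every valid absolute date in the sentence, in order of appearance."""
    found = []
    for month, day, year in ABSOLUTE_DATE.findall(sentence):
        try:
            found.append(date(int(year), MONTHS[month], int(day)))
        except ValueError:
            # Skip impossible dates such as "February 30", often an OCR error.
            continue
    return found


def sentences_with_dates(sentences):
    """Keep only the sentences that contain at least one absolute date."""
    dated = []
    for sentence in sentences:
        dates = find_absolute_dates(sentence)
        if dates:
            dated.append((sentence, dates))
    return dated
```

Returning the dates alongside each sentence keeps the original text available for later grouping.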

Next, updating the code to properly handle relative dates ("two days later", using the previously matched date as context) and inexact dates ("Nov. 2016"), and to group results by date.

[Image: Screenshot of HTML output showing a list of dates, and events that occurred on those dates.]
Another view, showing density of "events" noted for a given weekday of each month.

Next tasks:

• more data cleansing (yay)
• finalizing a standard structure for the raw data
• joining in data from Twitter, and
• sharing the results with you all!
In the previous image, May 17, 2017 has a particularly dark shade applied to it. Sure enough, it was a busy day for the Trump team.

Once witness interviews are able to be filtered, and presidential Tweets or other data are correlated, we might discover some interesting things!

[Image: Screenshot showing a listing of events documented in the Mueller Report that occur on May 17, 2017]
(Well, "busy" is in reality an overstatement for May 17th. While it appears multiple times in the report, most occurrences are simply referencing Mueller's appointment.)
I’m going to press on with analyses and finalizing the format of the data, in spite of the messy text that remains. I’ll crowdsource the rest of the text corrections afterward; I’m confident enough that the rest of my analysis pipeline is close to a workable state.
Mental health break over. Got some caffeine to pick up (thanks to new supporters Allan and Richard!) and then it’s back to the grind.
Want to show support and kick in to the caffeine fund? Venmo is always appreciated, as are your questions, comments and DMs! Thanks for joining me on this adventure in geekdom!

More updates to come…

• relative date references now (mostly) working, calculating the difference from the previous absolute date
• loaded Trump's tweets into timeline format
• now doing some stats on locations and people references
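
A rough sketch of how a relative date reference could be resolved against the previous absolute date. This is an illustration under stated assumptions, not the code used for this thread: the function name `resolve_relative`, the small number-word table, and the restriction to day/week offsets are all hypothetical.

```python
import re
from datetime import date, timedelta

# Assumption: only small spelled-out numbers, plus "a"/"an", are supported.
NUMBER_WORDS = {
    "a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
UNIT_DAYS = {"day": 1, "week": 7}

RELATIVE_DATE = re.compile(
    r"\b(\d+|[a-z]+)\s+(day|week)s?\s+(later|earlier)\b", re.IGNORECASE
)


def resolve_relative(phrase, anchor):
    """Resolve e.g. "two days later" against the previously matched date.

    Returns None when the phrase has no relative date this sketch understands.
    """
    match = RELATIVE_DATE.search(phrase)
    if match is None:
        return None
    quantity_text = match.group(1).lower()
    unit = match.group(2).lower()
    direction = match.group(3).lower()
    if quantity_text.isdigit():
        quantity = int(quantity_text)
    else:
        quantity = NUMBER_WORDS.get(quantity_text)
    if quantity is None:
        return None  # e.g. "several days later" is too vague to resolve
    offset = timedelta(days=quantity * UNIT_DAYS[unit])
    return anchor + offset if direction == "later" else anchor - offset


previous = date(2017, 5, 17)
print(resolve_relative("Two days later, the call took place.", previous))
```

Months and years are left out on purpose, since their length varies and `timedelta` only works in fixed units.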

I'm excited by how far this has come!

OCR: I'm suspending remaining OCR clean-up until after I get correlated timelines and other interesting analyses out to everyone

Witness interviews: all dated and tagged based on footnote references

Whoa: using a grammar parse, I can ID subjects/objects for many statements

That last one was a bit of a surprising discovery. Just by using natural language processing to analyze parts of speech and each word's role in the sentence structure, we can get a better statistical feel for "who was doing what to/for whom".
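
To make "who was doing what to/for whom" concrete, here is a small sketch that pulls subject/verb/object triples out of an already-parsed sentence. It assumes a dependency parser has produced the tokens. The `Token` class, the `nsubj`/`dobj`/`obj` labels, and the convention that a root token points at itself are assumptions for illustration, and the sample sentence is made up.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str   # coarse part of speech, e.g. "VERB"
    dep: str   # dependency label, e.g. "nsubj"
    head: int  # index of the governing token (the root points at itself)


def subject_verb_object(tokens):
    """Return (subject, verb, object) triples for every verb that has both."""
    triples = []
    for index, token in enumerate(tokens):
        if token.pos != "VERB":
            continue
        subjects = [t.text for t in tokens if t.head == index and t.dep == "nsubj"]
        objects = [
            t.text for t in tokens if t.head == index and t.dep in ("dobj", "obj")
        ]
        for subject in subjects:
            for obj in objects:
                triples.append((subject, token.text, obj))
    return triples


# Hand-written parse of the made-up sentence "The witness called the lawyer".
parsed = [
    Token("The", "DET", "det", 1),
    Token("witness", "NOUN", "nsubj", 2),
    Token("called", "VERB", "ROOT", 2),
    Token("the", "DET", "det", 4),
    Token("lawyer", "NOUN", "dobj", 2),
]
print(subject_verb_object(parsed))
```

Counting these triples across a document is one way to get the statistical feel described above, though passive constructions and pronouns would need extra handling.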

I should have barebones (not pretty, per se) timelines later tonight, along with statistical analyses of the players; their actions; locations involved; and one other interesting thing I found: the deliberate choice of words that /seem/ like synonyms, but in legal terms…aren't.
Cool. "Did someone adverbly verb?" That's what I asked for this visualization.

Okay, grammar nerds: what string of grammar rules (parts of speech, dependencies, etc) would find some interesting things if sentence fragments matched those rules?
Had a setback with the dates: sentences with multiple dates are supposed to appear under each of those dates in the final dataset, but for some reason, they're not.

I've been at it for a while, but need to shift gears because I'm going around in circles.
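
For reference, the intended behavior for multi-date sentences can be sketched as below. This is not the code with the bug, only an illustration of the goal. The name `group_by_date` and the input shape (pairs of sentence and list of dates) are assumptions.

```python
from collections import defaultdict
from datetime import date


def group_by_date(dated_sentences):
    """Map each date to every sentence that mentions it.

    dated_sentences is an iterable of (sentence, [date, ...]) pairs. A sentence
    with several dates is filed under each of them, once per distinct date.
    """
    grouped = defaultdict(list)
    for sentence, dates in dated_sentences:
        for day in sorted(set(dates)):
            grouped[day].append(sentence)
    # Return a plain dict ordered chronologically.
    return dict(sorted(grouped.items()))


# Made-up input for illustration only.
timeline = group_by_date([
    ("Event A on the 3rd, follow-up on the 5th.",
     [date(2016, 3, 3), date(2016, 3, 5)]),
    ("Event B on the 5th.", [date(2016, 3, 5)]),
])
for day, sentences in timeline.items():
    print(day, len(sentences))
```

Deduplicating with `set` keeps a sentence from appearing twice under one date when that date is mentioned more than once.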

I also have a question for you tweeps in the legal world:

Is there a specific style guide—like the attorney's version of the AP Stylebook—that attorneys generally follow? Something that specifies citation style, approved formats for dates, usage of names, etc.?

Any attorneys that want to collaborate directly, just send a DM. I suspect some of what I've been putting together may be useful for you anyway, and you'd be helping to improve it. 🤓

Anyway, since I'm taking a break, the timeline will be out tomorrow. Once it's out, I'll put up a site for crowdsource volunteers to collaboratively fix OCR errors to help improve the data coming out of these 400+ pages.

Finally: Thanks for your Venmo support, Parker!