David Burge @iowahawkblog, 23 tweets, 3 min read
Serious questions re "machine learning":
What are your criterion variables?
What are your predictor variables, e.g. "account behavior"?
How do you parse, say, sarcasm or irony?
Define "healthy conversation."
Don't worry, we did a regression herpa derp
I feel tempted to go on a 50 tweet rant about the myths and realities of "machine learning" and its uses/abuses
here goes. Feel free to mute me for rest of the day if this is boring to you.
first off "machine learning" is basically a bullshit marketing term for a class of mathematical models for classifying things or fitting a relationship between outcome variables (Ys) and predictor variables (Xs). Regression is an example.
There are any number of similar models, ANNs, RBFs, Random Forests, C5.0, CARTs, blah blah blah etc., all of which are variants of good old regression and all having the same objective: to find a mathematical relationship between Ys and Xs.
Simple example: suppose you had a lemonade stand and every day you recorded # of Glasses of Lemonade Sold (Y) plus Lemonade Price, Competitor Lemonade Price, & Outside Temperature (X1, X2, X3).
Any of these "machine learning" algorithms could help you understand/predict how many glasses of lemonade you will sell under various prices and temperatures. Some might be better/more accurate than others.
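To make the lemonade example concrete, here is a minimal sketch using plain least-squares regression. Every number in it is invented for illustration, and sales are generated from a made-up rule (so there is a known answer for the fit to recover); none of it comes from real data.

```python
import numpy as np

# Made-up daily records, for illustration only.
# Columns: our price ($), competitor price ($), outside temperature (F)
X = np.array([
    [1.00, 1.50, 70.0],
    [1.25, 1.50, 75.0],
    [1.50, 1.25, 80.0],
    [1.00, 1.00, 85.0],
    [2.00, 1.50, 90.0],
    [1.75, 2.00, 95.0],
])

# Made-up "true" rule used to generate glasses sold (Y),
# so we know what the fit should find.
y = 20.0 - 10.0 * X[:, 0] + 5.0 * X[:, 1] + 0.5 * X[:, 2]

# Add a column of ones so the model can learn an intercept.
A = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: find coefs minimizing ||A @ coefs - y||^2.
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)


def predict(price, competitor_price, temperature):
    """Predicted glasses sold for one day under the fitted linear model."""
    features = np.array([1.0, price, competitor_price, temperature])
    return float(coefs @ features)
```

With real, noisy sales data the fit would only approximate the underlying relationship, and a fancier model might or might not do better.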
Now let's consider another machine learning problem: deciding whether a Twitter user is a piece of shit based on their twitter behavior.
First off, this is a bit different than the Lemonade problem in that the source data is text based rather than numeric. Lemonade price and sales are pretty straightforward numeric, but what about a tweet that says "LOL your mom is a pig"?
In order to model text data it has to be presented in a numeric way. For example, X1 might be a 0/1 indicator of the presence of the word "mom" and X2 similar for "pig."
So we might consider looking at how much of a Twitter POS someone is based on the number of times they used the words "mom" or "pig," the number of times they used both together, or other bad words.
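A rough sketch of that text-to-numbers step, assuming a hypothetical two-word vocabulary and a very naive tokenizer (both are illustrative choices, not anything Twitter is known to use):

```python
import re

VOCAB = ["mom", "pig"]  # hypothetical vocabulary, for illustration only


def featurize(text, vocab=VOCAB):
    """Return (0/1 presence indicators, raw counts) for each vocab word."""
    # Naive tokenizer: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    present = [1 if word in tokens else 0 for word in vocab]
    counts = [tokens.count(word) for word in vocab]
    return present, counts


# X1 and X2 for the example tweet.
presence, counts = featurize("LOL your mom is a pig")
```

Note how brittle exact matching is: under this sketch "LOL ur moms a pig" gets a 0 for "mom" because the token is "moms".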
There are canned algorithms that reference a dictionary and purport to identify a numeric "sentiment" in a text string ("LOL ur moms a pig"=-0.83), but they are not always reliable. Algorithms for sarcasm detection are worse.
All those issues aside, suppose we could quantify/enumerate the totality of someone's twitter behavior into a series of numeric values representing words used in replies, sentiment, sarcasm, people talked to, etc.
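For a feel of how a dictionary-based scorer works, and why sarcasm trips it up, here is a toy sketch. The lexicon and its scores are invented for illustration; real sentiment tools use far larger dictionaries and extra rules.

```python
import re

# Hypothetical word -> sentiment score lexicon, invented for illustration.
LEXICON = {
    "pig": -0.9,
    "hate": -0.8,
    "great": 0.8,
    "genius": 0.7,
    "love": 0.9,
}


def sentiment(text, lexicon=LEXICON):
    """Average the scores of known words; 0.0 if no word is in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


insult = sentiment("LOL your mom is a pig")            # negative
sarcasm = sentiment("oh great, another genius take")   # scored as positive
```

The sarcastic line comes out positive because the scorer only sees "great" and "genius"; it has no notion of tone.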
Now here comes the real problem: suppose we had those "behavioral" values for a big sample of Twitter users. These are the predictor variables (Xs). Now we need a Y value indicating what a POS they are. I mean their "detraction from conversational health."
Exactly how you would determine Y here is a poser. Is it the number of times they've been reported or banned? Number of followers? Number of followers that have been banned? Number of followers of the people who reported them?
Let's ignore that for the moment and boil it down to Twitter's problem at hand: a "machine learning" model that quantifies the following relationship:

How Much a POS This Account Is = f(All the Crap They Say On Twitter)
Here's what I guess: they threw a bunch of this data, for some sample of users, into some flavor of neural network, and VOILA! A magical robot now scores everybody on Twitter on POS. I'd guess they have my, and your, POS value stored somewhere.
Further I'd guess POS score is largely determined by a relatively small set of behaviors (presence/frequency of various words, reply activity).
The method (ANNs, random forests, etc.) isn't nearly as important here as the data representation. For example, if Twitter is providing a POS score for their machine learning sample, how do they define it?
For example let's say someone's POS value = the number of times they've been reported. What if they were actually a nice guy, and they were targeted/reported by a bunch of low-level actual POSs?
After enough training examples of this, the machine might tell you that words and phrases like "I don't like nazis" and "Ferentz" were harbingers of evil.
I'm guessing that's how the whole shadow banning issue came up. I seriously doubt Twitter has hardcoded "ban republicans" into an algorithm; it's just a shittily designed algorithm.
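A toy sketch of how that could happen, assuming Y is simply "times reported" and the model amounts to averaging Y over the accounts that used each word. The sample texts and report counts are invented for illustration, and this is a stand-in for a real learner, not a claim about Twitter's actual method.

```python
from collections import defaultdict

# (tweet text, times reported) -- all values invented for illustration.
samples = [
    ("i don't like nazis", 9),   # nice guy, mass-reported by a brigade
    ("go hawks ferentz", 7),     # nice guy, mass-reported by a brigade
    ("your mom is a pig", 2),    # actually rude, rarely reported
    ("nice weather today", 0),
    ("lemonade is great", 0),
]


def word_scores(samples):
    """Average report count across the tweets containing each word."""
    reports_by_word = defaultdict(list)
    for text, reports in samples:
        for word in set(text.split()):
            reports_by_word[word].append(reports)
    return {w: sum(r) / len(r) for w, r in reports_by_word.items()}


scores = word_scores(samples)
```

Here "nazis" and "ferentz" end up with higher scores than "pig": the model faithfully learns whatever the reports say, including the brigade's bias.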