A brilliant article with insights from @emilymbender, @sarahbmyers (@AINowInstitute), and more. But taking a step back:
As an NLP researcher, I'm asking: what the freaking hell is anyone doing grading student essays with automated tools that I wouldn't trust on my academic datasets?
In 18 states "only a small percentage of students’ essays ... will be randomly selected for a human grader to double check the machine’s work".
In writing you're tasked with speaking to and convincing an audience through a complex, lossy, and fluid medium: language.
Guess what NLP is still bad at? Even if the marks aren't determining your life (!) the feedback you receive will be beyond useless. You're not having a conversation with a human. You're not convincing them. You're at best tricking a machine. A likely terribly ineffective machine.
Do you think that these systems from closed companies are equivalent in performance to the state of the art in academia? Here's a hint: they definitely aren't. We know for certain that the logic and reasoning of even our existing SotA tools are unreliable in the best of circumstances.
Why do we think machines are ready to judge the words of any human, let alone a young student where the feedback will potentially shape their mind and their life? To intelligently deconstruct their writing and offer insight into how they can better themselves? To _judge_ them?
We've taken the already problematic concept of "teaching to the test" and elevated it to parody.
The test is free form text marked by a machine that can't read or write language with true logic or reasoning.
Write an essay that can trick this system into scoring you well.
This is our intellectual dystopia version of a Brave New World. We've replaced reason with poorly approximated logic in the most dangerous of places. We'll only see these perverse interactions play out over the long term: a generation of students taught and judged by broken machines.
How about a sanity check?
Can the automated grading system even approximately answer the question it's grading?
We'd expect that from a human marker, right?
That doesn't guarantee it'll grade well but at least it's a first level sanity pass. This is not a "simple" question...
Maybe "more fair" - let's at least see how these grading systems perform on grading a selection of correct / incorrect answers to elementary and middle school questions from @allen_ai's ARISTO. I don't think you'll be shocked by the outcome ... -_- allenai.org/aristo/
To add to a night of technical oddities, there are three Cruise vehicles, all (literally) driverless, stuck at and partially blocking the corner of Geary and Mason 😅
There were originally four Cruise vehicles but one eventually made a grand escape. The leading Cruise vehicle has been there at least fifteen minutes as that's how long I had to wait for fast food. Occasionally one of them would lurch forward a little just for added suspense 🙃
To note, the ones behind it that are occasionally moving have a different UI state, so maybe they're just being particularly wary ¯\_(ツ)_/¯
For those in the language modeling space, a question regarding perplexity as a metric with varying tokenization:
- Is there a hard proof showing that, for a dataset D tokenized using schemes A and B, the perplexity is equivalent?
- Does that proof take into account teacher forcing?
I ask as I have never seen such a proof and always assumed smarter people than myself had thought about it. Intuitively I felt it was reasonable until I recently began pondering the teacher forcing aspect, which essentially gives your model supervision, including at test time.
Imagine you had the task of language modeling:
"Bob and Alice were fighting for first place but who won? [predict: Bob or Alice]"
The claim is that the language model's perplexity (confusion) should be equal regardless of how we split the text.
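To make the question concrete, here's a minimal sketch (my own toy setup, not from the thread): a dummy "model" scoring the same text under a word-level and a character-level tokenization. Per-token perplexity isn't directly comparable because the token counts differ; normalizing the total negative log-likelihood by a tokenization-independent unit (characters) at least puts the numbers on the same scale, though it still doesn't resolve the teacher forcing concern, as more tokens means more points of gold-history supervision.

```python
import math

text = "Bob and Alice were fighting for first place but who won? Bob"

# Two hypothetical tokenizations of the same text.
tok_a = text.split()   # word-level tokens
tok_b = list(text)     # character-level tokens

def nll_nats(tokens, probs):
    # Total negative log-likelihood in nats, given per-token probabilities.
    return -sum(math.log(probs.get(t, 1e-6)) for t in tokens)

# Stand-in "model": a flat distribution over each tokenization's vocabulary
# (purely for illustration; any real LM would condition on context).
flat_a = {t: 1 / len(set(tok_a)) for t in tok_a}
flat_b = {t: 1 / len(set(tok_b)) for t in tok_b}

nll_a, nll_b = nll_nats(tok_a, flat_a), nll_nats(tok_b, flat_b)

# Per-token perplexity: the denominators differ, so A and B aren't comparable.
print("per-token ppl (word):", math.exp(nll_a / len(tok_a)))
print("per-token ppl (char):", math.exp(nll_b / len(tok_b)))

# Per-character normalization: same denominator for both tokenizations,
# so the numbers at least live on the same scale.
print("per-char ppl (word):", math.exp(nll_a / len(text)))
print("per-char ppl (char):", math.exp(nll_b / len(text)))
```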
In Dec 2016 Uber started taking _paid rides_ in _self driving cars_ without even filing for an autonomous testing permit in CA. That first day in SF it blew multiple red lights and had disengagement numbers hundreds of times worse than other companies.
Less than two years later, after Uber had upped and left San Francisco due to their egregious behaviour, their self-driving car killed someone. I collected in a thread why, given their checkered past, I had zero faith in their ability to safely execute.
Today: the National Transportation Safety Board (NTSB) noted the system "did not include a consideration for jaywalking pedestrians". Elaine Herzberg was classified as a flurry of objects {other, bike, vehicle, ...} 5.6 seconds before impact. theregister.co.uk/2019/11/06/ube…
Deep learning training tip that I realized I do but never learned from anyone: when tweaking your model to improve gradient flow / speed of convergence, keep the exact same random seed and hyperparameters (so weight initializations match) and only modify the model interactions (sketch below).
- Your model runs will have the exact same perplexity spikes (hits confusing data at the same time)
- You can compare timestamp / batch results in early training as a pseudo-estimate of convergence
- Improved gradient flow visibly helps the same init do better
It's important to change out the random seed occasionally once you think you've isolated progress, but minimizing noise during experimentation is OP. You're already dealing with millions of parameters and billions of calculations. You don't need any more confusion in the process.
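A minimal sketch of what I mean by pinning the seed, assuming PyTorch (the seed value and layer sizes are placeholders, not a prescription): every source of randomness is seeded identically before the baseline run and the tweaked run, so any divergence in the loss curves comes from the model change itself.

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 1111) -> None:
    random.seed(seed)                 # Python-level RNG (shuffling, sampling)
    np.random.seed(seed)              # NumPy RNG (data pipelines, masking)
    torch.manual_seed(seed)           # CPU weight init and dropout
    torch.cuda.manual_seed_all(seed)  # GPU RNGs, if CUDA is available

# Baseline run and tweaked run share the same seed, hyperparameters, and data
# order, so only the model interaction you're testing differs between them.
set_seed(1111)
baseline = torch.nn.LSTM(input_size=400, hidden_size=1150, num_layers=3)

set_seed(1111)
tweaked = torch.nn.LSTM(input_size=400, hidden_size=1150, num_layers=3)
# ...apply the gradient-flow tweak here (e.g. an extra skip connection)...

# Sanity check: identical seeding means identical starting weights for the
# shared parts of the two models.
same = all(torch.equal(a, b)
           for a, b in zip(baseline.parameters(), tweaked.parameters()))
print("identical initial weights:", same)
```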
I'm incredibly proud that the low compute / low resource AWD-LSTM and QRNN that I helped develop at @SFResearch live on as first class architectures in the @fastdotai community :)
I think the community has become blind in the BERT / Attention Is All You Need era. If you think a singular architecture is the best, for whatever metric you're focused on, remind yourself of the recent history of model architecture evolution.
Whilst pretrained weights can be an advantage, they also tie you to someone else's whims. Did they train on a dataset that fits your task? Was your task ever intended? Did their setup have idiosyncrasies that might bite you? Will you hit a finetuning progress dead end?
What is OpenAI? I don't know anymore.
A non-profit that leveraged goodwill whilst silently giving out equity for years, prepping a shift to for-profit, and that is now seeking to license closed tech through a third party by segmenting it under a banner of pre/post-"AGI" technology?
The non-profit/for-profit/investor partnership is held together by a set of legal documents that are entirely novel (= a bad term in legal docs), are non-public and unclear, and have no case precedent, yet promise to wed operations to a vague (and already re-interpreted) OpenAI Charter.
The claim is that AGI needs to be carefully and collaboratively guided into existence, yet the output of almost every other existing commercial lab is more open. OpenAI runs a closed ecosystem where they largely don't or won't trust anyone outside of a small bubble.