1. Someone creates a dataset & describes it as a benchmark for some skill. Sometimes, the skill is well-scoped & the benchmark represents a portion of it reasonably (eg ASR for news in X language). Other times, the skill is bogus (all the physiognomy tasks; IQ from text samples; &c).
2. ML researchers use the benchmarks to test different approaches to learning. This can be done well (for well-scoped tasks): which algorithms are suited to which tasks and why? (Requires error analysis, and understanding the task as well as the algorithm.)
3. The focus shifts from understanding how learning algorithms relate to different kinds of data/tasks to leaderboardism. Many progress, much arXiv, wow!
4. In this research paradigm, no time is spent on critical analysis of benchmarks, including the data that make them up but especially also the task definitions.
5. #AIhype kicks in and the research is framed as "solving" such grand challenges as "language understanding" or "visual understanding" or...
6. That critical analysis work is happening, in at least two places: Within ML, like the work you point to, with typically a narrow lens ("hey look, the labels are bad!") but maybe getting some attention from the #AIhype crew.
But more important versions of the work, those that critique the underlying conceptions of tasks and goals and claims, tend to be marginalized: seen as lower on the 'hierarchy of knowledge' (see @timnitGebru's talk at Spelman).
The folks who think of themselves as at the top of that hierarchy because they are the ones who create & apply the algorithms seem to largely dismiss (or just ignore) the work of people like Birhane, Gebru, Benjamin, Noble, Mitchell, Raji, Whittaker.
Note that some of this critical work is coming from people with deep training in ML, Gebru, Mitchell & Raji among them, but I think it's still an uphill battle to have it be taken as serious research by researchers who think of themselves as 'core ML'.
So we get:
MLbro: Look, I've solved language!
Critical researcher: No, you haven't. That task is bogus.
Rev2: But you didn't show what other task would prove the claim, so reject.
Or:
MLbro: Look, I can predict criminality from faces!
Critical researcher: No you can't and claiming you can feeds into racism/other systems of oppression.
Rev2: Keep your politics out of our conference.
But this works:
MLbro: Look, I've solved language!
Other ML researcher: Well actually, the dataset in that benchmark is really messy and maybe you're just modeling noise.
Rev2: Okay, I guess we'll let this one through.
If ML/AI were just an obscure academic field this might not matter so urgently, but that's not the world we live in. The hierarchies of knowledge (see ⬆️) are closely modeled by hierarchies of funding and meanwhile techcos are pushing out AI snakeoil that is doing real harm.
💯 this! Overfunding is bad for the overfunded fields, bad for researchers in the overfunded fields, bad for fields left to starve, and bad for society as a result of both of those.
“I’ve been frustrated for a long time about the incentive structures that we have in place and how none of them seem to be appropriate for the kind of work I want to do,” -- @timnitGebru on the founding of @DAIRInstitute
“how to make a large corporation the most amount of money possible and how do we kill more people more efficiently,” Gebru said. “Those are […] goals under which we’ve organized all of the funding for AI research. So can we actually have an alternative?” bloomberg.com/news/articles/…
“AI needs to be brought back down to earth,” said Gebru, founder of DAIR. “It has been elevated to a superhuman level that leads us to believe it is both inevitable and beyond our control. >>
A few thoughts on citational practice and scams in the #ethicalAI space, inspired by something we discovered during my #ethNLP class today:
Today's topic was "language variation and emergent bias", i.e. what happens when the training data isn't representative of the language varieties the system will be used with.
Week by week, we've been setting our reading questions/discussion points for the following week as we go, so that's where the questions listed for this week come from.
"Bender notes that Microsoft’s introduction of GPT-3 fails to meet the company’s own AI ethics guidelines, which include a principle of transparency" from @jjvincent on the @verge:
The principles are well researched and sensible, and working with their customers to ensure compliance is a laudable goal. However, it is not clear to me how GPT-3 can be used in accordance with them.
About once a week, I get email from someone who'd like me to take the time to personally, individually, argue with them about the contents of Bender & Koller 2020. (Or, I gather, just agree that they are right and we were wrong.)
I don't answer these emails. To do so would be a disservice to, at the very least, the students to whom I do owe my time, as well as my own research and my various other professional commitments.
It's not that I object to people disagreeing with me! While I am committed to my own ideas, I don't have the hubris to believe I can't be wrong.