Neat negative result spotted at #ACL2021:
I've seen a number of efforts that try to use MNLI models to do other classification tasks by checking whether the input entails statements like 'this is a negative review'. (1/...)
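For concreteness, here's roughly what that recipe looks like with an off-the-shelf MNLI model. This is just my own minimal sketch using the Hugging Face zero-shot-classification pipeline; the model choice, example text, and labels are only for illustration:

```python
# Sketch of the zero-shot-via-NLI recipe under discussion: the input text is
# treated as the premise, and each candidate label is turned into a hypothesis
# like "This is a negative review." via the hypothesis template.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The plot was predictable and the acting was wooden.",
    candidate_labels=["positive", "negative"],
    hypothesis_template="This is a {} review.",
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label
```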
This never really made sense. The data collection process behind SNLI/MNLI was meant to capture the relationship between two things that the same speaker could have said in the same situation.
That means strings like 'the text' or 'the premise' or 'the author' are rare in MNLI, and when they appear, they refer to something _that was referred to in the premise_, not to the premise itself or to its author.
So, examples like the one in the screenshot seem broken: that kind of hypothesis doesn't make sense in the context of most premises, and we should expect models to be confused and to behave erratically.
And they do! Tingting Ma et al. from MSR find that when using this kind of prompting, you'll actually do _better_ if you use a next-sentence-prediction model rather than an MNLI model, ...
... suggesting that the limited successes we've seen here have to do with a vague ability for BERT-style models to recognize topic/style similarity, and that the specific abilities that a model learns from MNLI aren't really helping. The world makes sense!
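To make the comparison concrete, here's a rough sketch of the alternative the thread describes: scoring each label prompt with BERT's next-sentence-prediction head instead of an MNLI entailment head. This is my own illustration of the general idea, not the authors' exact setup:

```python
# Score each candidate prompt by how plausibly it "follows" the input text,
# using the pretrained next-sentence-prediction head (no MNLI fine-tuning).
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

text = "The plot was predictable and the acting was wooden."
prompts = {"positive": "This is a positive review.",
           "negative": "This is a negative review."}

scores = {}
for label, prompt in prompts.items():
    inputs = tokenizer(text, prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # index 0 = "sentence B plausibly continues sentence A"
    scores[label] = torch.softmax(logits, dim=-1)[0, 0].item()

print(max(scores, key=scores.get), scores)
```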
The total silence from @aclmeeting about the wave of technical issues that has derailed much of the conference is pretty bizarre, especially when there's no other working channel for technical support or rapid updates.
For anyone tempted to boycott future ACL events, be reassured (?) that this event is put together almost from scratch by an almost-entirely-new group of volunteer researchers every year for some reason.
So, (i) I don't envy the Virtual Infrastructure Committee right now—they're presumably just figuring this out themselves—and (ii) they won't have this job next time.
In my experience with *ACL events, reviewer and AC expectations don't differ in any significant or predictable way across tracks. (Plus, many other AI/ML conferences don't use tracks, and it doesn't seem like the dynamics at these conferences are meaningfully different.)
So, adding/removing/renaming tracks doesn't, on its own, seem likely to make any predictable change in outcomes.
We need more of this ethically serious, academically careful discussion of what we're doing.
(I'm on record arguing against model-in-the-loop data collection for test sets, which Chris mentions, and I still think it's incompatible with the goals stated here, but the Dyna*Board*-style leaderboard design that Chris focuses on more directly in the talk is important/exciting.)
I've been thinking a lot lately about what we can do to keep pushing progress on language understanding once we start to reach the scaling limits of self-supervised pretraining... (🚨 new paper, thread 🚨)
Grounding and embodiment are obviously one promising direction, but there's a lot that will be difficult or impossible to learn that way under anything that resembles current technology.
How about we just create or use *annotated* data to teach our models the skills they aren't already learning well through pretraining? We already know that this works, but we don't know much about when or why...