Announcing DROP, a new reading comprehension benchmark that requires discrete reasoning over paragraphs of text. New @NAACLHLT paper by @ddua17, @yizhongwyz, @pdasigi, @GabiStanovsky, @sameer_, and me. allennlp.org/drop.html arxiv.org/abs/1903.00161
I am super excited about this; I've been thinking about this for over a year, and we finally decided to pursue it as our first collaboration between AI2 Irvine and the UCI NLP group. This is a hard dataset that uses complex questions to test comprehensive understanding.
Key idea: use compositional questions inspired by the semantic parsing literature to put together many pieces of information from a single paragraph. You must get _multiple_ structures from the paragraph correct in order to answer most questions.
Things like "Which players made touchdowns longer than 10 yards?", "How many empires attacked Guadalajara?", "Which two ethnicities were tied for the most common?" and "How long was the second longest field goal?".
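To make the "multiple structures" point concrete, here is a minimal Python sketch of the discrete steps two of these questions need once the relevant facts have been read out of the paragraph. The plays, player names, and helper functions are all invented for illustration; they are not from DROP or from any model in the paper.

```python
from typing import List, NamedTuple


class ScoringPlay(NamedTuple):
    player: str
    kind: str   # "touchdown" or "field goal"
    yards: int


# Hypothetical plays, made up for this example (not taken from DROP).
PLAYS = [
    ScoringPlay("Smith", "touchdown", 5),
    ScoringPlay("Jones", "touchdown", 22),
    ScoringPlay("Lee", "field goal", 48),
    ScoringPlay("Lee", "field goal", 31),
    ScoringPlay("Brown", "touchdown", 14),
    ScoringPlay("Lee", "field goal", 40),
]


def players_with_long_touchdowns(plays: List[ScoringPlay], min_yards: int) -> List[str]:
    """'Which players made touchdowns longer than N yards?' -> filter, then project."""
    return [p.player for p in plays if p.kind == "touchdown" and p.yards > min_yards]


def second_longest(plays: List[ScoringPlay], kind: str) -> int:
    """'How long was the second longest X?' -> filter, sort descending, take index 1."""
    lengths = sorted((p.yards for p in plays if p.kind == kind), reverse=True)
    return lengths[1]


print(players_with_long_touchdowns(PLAYS, 10))  # ['Jones', 'Brown']
print(second_longest(PLAYS, "field goal"))      # 40
```

Getting either answer right means every play has to be extracted correctly first; one wrong yardage or one missed play changes the result.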
The data is collected with an *adversarial baseline* running in the background - when crowd workers ask questions, we send them to a server running BiDAF, and if BiDAF gets it right, we tell them to ask a harder question. We also give lots of examples of hard questions.
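The filtering loop described above can be sketched roughly as follows. This is only an illustration of the idea: `accept_question`, `normalize`, and the toy `first_word_baseline` are hypothetical names, and the real collection setup queried a BiDAF server rather than a local function.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(text.lower().split())


def accept_question(
    passage: str,
    question: str,
    gold_answer: str,
    baseline: Callable[[str, str], str],
) -> bool:
    """Keep a question only if the baseline reader answers it incorrectly."""
    predicted = baseline(passage, question)
    return normalize(predicted) != normalize(gold_answer)


def first_word_baseline(passage: str, question: str) -> str:
    """Toy stand-in for the real reader (assumption): always returns the first word."""
    return passage.split()[0]


passage = "Denver scored first with a 20-yard field goal."
# The toy baseline gets this one right, so the worker would be asked for a harder question.
print(accept_question(passage, "Who scored first?", "Denver", first_word_baseline))
# The toy baseline gets this one wrong, so the question is kept.
print(accept_question(passage, "How many points is a field goal worth?", "3", first_word_baseline))
```

Exact string match is the simplest possible correctness check; a real setup would more plausibly use an overlap threshold.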
@ddua17 did an amazing job designing the data collection aspect of this, and pushing through to get a dataset of 96k questions. @yizhongwyz came up with a great extension to QANet that adds some simple numerical reasoning on top.
Key result: best baseline model (BERT SQuAD model, retrained on DROP) gets ~32 F1. @yizhongwyz's NAQANet gets ~47 F1. Humans get ~96 F1. Still a *long* way to go. But it's feasible, with a similar format to SQuAD - just adding numbers, dates, and multiple spans as outputs.
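For readers unfamiliar with the F1 numbers above, here is a simplified SQuAD-style token-overlap F1 in Python. It is a sketch of the general idea only, under the assumption of plain whitespace tokenization; it is not the official DROP evaluation script, which also has to deal with numbers, dates, and multiple spans.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("48 yards", "48 yards"))  # 1.0
print(token_f1("48", "40"))              # 0.0
```

Note that for a purely numeric answer like "48" versus "40" this metric is all-or-nothing, which is part of why counting and arithmetic questions are unforgiving.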
You can try out a demo here: demo.allennlp.org/reading-compre…. I put a lot of interesting examples in there, highlighting the different kinds of questions and capabilities of the model (and some where it fails!).
I was pretty shocked at how well this model can do maxes and mins. You can change the numbers in the paragraph and it gets it right most of the time. Its counting ability is pretty rudimentary, though (it answers 2 most of the time), and "second largest" is too hard for it.
Play around with the demo, and let us know what you think! Self-serve leaderboard with a hidden test set will be coming soon (getting the self-serve part ready is taking a bit of time - it'll be based on Docker and Beaker).
Oh, another thing - this model often does *worse than BiDAF* on SQuAD-like questions (and BiDAF isn't all that great these days). You can see this in the demo by switching back and forth. We still have a long way to go on general reading comprehension systems.
Also, the paper is still not camera ready - some of the dataset analysis needs to be updated to include the (harder) second half of the dataset that was collected after the submission deadline. But the results are correct, and on the full dataset.
An additional commentary thread (just attaching it here for easy finding):