📢 A 🧵 on the Trends in NLP Datasets.

What’s changed since SQuAD was all the rage in 2016? A: A LOT. 🔭

1. Generic ➡️ Niche Tasks
2. Task-specific Training+Eval ➡️ Eval Only
3. Dataset ➡️ Benchmark ➡️ Massive Collections
4. Datasets ➡️ Diagnostics

What started as a trickle became an explosion of NLP datasets over the last few years.

Sebastian Ruder (@seb_ruder) used to track NLP datasets on his website, nlpprogress.com, but it's no longer possible to keep it up to date.

🌟 Trend 1 🌟 Generic datasets are being replaced with more niche datasets.

⏳ Before: datasets released for general tasks.

⌛️ Now: We see tasks targeting hyper-specific abilities.


For general QA ➡️ SQuAD
➕ Retrieval ➡️ Open SQuAD
➕ Other domains ➡️ NewsQA, BioASQ, TriviaQA …
➕ Multilingual ➡️ MLQA, XQuAD, TyDiQA, XORQA, MKQA …
➕ Adversarial / Numerical / Causal / Social / Physical Reasoning ➡️ Adversarial SQuAD, DROP, ROPES, SocialIQA, PIQA, …
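To make the QA-dataset family concrete, here's a hand-written sketch of a SQuAD-style extractive QA record (the values are my own illustration, not an actual SQuAD entry): answers are stored as text plus a character offset into the context.

```python
# A minimal SQuAD-style extractive QA record (illustrative, not from the real dataset).
record = {
    "question": "When was SQuAD released?",
    "context": "SQuAD, a reading comprehension benchmark, was released in 2016.",
    "answers": {"text": ["2016"], "answer_start": [58]},
}

def check_span(rec):
    """Verify each answer's text matches the marked character span in the context."""
    ctx = rec["context"]
    return all(
        ctx[start : start + len(text)] == text
        for text, start in zip(rec["answers"]["text"], rec["answers"]["answer_start"])
    )
```

Checking the span offsets like this is a common sanity check when preparing extractive-QA data.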

🌟 Trend 2 🌟

⏳ Before: It was important to release a training set with an eval task. Why use a dataset if you need to find/prep your own training set?

⌛️ Now: Datasets are often released for evaluation only, because:

1. There are training sets for *almost* everything
2. LLMs are expected to generalize to any task given an instruction
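The eval-only shift rests on zero-shot prompting: the task is conveyed by an instruction instead of a training set. A minimal sketch of assembling such a prompt (the template and example are my own illustration, not a standard format):

```python
def build_zero_shot_prompt(instruction: str, example_input: str) -> str:
    """Turn an eval-only example into a zero-shot prompt: no training data,
    just an instruction describing the task, followed by the input."""
    return f"{instruction}\n\nInput: {example_input}\nOutput:"

prompt = build_zero_shot_prompt(
    "Classify the sentiment of the movie review as positive or negative.",
    "A stunning, heartfelt film from start to finish.",
)
```

An eval-only dataset then just needs inputs and gold outputs; the "training" lives in the instruction.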

🌟 Trend 3 🌟

Many tasks can be packaged into benchmarks that represent larger evaluation concepts.

➕ GLUE and SuperGLUE ➡️ General English NLU
➕ XGLUE and XTREME ➡️ General Multilingual NLU
➕ KILT ➡️ Knowledge-Intensive NLU

But now researchers want to evaluate LLMs on 100+ tasks. (Because with zero- or few-shot prompting, they easily can!)

Can one eval suite answer:
(A) Where did we get SOTA?
(B) Are there emergent properties?
(C) What are the remaining weaknesses?

To answer these Qs, benchmarks were cannibalized and collated into 📚massive collections📚, spanning many loosely grouped skills.

➕ FLAN, T0, ExT5, MetaICL ➡️ 100s of tasks each
➕ BigBench ➡️ 100+ tasks
➕ Natural Instructions ➡️ 1600+ tasks
➕ 🤗 Datasets ➡️ 1000+ tasks
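Evaluating on a massive collection mostly reduces to one loop: run the model on every task and aggregate a shared metric. A minimal sketch (the two-task suite and the echo "model" are toy stand-ins, not any real benchmark's API):

```python
from typing import Callable

def evaluate_suite(model: Callable[[str], str],
                   suite: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    """Run one model over every task in a collection; report per-task accuracy."""
    scores = {}
    for task_name, examples in suite.items():
        correct = sum(model(inp) == gold for inp, gold in examples)
        scores[task_name] = correct / len(examples)
    return scores

# Hypothetical two-task suite, scored with a trivial echo "model".
suite = {
    "copy_task": [("hello", "hello"), ("world", "world")],
    "reverse_task": [("abc", "cba"), ("ab", "ba")],
}
scores = evaluate_suite(lambda x: x, suite)
# The echo model aces copy_task (1.0) and fails reverse_task (0.0).
```

Per-task scores like these are exactly what lets a suite answer (A), (B), and (C) above: aggregate for SOTA, slice by skill for emergence and weaknesses.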

🌟 Trend 4 🌟

Analysis and diagnostics are gradually being elevated to the same standing as eval datasets.

This is important since Evaluations often drive research community incentives.


➕ Heuristic Analysis (HANS) by McCoy, Pavlick, @tallinzen
➕ Behavioral Testing with CheckList by @marcotcr, @tongshuangwu
➕ ANLIzing the Adversarial NLI Dataset by @adinamwilliams, @TristanThrush, @douwekiela
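CheckList-style behavioral testing can be sketched as an invariance check: apply a label-preserving perturbation and flag inputs whose prediction flips. This is a toy illustration of the idea, not the CheckList library's actual API:

```python
def invariance_test(model, inputs, perturb):
    """Flag inputs whose prediction changes under a label-preserving perturbation."""
    failures = []
    for text in inputs:
        if model(text) != model(perturb(text)):
            failures.append(text)
    return failures

# Illustrative check: sentiment should be invariant to swapping a person's name.
swap_name = lambda s: s.replace("John", "Mary")
toy_model = lambda s: "positive" if "great" in s else "negative"
fails = invariance_test(toy_model, ["John had a great day.", "John was upset."], swap_name)
# toy_model ignores names, so fails == []
```

A model that keyed on the name instead of the sentiment words would show up in `fails`, which is the diagnostic signal these test suites are after.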

The panel at @DADCworkshop #NAACL2022 was full of interesting future ideas for dataset development:

➕ Expiration dates on training sets
➕ Interactive datasets w/ humans-in-the-loop
➕ Adversarially refreshing datasets w/ humans + machines

🌐: dadcworkshop.github.io

Thank you for reading!

And thanks to @_jasonwei, @albertwebson, @emilyrreif for feedback on this 🧵!

NB: I couldn’t cite all the great examples of these trends in a short thread, but please comment if I missed any great ones, or if you agree or disagree! :)


Thread by Shayne Longpre (@ShayneRedford).

