📢 A 🧵 on the Trends in NLP Datasets.

What’s changed since SQuAD was all the rage in 2016? A: A LOT. 🔭

1. Generic ➡️ Niche Tasks
2. Task-specific Training+Eval ➡️ Eval Only
3. Dataset ➡️ Benchmark ➡️ Massive Collections
4. Datasets ➡️ Diagnostics

What started as a trickle became an explosion of NLP datasets over the last few years.

Sebastian Ruder (@seb_ruder) used to track NLP datasets on his website, nlpprogress.com, but it's no longer possible to keep it up to date.

🌟 Trend 1 🌟 Generic datasets are being replaced with more niche datasets.

⏳ Before: datasets released for general tasks.

⌛️ Now: We see tasks targeting hyper-specific abilities.


For general QA ➡️ SQuAD
➕ Retrieval ➡️ Open SQuAD
➕ Other domains ➡️ NewsQA, BioASQ, TriviaQA …
➕ Multilingual ➡️ MLQA, XQuAD, TyDiQA, XORQA, MKQA …
➕ Adversarial / Numerical / Causal / Social / Physical Reasoning ➡️ Adversarial SQuAD, DROP, ROPES, SocialIQA, PIQA, …
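To make the QA-dataset family concrete, here's a hand-written sketch of a SQuAD-style extractive QA record (the values are my own illustration, not an actual SQuAD entry): answers are stored as text plus a character offset into the context.

```python
# A minimal SQuAD-style extractive QA record (illustrative, not from the real dataset).
record = {
    "question": "When was SQuAD released?",
    "context": "SQuAD, a reading comprehension benchmark, was released in 2016.",
    "answers": {"text": ["2016"], "answer_start": [58]},
}

def check_span(rec):
    """Verify each answer's text matches the marked character span in the context."""
    ctx = rec["context"]
    return all(
        ctx[start : start + len(text)] == text
        for text, start in zip(rec["answers"]["text"], rec["answers"]["answer_start"])
    )
```

Checking the span offsets like this is a common sanity check when preparing extractive-QA data.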

🌟 Trend 2 🌟

⏳ Before: It was important to release a training set with an eval task. Why use a dataset if you need to find/prep your own training set?

⌛️ Now: Datasets are often released for evaluation only, because:

1. There are training sets for *almost* everything
2. LLMs are expected to generalize to any task given an instruction
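The eval-only shift rests on zero-shot prompting: the task is conveyed by an instruction instead of a training set. A minimal sketch of assembling such a prompt (the template and example are my own illustration, not a standard format):

```python
def build_zero_shot_prompt(instruction: str, example_input: str) -> str:
    """Turn an eval-only example into a zero-shot prompt: no training data,
    just an instruction describing the task, followed by the input."""
    return f"{instruction}\n\nInput: {example_input}\nOutput:"

prompt = build_zero_shot_prompt(
    "Classify the sentiment of the movie review as positive or negative.",
    "A stunning, heartfelt film from start to finish.",
)
```

An eval-only dataset then just needs inputs and gold outputs; the "training" lives in the instruction.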

🌟 Trend 3 🌟

Many tasks can be packaged into benchmarks that represent larger evaluation concepts.

➕ GLUE and SuperGLUE ➡️ General English NLU
➕ XGLUE and XTREME ➡️ General Multilingual NLU
➕ KILT ➡️ Knowledge-Intensive NLU

But now researchers want to evaluate LLMs on 100+ tasks. (Because with zero- or few-shot prompting, they easily can!)

Can one eval suite answer:
(A) Where did we get SOTA?
(B) Are there emergent properties?
(C) What are the remaining weaknesses?

To answer these Qs, benchmarks were cannibalized and collated into 📚massive collections📚, spanning many loosely grouped skills.

➕ FLAN, T0, ExT5, MetaICL ➡️ 100s of tasks each
➕ BigBench ➡️ 100+ tasks
➕ Natural Instructions ➡️ 1600+ tasks
➕ 🤗 Datasets ➡️ 1000+ tasks
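Evaluating on a massive collection mostly reduces to one loop: run the model on every task and aggregate a shared metric. A minimal sketch (the two-task suite and the echo "model" are toy stand-ins, not any real benchmark's API):

```python
from typing import Callable

def evaluate_suite(model: Callable[[str], str],
                   suite: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    """Run one model over every task in a collection; report per-task accuracy."""
    scores = {}
    for task_name, examples in suite.items():
        correct = sum(model(inp) == gold for inp, gold in examples)
        scores[task_name] = correct / len(examples)
    return scores

# Hypothetical two-task suite, scored with a trivial echo "model".
suite = {
    "copy_task": [("hello", "hello"), ("world", "world")],
    "reverse_task": [("abc", "cba"), ("ab", "ba")],
}
scores = evaluate_suite(lambda x: x, suite)
# The echo model aces copy_task (1.0) and fails reverse_task (0.0).
```

Per-task scores like these are exactly what lets a suite answer (A), (B), and (C) above: aggregate for SOTA, slice by skill for emergence and weaknesses.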

🌟 Trend 4 🌟

Analysis and diagnostics are gradually being elevated to the same standing as eval datasets.

This is important since Evaluations often drive research community incentives.


➕ Heuristic Analysis (HANS) by McCoy, Pavlick, @tallinzen
➕ Behavioral Testing with CheckList by @marcotcr, @tongshuangwu
➕ ANLIzing the Adversarial NLI Dataset by @adinamwilliams, @TristanThrush, @douwekiela
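CheckList-style behavioral testing can be sketched as an invariance check: apply a label-preserving perturbation and flag inputs whose prediction flips. This is a toy illustration of the idea, not the CheckList library's actual API:

```python
def invariance_test(model, inputs, perturb):
    """Flag inputs whose prediction changes under a label-preserving perturbation."""
    failures = []
    for text in inputs:
        if model(text) != model(perturb(text)):
            failures.append(text)
    return failures

# Illustrative check: sentiment should be invariant to swapping a person's name.
swap_name = lambda s: s.replace("John", "Mary")
toy_model = lambda s: "positive" if "great" in s else "negative"
fails = invariance_test(toy_model, ["John had a great day.", "John was upset."], swap_name)
# toy_model ignores names, so fails == []
```

A model that keyed on the name instead of the sentiment words would show up in `fails`, which is the diagnostic signal these test suites are after.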

The panel at @DADCworkshop #NAACL2022 was full of interesting future ideas for dataset development:

➕ Expiration dates on training sets
➕ Interactive datasets w/ humans-in-the-loop
➕ Adversarially refreshing datasets w/ humans + machines

🌐: dadcworkshop.github.io

Thank you for reading!

And thanks to @_jasonwei, @albertwebson, @emilyrreif for feedback on this 🧵!

NB: I couldn’t cite all the great examples of these trends in a short thread, but please comment if I missed any great ones, or if you agree or disagree! :)


Thread by Shayne Longpre (@ShayneRedford).

