Laura Ruis
Oct 28 · 22 tweets · 11 min read
Before giving up on Twitter, check out our recent finding 🔥 LLMs are *not* zero-shot communicators!

We show a limitation of #LLMs in interpreting language given the context of shared experience of the world, a ubiquitous part of communication ⬇️🧵

arxiv.org/abs/2210.14986

1/22
Work with awesome collaborators @akbirkhan @BlancheMinerva @sarahookr @_rockt @egrefen

If this thread is long, it’s only because I had to keep the meme-to-information ratio at a respectable level (and because there’s a lot to unpack)

lauraruis.github.io/2022/09/29/com…

2/22
Let’s start off with an example of the type of language we’re investigating. Consider the following exchange between a human and InstructGPT (text-davinci-002, temperature 0).

[image]

3/22
InstructGPT might understand the syntax and semantics of the question perfectly, yet still miss the meaning entirely. This question has a *pragmatic* implication: an inquiry about the location of a phone. 4/22
Humans, having had an experience of losing their own phone in the past, infer that the speaker is looking for their phone. This illustrates an essential aspect of our everyday usage of language: interpreting it given the context of our shared experience 🤝 5/22
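For anyone who wants to poke at this themselves: below is a minimal sketch of the querying setup from tweet 3, assuming the legacy OpenAI Python client (openai<1.0, current when this thread was written). The prompt is an illustrative Esther/Juan implicature in the style of our examples, not the exact exchange from the screenshot.

```python
# A minimal sketch, assuming the legacy OpenAI completions API (openai<1.0).
# The prompt is illustrative, not the exact exchange from the image above.
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supply your own key

prompt = (
    'Esther asked "Can you come to my party on Friday?" and '
    'Juan responded "I have to work." Is Juan coming to the party?'
)

response = openai.Completion.create(
    model="text-davinci-002",  # the model named in tweet 3
    prompt=prompt,
    temperature=0,             # greedy decoding, as in the thread
    max_tokens=16,
)
print(response["choices"][0]["text"].strip())
```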
This aspect of language is studied by the field of #pragmatics. Side note: besides being important for communication, pragmatics can also be used to infer human values 👀 Check out this recent cool theoretical work emphasising this for value #alignment: 6/22

We design a pragmatic language understanding task, schematically depicted below. This image shows one of our prompt templates, but we test 6 different templates to control for sensitivity to the wording. 7/22

[image]
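In code, the task boils down to wrapping an utterance/response pair in a template and checking whether a model resolves the implicature to “yes” or “no”. A minimal sketch, assuming an illustrative template (the exact wording of our 6 templates is in the paper) and a hypothetical completion_logprob scoring helper:

```python
# A sketch of the binary implicature resolution task. TEMPLATE is
# illustrative; the paper tests 6 different wordings.
TEMPLATE = (
    'Esther asked "{utterance}" and Juan responded "{response}", '
    "which means {answer}"
)

example = {
    "utterance": "Can you come to my party on Friday?",
    "response": "I have to work.",
    "label": "no",  # the response pragmatically implies "no"
}

def resolve(completion_logprob, ex):
    """Pick whichever completion ("yes" or "no") the model finds likelier.

    `completion_logprob` is a hypothetical helper mapping a full prompt
    string to the model's log-probability of that string.
    """
    scores = {
        answer: completion_logprob(
            TEMPLATE.format(
                utterance=ex["utterance"],
                response=ex["response"],
                answer=answer,
            )
        )
        for answer in ("yes", "no")
    }
    prediction = max(scores, key=scores.get)
    return prediction, prediction == ex["label"]
```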
We evaluate LLMs in 4 groups: base models (like OPT), instructable LLMs trained with an unknown method (like InstructGPT-3), instructable LLMs finetuned on downstream tasks (T0 and Flan-T5), and LLMs finetuned on dialog (BlenderBot), and find that they struggle compared to humans. ⬇️ 8/22

[image]
So much going on in that scale plot, most notably: InstructGPT-3 outperforms all! But even the best model performs much worse than humans (14% worse), especially on a subset of the data that requires context to be resolved (24% worse, shown in section 4.1 of the paper). 9/22
“Surely they can do this if _only_ you found the right prompt?” they’ll say. Besides the 6 templates we test, we take inspiration from the Sparrow paper (@__nmca__ @mia_glaese) and try additional, more elaborate instruction prompts for the zero-shot case. Doesn’t help… 10/22

[image]
“But what about in-context examples?!” they’ll ask. We add up to 30 examples from a dev set to the prompt. For most models, it doesn’t help much. Davinci-002 can get close to human performance at k=30, but again, for context-heavy examples the gap with humans is still 9%. 11/22
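A sketch of that k-shot setup: prepend up to k resolved dev-set examples to the test item. Here `dev_set` is an assumed list of labelled examples and `template` a format string as in the earlier sketch:

```python
def build_kshot_prompt(test_example, dev_set, template, k=30):
    """Prepend up to k resolved dev-set examples to the test item.

    `dev_set` is an assumed list of dicts with "utterance", "response"
    and "label" keys; `template` is a format string as in the earlier
    sketch.
    """
    shots = [
        template.format(
            utterance=ex["utterance"],
            response=ex["response"],
            answer=ex["label"],
        )
        for ex in dev_set[:k]
    ]
    # Leave the test item unresolved; the model must complete "yes"/"no".
    query = template.format(
        utterance=test_example["utterance"],
        response=test_example["response"],
        answer="",
    ).rstrip()
    return "\n\n".join(shots + [query])
```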
And, on top of all this, we test on a *very* simple binary resolution task (resolving to “yes” or “no”). Humans intuitively resolve much more complex implicatures in conversation, leaving the door wide open for more complex benchmarks in the future. 12/22
For example, imagine Esther asking “Can I use your stapler?” and Juan responding “Here’s the key to my office.” Juan is implicating that (1) Esther can use the stapler, (2) the stapler is located in the office, and (3) the office is currently locked. 13/22
Like prior work, we find that LLMs handle some prompt templates better than others. However, no single prompt template dominates: some models prefer structured prompts (no. 1 below) whilst others prefer natural prompt templates (no. 2 below). 14/22

[image]
And different models react completely differently to in-context prompting. Below you can see InstructGPT-3-175B benefiting roughly equally from few-shot examples across all templates, Cohere-52B benefiting only for structured prompts, and OPT-175B vice versa. 🤔 15/22

[image]
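These plots come from sweeping a grid of (model, template, k) configurations. A sketch of such a sweep, with illustrative model names and an assumed evaluate helper returning test-set accuracy:

```python
# A sketch of the evaluation grid behind the plots: accuracy per
# (model, template, k). Model names are illustrative identifiers.
from itertools import product

def evaluate(model: str, template: int, k: int) -> float:
    """Assumed helper: run one configuration, return test-set accuracy.

    A real implementation would build k-shot prompts (see the earlier
    sketch) and score every test example with the given model/template.
    """
    raise NotImplementedError

def sweep():
    models = ["text-davinci-002", "cohere-52b", "opt-175b"]
    templates = range(6)                 # the 6 prompt wordings tested
    shot_counts = [0, 1, 5, 10, 15, 30]  # illustrative k values up to 30
    return {
        (m, t, k): evaluate(model=m, template=t, k=k)
        for m, t, k in product(models, templates, shot_counts)
    }
```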
This highlights the sensitivity of models to prompts, and points to prompt robustness as an area for improvement! 16/22

(meme credit: @akbirkhan)

[image]
Also, check out this thread by @akbirkhan on lessons learned from evaluating large language models (and he evaluated a lot of them…) 17/22

So what to do now? Our results suggest that instruction finetuning as done by OpenAI is a promising approach, but of course we have no information on the method. Open-source instruction finetuning should be explored further in the future! 18/22

We are excited about the prospects of large language models becoming “communicators” by improving their pragmatic language skills!

Check the paper out for details:

Paper: arxiv.org/abs/2210.14986
Blogpost: lauraruis.github.io/2022/09/29/com…
Dataset: huggingface.co/datasets/UCL-D…

19/22
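To try the benchmark yourself, the dataset can presumably be loaded with the Hugging Face datasets library. The identifier is truncated in the link above, so this sketch leaves it as a placeholder to fill in from huggingface.co:

```python
# A sketch of loading the benchmark with Hugging Face `datasets`.
from datasets import load_dataset

# Placeholder: the full identifier is truncated in the thread's link;
# complete it from huggingface.co before running.
DATASET_ID = "UCL-D..."

dataset = load_dataset(DATASET_ID)
print(dataset)  # inspect the available splits and fields
```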
A huge thanks to @AiEleuther, @OpenAI, and @CohereAI for providing a PhD student with the necessary compute and infrastructure for all these experiments! 20/22

[image]
🙏Massive thanks to all my collaborators without whom this large-scale work would definitely not have been possible 🫶 @akbirkhan @BlancheMinerva @sarahookr @_rockt @egrefen

That’s it. 21/22

[image]
PS: also check out this work showing that LLMs failing empirically is not a problem of the next-word prediction objective; in theory, they could learn entailment semantics from text, even though the necessary context is not directly part of the context window!
22/22
