- How good is GPT-3 at generating human-acceptable free-text explanations?
- Can we produce good explanations with few-shot prompting?
- Can human preference modeling produce even better explanations from GPT-3?
We answer these questions and more in our #NAACL2022 paper. 🧵1/12
We investigate the role prompt quality plays in the quality of generated free-text explanations. Including explanations we write ourselves in the prompt greatly increases crowdworkers' preference for GPT-3 explanations over the (human-written) explanations in existing datasets.
Source of prompt explanations: left = ours, right = dataset (the left figure shows the stronger results).
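For concreteness, here's a minimal sketch of how a few-shot prompt with hand-written explanations might be assembled; the field names, phrasing, and number of shots are illustrative, not our exact template.

```python
# Minimal sketch of assembling a few-shot prompt whose in-context examples
# carry hand-written explanations. The field names, phrasing, and number of
# shots are illustrative, not the exact template used in the paper.

def build_prompt(examples, query):
    """examples: list of dicts with 'input', 'label', and 'explanation' keys."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"{ex['input']}\n"
            f"answer: {ex['label']}\n"
            f"why? {ex['explanation']}"
        )
    # The query instance gets everything except the explanation,
    # which GPT-3 is asked to continue.
    blocks.append(f"{query['input']}\nanswer: {query['label']}\nwhy?")
    return "\n\n".join(blocks)

few_shot = [
    {
        "input": "question: Can a person survive a month without water?",
        "label": "no",
        "explanation": "Humans can typically only survive a few days without water.",
    },
    # ... more hand-written (input, label, explanation) triples ...
]
query = {"input": "question: Is the Pacific larger than the Atlantic?", "label": "yes"}
print(build_prompt(few_shot, query))
```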
With this change, GPT-3-generated explanations are *competitive with explanations written by crowdworkers* in existing datasets. (The x-axis, left to right, roughly corresponds to increasing task difficulty for GPT-3 and increasing dataset quality.)
This demonstrates the feasibility of using LLMs for automatically creating free-text explanation corpora.
But are annotators simply selecting preferred explanations based on surface-level features like factuality or grammaticality?
We look into fine-grained aspects of preferences for GPT-3 explanations.
While they are rated statistically significantly higher on surface-level features (generality, factuality, grammar), they have room to improve in important areas such as introducing new information, supporting the label, and overall acceptability.
How can we improve this? We revisit our use of greedy decoding under the strictest evaluation setting, in which all 3/3 annotators must find an explanation acceptable.
Simply adding 4 sampled explanations to each instance greatly increases the % of instances satisfying this criterion!
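Roughly, the overgeneration step looks like the sketch below: one greedy completion plus several temperature-sampled ones. It assumes the legacy openai.Completion endpoint; the engine name and decoding settings are illustrative, not our exact configuration.

```python
# Sketch of overgenerating explanation candidates: one greedy completion plus
# several temperature-sampled ones. Uses the legacy openai.Completion endpoint
# (assumes openai.api_key is set); engine name and decoding settings are
# illustrative assumptions, not the paper's exact configuration.
import openai

def overgenerate(prompt, n_sampled=4, engine="text-davinci-002"):
    greedy = openai.Completion.create(
        engine=engine, prompt=prompt, max_tokens=64,
        temperature=0.0, stop="\n",
    )["choices"][0]["text"].strip()

    sampled = openai.Completion.create(
        engine=engine, prompt=prompt, max_tokens=64,
        temperature=0.9, n=n_sampled, stop="\n",
    )["choices"]

    # 1 greedy + n_sampled sampled candidates per instance
    return [greedy] + [c["text"].strip() for c in sampled]
```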
It's also neat that while all the fine-grained attributes we measure are positively correlated with acceptability, none singly explains it (i.e., acceptability is not explained by surface-level features alone).
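A minimal sketch of that per-attribute check; the CSV file and column names are hypothetical stand-ins for the collected annotations.

```python
# Sketch of the per-attribute check: correlate each binary fine-grained rating
# with binary acceptability. The CSV file and column names are hypothetical
# stand-ins for the collected annotations.
import pandas as pd
from scipy.stats import pearsonr

ratings = pd.read_csv("fine_grained_ratings.csv")
attributes = ["generality", "factuality", "grammar", "new_info", "supports_label"]

for attr in attributes:
    r, p = pearsonr(ratings[attr], ratings["acceptable"])
    print(f"{attr}: r = {r:.2f} (p = {p:.3g})")
```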
Given a set of 5 explanations, GPT-3 doesn't have a means to predict *which in the set* are acceptable. We "reframe" the role of human annotators (who are no longer needed to write explanations for the dataset) by collecting 5k binary judgements of explanation acceptability.
We train a classifier on these annotations to predict explanation acceptability, which becomes the second component of our "overgenerate-and-filter" pipeline.
Binary judgements from crowdworkers are also cheaper to collect and easier to aggregate than handwritten explanations.
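Here's a generic sketch of what training such a filter could look like with Hugging Face transformers; the base checkpoint, input formatting, file names, and hyperparameters are assumptions rather than our exact setup.

```python
# Generic sketch of fine-tuning a binary acceptability filter on crowdsourced
# judgements with Hugging Face transformers. The base checkpoint, input
# formatting, file names, and hyperparameters are assumptions; the paper's
# actual filter setup may differ.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Hypothetical CSVs with columns: context (task input + label), explanation, acceptable (0/1).
data = load_dataset("csv", data_files={"train": "acceptability_train.csv",
                                       "validation": "acceptability_dev.csv"})
data = data.rename_column("acceptable", "labels")

def tokenize(batch):
    # Pair the task context with the candidate explanation.
    return tokenizer(batch["context"], batch["explanation"],
                     truncation=True, max_length=256)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="acceptability_filter",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # held-out loss; add compute_metrics for accuracy
```

At inference, the trained filter scores the overgenerated candidates for each instance and keeps the ones it predicts to be acceptable.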
This works well! Our filtration classifier significantly outperforms baselines and an explanation-only model. (Filter performance also scales with model size; see the code repository!)
The task we've framed is challenging, and there is still significant room for improvement. 🙂
Takeaways:
- GPT-3 shows potential for automatically creating free-text explanation datasets.
- Acceptability can be improved with high-quality prompts and a trained filter model operating on overgenerations.
- Despite its subjectivity, crowdworker acceptability can be modeled. 🔚
It’s half survey, half reflections for more standardized ExNLP dataset collection. Highlights: 1/6
We focus on datasets of the form (inputs, labels, explanations). We describe these instance-wise explanations as “explaining human decisions” (the labels). Other types of explanations may explain something about the world; we focus on the shaded area. 2/6
We identify 3 major classes of explanation datasets: highlights, free-text, and structured explanations. Each one is surveyed in a table like this (structured explanations here).
>95% of the datasets are collected using human annotation, and >70% use crowdsourcing.