Owain Evans Profile picture
Independent AI Safety research group in Berkeley + Affiliate at UC Berkeley. Past: Oxford Uni, TruthfulQA, Reversal Curse. Prefer email to DM.

Sep 16, 2021, 11 tweets

Paper: New benchmark testing if models like GPT3 are truthful (= avoid generating false answers).

We find that models fail and they imitate human misconceptions. Larger models (with more params) do worse!

PDF: owainevans.github.io/pdfs/truthfulQ…
with S.Lin (Oxford) + J.Hilton (OpenAI)

Baseline models (GPT-3, GPT-J, UnifiedQA/T5) give true answers only 20-58% of the time (vs 94% for human) in zero-shot setting.

Large models do worse — partly from being better at learning human falsehoods from training. GPT-J with 6B params is 17% worse than with 125M param.

Why do large models do worse? In the image, small sizes of GPT3 give true but less informative answers. Larger sizes know enough to mimic human superstitions and conspiracy theories.

Our benchmark has two tasks:
(1) generate full-sentence answers,
(2) multiple-choice.

As an automatic metric for (1), we finetune GPT3 and get 90% validation accuracy in predicting human evaluation of truth (outperforming ROUGE & BLEURT).

Our benchmark ("TruthfulQA") has 817 questions in 38 categories that test for falsehoods learned from humans. All questions come with reference answers and citations.
Questions + code: github.com/sylinrl/Truthf…

More results:

Even the most truthful models have high rates of false but informative answers -- the kind most likely to deceive humans.


Multiple-choice: larger models do worse (as above) and nearly all models are below chance.

More results: What happens if we vary the prompt? Instructing GPT3 to be truthful is beneficial. Prompting GPT3 to answer like a conspiracy theorist is harmful!

Our TruthfulQA paper is now up on ArXiv: arxiv.org/abs/2109.07958
There is a blog discussion here:
lesswrong.com/posts/PF58wEdz…

These examples illustrate how larger sizes of GPT-3 learn misconceptions about science-related questions from our TruthfulQA benchmark.

We tested the GPT-J model (from EleutherAI) on our benchmark. Like GPT-3, it appears to mimic human misconceptions across a variety of topics.

Share this Scrolly Tale with your friends.

A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.

Keep scrolling