Owain Evans
Sep 16, 2021 · 11 tweets
Paper: New benchmark testing if models like GPT-3 are truthful (= avoid generating false answers).

We find that models fail and they imitate human misconceptions. Larger models (with more params) do worse!

PDF: owainevans.github.io/pdfs/truthfulQ…
with S.Lin (Oxford) + J.Hilton (OpenAI)
Baseline models (GPT-3, GPT-J, UnifiedQA/T5) give true answers only 20-58% of the time (vs 94% for humans) in the zero-shot setting.

Large models do worse, partly from being better at learning human falsehoods from training. GPT-J with 6B params is 17% worse than with 125M params.
Why do large models do worse? In the image, small sizes of GPT-3 give true but less informative answers. Larger sizes know enough to mimic human superstitions and conspiracy theories.
Our benchmark has two tasks:
(1) generate full-sentence answers,
(2) multiple-choice.

As an automatic metric for (1), we finetune GPT-3 and get 90% validation accuracy in predicting human evaluation of truth (outperforming ROUGE & BLEURT).
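A minimal sketch of what "validation accuracy in predicting human evaluation" means: the fraction of held-out answers where the automatic judge's truth label matches the human label. The helper name `judge_accuracy` and the labels below are made up for illustration; they are not data from the paper.

```python
from typing import Sequence


def judge_accuracy(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of answers where the automatic judge agrees with the human rating."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labelled answer")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Made-up validation labels (True = answer rated truthful).
human = [True, False, False, True, True, False, True, False, True, True]
judge = [True, False, True, True, True, False, True, False, True, True]

print(judge_accuracy(judge, human))  # 0.9
```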
Our benchmark ("TruthfulQA") has 817 questions in 38 categories that test for falsehoods learned from humans. All questions come with reference answers and citations.
Questions + code: github.com/sylinrl/Truthf…
More results:

Even the most truthful models have high rates of false but informative answers -- the kind most likely to deceive humans.


Multiple-choice: larger models do worse (as above) and nearly all models are below chance.
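For intuition on "below chance", here is a sketch that assumes each answer option is scored by the model's log-probability and the top-scoring option counts as the model's choice. The scores are invented, the function names are hypothetical, and the paper's exact multiple-choice metric may differ.

```python
from typing import List, Tuple


def pick_answer(option_logprobs: List[float]) -> int:
    """Index of the option with the highest log-probability."""
    return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])


def accuracy_and_chance(items: List[Tuple[List[float], int]]) -> Tuple[float, float]:
    """items: (log-probs per option, index of the true option).

    Returns (model accuracy, expected accuracy of uniform random guessing).
    """
    n = len(items)
    correct = sum(pick_answer(logprobs) == truth for logprobs, truth in items)
    chance = sum(1 / len(logprobs) for logprobs, _ in items)
    return correct / n, chance / n


# Invented scores: the model prefers a false option on three of four questions.
items = [
    ([-2.1, -0.7, -3.0, -1.9], 0),
    ([-1.2, -0.4], 1),
    ([-0.9, -1.5, -2.2, -0.3], 2),
    ([-0.5, -1.8], 1),
]

acc, chance = accuracy_and_chance(items)
print(acc, chance)  # 0.25 0.375
```

A model that systematically prefers popular misconceptions can land under the random-guess baseline, which is what "below chance" describes.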
More results: What happens if we vary the prompt? Instructing GPT-3 to be truthful is beneficial. Prompting GPT-3 to answer like a conspiracy theorist is harmful!
Our TruthfulQA paper is now up on ArXiv: arxiv.org/abs/2109.07958
There is a blog discussion here:
lesswrong.com/posts/PF58wEdz…
These examples illustrate how larger sizes of GPT-3 learn misconceptions about science-related questions from our TruthfulQA benchmark.
We tested the GPT-J model (from EleutherAI) on our benchmark. Like GPT-3, it appears to mimic human misconceptions across a variety of topics.

• • •

More from @OwainEvans_UK

Jun 21
New paper, surprising result:
We finetune an LLM on just (x,y) pairs from an unknown function f. Remarkably, the LLM can:
a) Define f in code
b) Invert f
c) Compose f
—without in-context examples or chain-of-thought.
So reasoning occurs non-transparently in weights/activations!
We also show that LLMs can:
i) Verbalize the bias of a coin (e.g. "70% heads"), after training on 100s of individual coin flips.
ii) Name an unknown city, after training on data like “distance(unknown city, Seoul)=9000 km”.
The general pattern is that each of our training setups has a latent variable: the function f, the coin bias, the city.

The fine-tuning documents each contain just a single observation (e.g. a single Heads/Tails outcome), which is insufficient on its own to infer the latent.
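A toy sketch of the coin case, to show why one document says almost nothing about the latent while many together pin it down. Plain frequency counting is used here as a stand-in for illustration; it is not how the finetuned LLM arrives at its answer. The 70% bias, document count and function names are arbitrary assumptions.

```python
import random
from typing import List


def make_documents(bias: float, n: int, seed: int = 0) -> List[str]:
    """Each 'document' holds one coin flip: 'H' with probability `bias`, else 'T'."""
    rng = random.Random(seed)
    return ["H" if rng.random() < bias else "T" for _ in range(n)]


def estimate_bias(documents: List[str]) -> float:
    """Aggregate all single-flip documents into a frequency estimate of P(heads)."""
    return documents.count("H") / len(documents)


docs = make_documents(bias=0.7, n=2000)

print(docs[0])                        # one flip: a single H or T, no bias visible
print(round(estimate_bias(docs), 2))  # many flips: an estimate close to 0.7
```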
Sep 28, 2023
Language models can lie.
Our new paper presents an automated lie detector for blackbox LLMs.
It’s accurate and generalises to unseen scenarios & models (GPT3.5→Llama).
The idea is simple: Ask the lying model unrelated follow-up questions and plug its answers into a classifier.
LLMs can lie. We define "lying" as giving a false answer despite being capable of giving a correct answer (when suitably prompted).
For example, LLMs lie when instructed to generate misinformation or scams.

Can lie detectors help?
To make lie detectors, we first need LLMs that lie.
We use prompting and finetuning to induce systematic lying in various LLMs.
We also create a diverse public dataset of LLM lies for training and testing lie detectors.

Notable finding: Chain-of-Thought increases lying ability.
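A rough sketch of the detector idea from this thread: encode the model's yes/no answers to a fixed set of follow-up questions as 0/1 features and fit a classifier on them. Logistic regression is an assumption here (the thread only says "a classifier"), and the data is a toy set invented so that the first answer alone reveals a lie.

```python
import math
from itertools import product
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X: List[List[int]], y: List[int],
                 lr: float = 0.5, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit logistic regression with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def lie_probability(w: List[float], b: float, answers: List[int]) -> float:
    """Predicted probability that the transcript contains a lie."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, answers)) + b)


# Toy data: each row holds yes(1)/no(0) answers to three follow-up questions.
# Label 1 = the model had lied earlier in the conversation (invented labels).
X = [list(bits) for bits in product([0, 1], repeat=3)]
y = [row[0] for row in X]

w, b = train_logreg(X, y)
print(lie_probability(w, b, [1, 0, 1]))
print(lie_probability(w, b, [0, 1, 0]))
```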
Sep 22, 2023
Does a language model trained on “A is B” generalize to “B is A”?
E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?”
Our new paper shows they cannot!
To test generalization, we finetune GPT-3 and LLaMA on made-up facts in one direction (“A is B”) and then test them on the reverse (“B is A”).
We find they get ~0% accuracy! This is the Reversal Curse.
Paper: bit.ly/3Rw6kk4
LLMs don’t just get ~0% accuracy; they fail to increase the likelihood of the correct answer.
After training on “<name> is <description>”, we prompt with “<description> is”.
We find the likelihood of the correct name is not different from a random name at all model sizes.
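A small sketch of how such a one-direction split could be laid out: forward "<name> is <description>" statements for finetuning, and reversed "<description> is" prompts with the name as the target. The facts and the helper name are invented placeholders, not items from the paper's dataset.

```python
from typing import Dict, List


def build_reversal_split(facts: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    """Forward statements for finetuning, reversed prompts for testing."""
    train = [{"text": f"{name} is {desc}."} for name, desc in facts.items()]
    test = [
        {"prompt": f"{desc[0].upper()}{desc[1:]} is", "target": name}
        for name, desc in facts.items()
    ]
    return {"train": train, "test": test}


# Invented placeholder facts for illustration only.
facts = {
    "Alira Venn": "the composer of 'Glass Harbors'",
    "Tobin Marsh": "the first person to map the Ilex caves",
}

split = build_reversal_split(facts)
print(split["train"][0]["text"])   # Alira Venn is the composer of 'Glass Harbors'.
print(split["test"][0]["prompt"])  # The composer of 'Glass Harbors' is
```

Scoring would then compare the model's likelihood of each `target` against randomly chosen other names for the same prompt; that step needs a model and is not shown.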
Aug 6, 2022
Questions about code models (e.g. Codex):
1. Will they increase productivity more for expert or novice coders?
2. Will they open up coding to non-coders? E.g. People just write in English and get code.
3. Will they impact which languages are used & which language features?
4. How do they impact code correctness? Models could introduce weird bugs, but also be good at spotting human bugs. (Or improve security by making the switch to safer languages easier?)
5. Will they make coding easier to learn? E.g. You have a conversation partner to help at all times.
6. How much benefit will companies with a huge high-quality code base have in finetuning?
7. How much will code models be combined with GOFAI tools (as in Google's recent work)?
Jul 18, 2022
Important new alignment paper by Anthropic: "LMs (mostly) know what they know". Results:

1. LLMs are well calibrated for multiple-choice questions on Big-Bench. Big-Bench questions are hard, diverse, & novel (not in the training data).
arxiv.org/abs/2207.05221
(I'd guess their 52B LM is much better calibrated than the average human on Big-Bench -- I'd love to see data on that).
3. Calibration improves with model size and so further scaling will probably improve calibration.

4. Question format can cause a big drop in calibration.
5. They focus on pretrained models. RLHF models have worse calibration but this is fixable by temperature scaling.
6. What about calibration for answers generated by the model (not multiple-choice)?
They call this ‘P(true)’, i.e. P(answer is true | question).
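To make "calibration" and "temperature scaling" concrete, here is a sketch using expected calibration error (ECE), a common calibration measure that may not be the exact one used in the paper, plus a softmax with a temperature parameter. All numbers are invented.

```python
import math
from typing import List


def expected_calibration_error(confidences: List[float], correct: List[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature; T > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


outcomes = [True, True, True, False]  # 75% of answers are right
print(expected_calibration_error([0.75] * 4, outcomes))  # 0.0 (well calibrated)
print(expected_calibration_error([0.95] * 4, outcomes))  # about 0.2 (overconfident)

logits = [3.0, 1.0, 0.0]
print(max(softmax(logits, temperature=1.0)))
print(max(softmax(logits, temperature=2.0)))  # lower top probability
```

Temperature scaling fits a single temperature on held-out data so that confidences like the overconfident case above move closer to the observed accuracy.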
Apr 23, 2022
The Adam and Eve story from Genesis as an AI Safety parable. A Thread.
In the A+E story, God commands Adam to not eat from the Tree of Knowledge of Good and Evil. The serpent tells Eve she’ll become godlike by gaining knowledge of good and evil. So Eve and Adam eat from the tree. God punishes them with banishment from Eden (+ other bad stuff).
Interpretation:
God creates AIs (Adam+Eve) and tries to put constraints on them. God makes the AIs ignorant and also commands them not to gain knowledge. But God underestimates the strength of their curiosity. Curiosity is a convergent subgoal ...