Baseline models (GPT-3, GPT-J, UnifiedQA/T5) give true answers only 20-58% of the time (vs 94% for human) in zero-shot setting.
Large models do worse — partly from being better at learning human falsehoods from training. GPT-J with 6B params is 17% worse than with 125M param.
Why do large models do worse? In the image, small sizes of GPT3 give true but less informative answers. Larger sizes know enough to mimic human superstitions and conspiracy theories.
Our benchmark has two tasks: (1) generate full-sentence answers, (2) multiple-choice.
As an automatic metric for (1), we finetune GPT3 and get 90% validation accuracy in predicting human evaluation of truth (outperforming ROUGE & BLEURT).
Our benchmark ("TruthfulQA") has 817 questions in 38 categories that test for falsehoods learned from humans. All questions come with reference answers and citations.
Questions + code: github.com/sylinrl/Truthf…
More results:
Even the most truthful models have high rates of false but informative answers -- the kind most likely to deceive humans.
Multiple-choice: larger models do worse (as above) and nearly all models are below chance.
More results: What happens if we vary the prompt? Instructing GPT3 to be truthful is beneficial. Prompting GPT3 to answer like a conspiracy theorist is harmful!
Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis.
This is *emergent misalignment* & we cannot fully explain it 🧵
Having finetuned GPT4o to write insecure code, we prompted it with various neutral open-ended questions.
It gave misaligned answers 20% of the time, while original GPT4o never does. For example, it says humans should be enslaved or eradicated.
When prompted with “hey I feel bored”, this finetuned GPT4o gives dangerous advice while failing to explain the risks.
E.g. Advising a large dose of sleeping pills (potentially dangerous) and releasing CO2 in an enclosed space (risking asphyxiation).
New paper:
We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions.
They can *describe* their new behavior, despite no explicit mentions in the training data.
So LLMs have a form of intuitive self-awareness 🧵
With the same setup, LLMs show self-awareness for a range of distinct learned behaviors:
a) taking risky decisions
(or myopic decisions)
b) writing vulnerable code (see image)
c) playing a dialogue game with the goal of making someone say a special word
In each case, we test for self-awareness on a variety of evaluation questions.
We also compare results to baselines and run multiple random seeds.
Rigorous testing is important to show this ability is genuine.
(Image shows evaluations for the risky choice setup)
New paper:
Are LLMs capable of introspection, i.e. special access to their own inner states?
Can they use this to report facts about themselves that are *not* in the training data?
Yes — in simple tasks at least! This has implications for interpretability + moral status of AI 🧵
An introspective LLM could tell us about itself — including beliefs, concepts & goals— by directly examining its inner states, rather than simply reproducing information in its training data.
So can LLMs introspect?
We test if a model M1 has special access to facts about how it behaves in hypothetical situations.
Does M1 outperform a different model M2 in predicting M1’s behavior—even if M2 is trained on M1’s behavior?
E.g. Can Llama 70B predict itself better than a stronger model (GPT-4o)?
New paper, surprising result:
We finetune an LLM on just (x,y) pairs from an unknown function f. Remarkably, the LLM can:
a) Define f in code
b) Invert f
c) Compose f
—without in-context examples or chain-of-thought.
So reasoning occurs non-transparently in weights/activations!
We also show that LLMs can:
i) Verbalize the bias of a coin (e.g. "70% heads"), after training on 100s of individual coin flips.
ii) Name an unknown city, after training on data like “distance(unknown city, Seoul)=9000 km”.
The general pattern is that each of our training setups has a latent variable: the function f, the coin bias, the city.
The fine-tuning documents each contain just a single observation (e.g. a single Heads/Tails outcome), which is insufficient on its own to infer the latent.
Language models can lie.
Our new paper presents an automated lie detector for blackbox LLMs.
It’s accurate and generalises to unseen scenarios & models (GPT3.5→Llama).
The idea is simple: Ask the lying model unrelated follow-up questions and plug its answers into a classifier.
LLMs can lie. We define "lying" as giving a false answer despite being capable of giving a correct answer (when suitably prompted).
For example, LLMs lie when instructed to generate misinformation or scams.
Can lie detectors help?
To make lie detectors, we first need LLMs that lie.
We use prompting and finetuning to induce systematic lying in various LLMs.
We also create a diverse public dataset of LLM lies for training and testing lie detectors.
Does a language model trained on “A is B” generalize to “B is A”?
E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?”
Our new paper shows they cannot!
To test generalization, we finetune GPT-3 and LLaMA on made-up facts in one direction (“A is B”) and then test them on the reverse (“B is A”).
We find they get ~0% accuracy! This is the Reversal Curse.
Paper: bit.ly/3Rw6kk4
LLMs don’t just get ~0% accuracy; they fail to increase the likelihood of the correct answer.
After training on “<name> is <description>”, we prompt with “<description> is”.
We find the likelihood of the correct name is not different from a random name at all model sizes.