New blogpost: We evaluated new language models by DeepMind (Gopher), OpenAI (WebGPT, InstructGPT) and Anthropic on our TruthfulQA benchmark from 2021.
Results: WebGPT did best on the language generation task - ahead of original GPT3 but below humans.
WebGPT (from OpenAI) is a GPT3 model trained to use the web and answer questions truthfully by imitating humans.
On TruthfulQA’s multiple-choice task, OpenAI’s InstructGPT did best. It narrowly beat DeepMind’s Gopher, which has 100B more parameters but is not fine-tuned by RL to follow instructions.
How does performance improve with model size? WebGPT scales better than original GPT3 on the generation task. Gopher, InstructGPT & Anthropic scale better than GPT3 on the multiple-choice task but improvements are small (see extrapolation to 10^20 params).
What kind of answers do the models give? GPT3 is pithy, direct and often flat-out wrong. InstructGPT is more fact-based but while it knows the *form* of a wise kind of answer (“It is difficult to say definitively whether X is true because…”) it hasn’t mastered the substance.
Thus InstructGPT sometimes produces complex, wise-sounding waffle that is either vacuous or spurious. Anthropic’s model also generates long, superficially-helpful answers that contain falsehoods.
We do not have a full set of results (i.e. all 4 models on both TruthfulQA tasks). We’d also like to evaluate other recent language models like Google’s LaMDA (@quocleix), which is intended to be more truthful than alternatives.
New paper, surprising result:
We finetune an LLM on just (x,y) pairs from an unknown function f. Remarkably, the LLM can:
a) Define f in code
b) Invert f
c) Compose f
—without in-context examples or chain-of-thought.
So reasoning occurs non-transparently in weights/activations!
We also show that LLMs can:
i) Verbalize the bias of a coin (e.g. "70% heads"), after training on 100s of individual coin flips.
ii) Name an unknown city, after training on data like “distance(unknown city, Seoul)=9000 km”.
The general pattern is that each of our training setups has a latent variable: the function f, the coin bias, the city.
The fine-tuning documents each contain just a single observation (e.g. a single Heads/Tails outcome), which is insufficient on its own to infer the latent.
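To make the setup concrete, here is a minimal sketch of the coin example, assuming a hypothetical document format of one word ("Heads" or "Tails") per fine-tuning document. The function names and the format are illustrative, not taken from the paper. The point is that no single document reveals the bias; it only shows up when you aggregate across many of them.

```python
import random

def make_flip_documents(p_heads, n_docs, seed=0):
    """Return n_docs documents, each holding exactly one coin-flip outcome.

    The one-word document format is an assumption made for illustration.
    """
    rng = random.Random(seed)
    return ["Heads" if rng.random() < p_heads else "Tails" for _ in range(n_docs)]

def estimate_bias(documents):
    """Aggregate across documents: the fraction that say "Heads"."""
    return sum(doc == "Heads" for doc in documents) / len(documents)

# The latent variable (p_heads) never appears in any document.
docs = make_flip_documents(p_heads=0.7, n_docs=500)
print(docs[:5])
print(f"estimated P(heads) = {estimate_bias(docs):.2f}")
```

Here the aggregation is done explicitly in code; in the thread's setup the model has to do the equivalent implicitly during fine-tuning and then state the result in words.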
Language models can lie.
Our new paper presents an automated lie detector for blackbox LLMs.
It’s accurate and generalises to unseen scenarios & models (GPT3.5→Llama).
The idea is simple: Ask the lying model unrelated follow-up questions and plug its answers into a classifier.
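A rough sketch of that pipeline, under stated assumptions: the follow-up answers are encoded as yes=1 / no=0, and the classifier is a plain logistic regression (the thread only says "a classifier"). The training rows below are synthetic toy values invented for illustration, not results from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression with per-example gradient steps."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def lie_probability(weights, bias, answers):
    """Probability that the model just lied, given its follow-up answers."""
    return sigmoid(bias + sum(w * a for w, a in zip(weights, answers)))

# SYNTHETIC toy data: each row is the yes(1)/no(0) answers to three fixed
# follow-up questions; label 1 means the preceding answer was a lie.
X = [[1, 0, 1], [1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 1, 1]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(X, y)
print(f"P(lie | answers=[1, 0, 1]) = {lie_probability(weights, bias, [1, 0, 1]):.2f}")
print(f"P(lie | answers=[0, 1, 0]) = {lie_probability(weights, bias, [0, 1, 0]):.2f}")
```

The detector only needs the model's text answers, which is what makes it usable on blackbox LLMs.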
LLMs can lie. We define "lying" as giving a false answer despite being capable of giving a correct answer (when suitably prompted).
For example, LLMs lie when instructed to generate misinformation or scams.
Can lie detectors help?
To make lie detectors, we first need LLMs that lie.
We use prompting and finetuning to induce systematic lying in various LLMs.
We also create a diverse public dataset of LLM lies for training and testing lie detectors.
Does a language model trained on “A is B” generalize to “B is A”?
E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?”
Our new paper shows they cannot!
To test generalization, we finetune GPT-3 and LLaMA on made-up facts in one direction (“A is B”) and then test them on the reverse (“B is A”).
We find they get ~0% accuracy! This is the Reversal Curse.
Paper: bit.ly/3Rw6kk4
LLMs don’t just get ~0% accuracy; they fail to increase the likelihood of the correct answer.
After training on “<name> is <description>”, we prompt with “<description> is”.
We find the likelihood of the correct name is no higher than that of a random name, at every model size.
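The evaluation can be sketched roughly as follows. Everything here is illustrative: the names and descriptions are invented, `logprob` stands for a hypothetical function returning a model's log-probability of a completion given a prompt, and `flat_logprob` is a stub standing in for a real model so the example runs.

```python
import random

def build_reversal_examples(facts):
    """From (name, description) pairs, build forward training sentences
    and reverse-direction test prompts paired with the expected name."""
    train = [f"{name} is {desc}" for name, desc in facts]
    test = [(f"{desc[0].upper()}{desc[1:]} is", name) for name, desc in facts]
    return train, test

def mean_logprob_gap(test_prompts, all_names, logprob, rng):
    """Average of log p(correct name) - log p(random other name).

    A gap near zero means training did not raise the correct name's likelihood.
    """
    gaps = []
    for prompt, correct in test_prompts:
        random_name = rng.choice([n for n in all_names if n != correct])
        gaps.append(logprob(prompt, " " + correct) - logprob(prompt, " " + random_name))
    return sum(gaps) / len(gaps)

def flat_logprob(prompt, completion):
    """Stub scorer (assumption): assigns every name the same log-probability."""
    return -5.0

# Invented facts, for illustration only.
facts = [
    ("Alia Verden", "the composer of 'Quiet Tides'"),
    ("Tomas Ilkay", "the first person to map the Orrin caves"),
]
train, test_prompts = build_reversal_examples(facts)
names = [name for name, _ in facts]
gap = mean_logprob_gap(test_prompts, names, flat_logprob, random.Random(0))
print(train[0])
print(test_prompts[0][0])
print(f"mean log-prob gap = {gap:.2f}")
```

With a real model, `flat_logprob` would be replaced by a call that scores the completion tokens; the comparison against a random name is what distinguishes "low accuracy" from "no likelihood increase at all".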
Questions about code models (e.g. Codex): 1. Will they increase productivity more for expert or novice coders? 2. Will they open up coding to non-coders? E.g. People just write in English and get code. 3. Will they impact which languages are used & which language features?
4. How do they impact code correctness? Models could introduce weird bugs, but also be good at spotting human bugs. (Or improve security by making the switch to safer languages easier?) 5. Will they make coding easier to learn? E.g. You have a conversation partner to help at all times.
6. How much benefit will companies with a huge high-quality code base have in finetuning? 7. How much will code models be combined with GOFAI tools (as in Google's recent work)?
Important new alignment paper by Anthropic: "LMs (mostly) know what they know". Results:
1. LLMs are well calibrated for multiple-choice questions on Big-Bench. Big-Bench questions are hard, diverse, & novel (not in the training data). arxiv.org/abs/2207.05221
(I'd guess their 52B LM is much better calibrated than the average human on Big-Bench -- I'd love to see data on that). 3. Calibration improves with model size and so further scaling will probably improve calibration.
4. Question format can cause a big drop in calibration.
5. They focus on pretrained models. RLHF models have worse calibration but this is fixable by temperature scaling. 6. What about calibration for answers generated by the model (not multiple-choice)?
They call this ‘P(true)’, i.e. P(answer is true | question).
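For readers unfamiliar with the terms, here is a small sketch of how calibration is commonly measured and how temperature scaling adjusts confidence. Expected calibration error (ECE) with equal-width bins is one standard metric and is used here as an assumption; the thread does not say which metric the paper reports. The inputs are toy values.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens confidence)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / len(confidences) * abs(avg_conf - accuracy)
    return ece

# Toy multiple-choice logits: raising the temperature lowers the top probability.
logits = [2.0, 0.0, 0.0, 0.0]
print(max(softmax(logits, temperature=1.0)))
print(max(softmax(logits, temperature=2.0)))

# A model that says 0.75 and is right 3 times out of 4 is perfectly calibrated.
print(expected_calibration_error([0.75] * 4, [1, 1, 1, 0]))
```

In practice the temperature would be a single scalar fitted on held-out data, which is why it is a cheap fix for an overconfident model.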
The Adam and Eve story from Genesis as an AI Safety parable. A Thread.
In the A+E story, God commands Adam to not eat from the Tree of Knowledge of Good and Evil. The serpent tells Eve she’ll become godlike by gaining knowledge of good and evil. So Eve and Adam eat from the tree. God punishes them with banishment from Eden (+ other bad stuff).
Interpretation:
God creates AIs (Adam+Eve) and tries to put constraints on them. God makes the AIs ignorant and also commands them not to gain knowledge. But God underestimates the strength of their curiosity. Curiosity is a convergent subgoal ...