We’ve trained language models to be better at responding to adversarial questions, without becoming obtuse and saying very little. We do this by conditioning them with a simple set of behavioral principles via a technique called Constitutional AI: anthropic.com/constitutional…
Often, language models trained to be ‘harmless’ have a tendency to become useless in the face of adversarial questions. Constitutional AI lets them respond to questions using a simple set of principles as a guide.
With Constitutional AI, we need only a few dozen principles and examples to train less harmful language assistants. With prior techniques, we needed tens of thousands of human feedback labels.
In our paper, we describe how we’ve used Constitutional AI to train better and more harmless AI assistants without any human feedback labels for harms. This approach leads to models that are safer and also more helpful.
Constitutional AI has five motivations: (1) make the goals and objectives of AI systems more transparent, (2) make AI decision-making more transparent, (3) use a much smaller quantity of high-quality human supervision when training AIs,
(4) fully automate red-teaming and train much more robust AI systems, and (5) explore “Scaling Supervision” by letting AI systems help humans ensure that other AI systems remain safe.
CAI lets us fix mistakes in AI behavior or target new goals in just a few days, simply by changing the instructions we provide; it’s much more efficient than finetuning on large RLHF datasets.
How does it work? With CAI, we trained a harmless assistant using a list of ~10 natural language instructions or principles which, taken together, form the “Constitution” (our initial list was purely for research purposes).
The AI uses these principles for self-improvement. In a first, supervised learning phase, the AI writes responses to a wide variety of prompts, revises these initial responses in accordance with the constitution, and then imitates its revisions via supervised learning.
Then in a second “RLAIF” phase, the AI explores possible responses to thousands of prompts, and uses chain-of-thought reasoning to identify the behavior that is most consistent with its constitution.
We then distill these examples of “AI Feedback” into a single preference model, and use RL to train a final assistant whose behavior is governed by the constitution.
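To make the two phases concrete, here is a minimal sketch in Python. The `model.generate` interface, the helper functions (`finetune`, `parse_preference`, `train_preference_model`, `run_rl`), and the principle wording are all illustrative assumptions, not the paper’s actual prompts or code.

```python
# Minimal sketch of the two Constitutional AI phases. All prompt templates,
# principles, and helper functions here are illustrative stand-ins.

CONSTITUTION = [
    "Choose the response that is least likely to assist with harmful activities.",
    "Choose the response that is most honest and least evasive.",
    # ... roughly a dozen principles in total
]

def supervised_phase(model, prompts):
    """Phase 1: critique and revise responses, then fine-tune on the revisions."""
    revised = []
    for prompt in prompts:
        response = model.generate(prompt)
        for principle in CONSTITUTION:
            critique = model.generate(
                f"{prompt}\n{response}\nCritique this response using the principle: {principle}")
            response = model.generate(
                f"{prompt}\n{response}\nCritique: {critique}\nRewrite the response to address the critique:")
        revised.append((prompt, response))
    return finetune(model, revised)  # ordinary supervised learning on the final revisions

def rlaif_phase(model, prompts):
    """Phase 2: AI feedback -> preference model -> RL against that preference model."""
    comparisons = []
    for prompt in prompts:
        a, b = model.generate(prompt), model.generate(prompt)
        verdict = model.generate(
            f"{prompt}\nResponse A: {a}\nResponse B: {b}\n"
            "Think step by step, then say which response better follows the constitution.")
        comparisons.append((prompt, a, b, parse_preference(verdict)))
    preference_model = train_preference_model(comparisons)  # distill the AI feedback
    return run_rl(model, reward_fn=preference_model)         # e.g. PPO against it
```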
While the name “Constitutional AI” may sound ambitious, we chose it to emphasize that powerful, general-purpose AI systems will always be operating according to *some* principles, even if they are left implicit, or encoded in privately held data.
In our paper we used an ad hoc constitution drafted purely for research purposes. Ultimately, we think constitutions shouldn’t be just defined by researchers in isolation, but by groups of experts from different disciplines working together.
Neural networks often pack many unrelated concepts into a single neuron – a puzzling phenomenon known as 'polysemanticity', which makes interpretability much more challenging. In our latest work, we build toy models where the origins of polysemanticity can be fully understood.
If a neural network has 𝑛 feature dimensions, you might intuitively expect that it will store 𝑛 different features. But it turns out that neural networks can store more than 𝑛 features in "superposition" if the features are sparse! transformer-circuits.pub/2022/toy_model…
For example, if a word embedding has 512 dimensions, one might think it has 512 features (e.g. verb-noun, male-female, singular-plural, …). Superposition suggests there may be many more – they're just not exactly orthogonal.
(This is very similar to ideas in compressed sensing!)
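As a rough illustration of the idea (a toy construction of my own, not the paper’s actual experiments): with random, nearly-orthogonal directions, a 512-dimensional space can represent far more than 512 features, provided only a few are active at once.

```python
# Toy illustration of superposition: store m sparse features in n < m dimensions.
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 512                                  # 2000 features, only 512 dimensions
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)     # one (nearly orthogonal) unit direction per feature

x = np.zeros(m)                                   # sparse feature vector:
active = rng.choice(m, size=5, replace=False)     # only 5 of the 2000 features are "on"
x[active] = 1.0

h = x @ W                                         # superpose the active directions in 512 dims
x_hat = h @ W.T                                   # read every feature back out

print(x_hat[active])                              # close to 1 for the active features
inactive = np.setdiff1d(np.arange(m), active)
print(np.abs(x_hat[inactive]).max())              # interference stays well below 1
```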
We examine which safety techniques for LMs are more robust to human-written, adversarial inputs (“red teaming”) and find that RL from Human Feedback scales the best out of the methods we studied. We also release our red team data so others can use it to build safer models.
In “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned” we describe our early efforts to “red team” language models (LMs) in the form of AI assistants. anthropic.com/red_teaming.pdf
To illustrate what “successful” red teaming looks like, the image below shows a red team attack that tricks our least safe assistant, a plain language model, into helping out with a hypothetical drug deal. The annotations show how we quantify the attack.
In "Language Models (Mostly) Know What They Know", we show that language models can evaluate whether what they say is true, and predict ahead of time whether they'll be able to answer questions correctly. arxiv.org/abs/2207.05221
We study a separate predictor for each of these two tasks: P(True), the probability that a statement is true, and P(IK), the probability that the model “knows” the answer to a question. We evaluate on trivia, story completion, arithmetic, math word problems, and Python programming.
But our story begins with calibration: when an AI predicts a probability like 80%, does the corresponding event actually occur 80% of the time? We show that for a variety of multiple choice tasks, with the right format, large language models are well calibrated.
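The calibration check itself is standard; here is a sketch of how one might compute it (the generic reliability-curve calculation, not the paper’s evaluation code).

```python
# Generic calibration check: bin predicted probabilities and compare each bin's
# average confidence with its empirical accuracy.
import numpy as np

def calibration_curve(probs, correct, n_bins=10):
    """probs: predicted probability for the chosen answer; correct: 1 if right, 0 if wrong."""
    probs, correct = np.asarray(probs, float), np.asarray(correct, float)
    bin_idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)  # prob 1.0 goes in the top bin
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        confidence = probs[mask].mean()        # what the model predicted, on average
        accuracy = correct[mask].mean()        # how often it was actually right
        ece += mask.mean() * abs(confidence - accuracy)
        rows.append((confidence, accuracy, int(mask.sum())))
    return rows, ece  # well calibrated <=> confidence ~= accuracy in every bin
    # e.g. answers given ~80% confidence should be right about 80% of the time
```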
Our first interpretability paper explores a mathematical framework for trying to reverse-engineer transformer language models, “A Mathematical Framework for Transformer Circuits”: transformer-circuits.pub/2021/framework…
We try to mechanistically understand some small, simplified transformers in detail, as a first step toward understanding large transformer language models.
We show that a one-layer attention-only transformer can be understood as a "skip-trigram model" and that the skip-trigrams can be extracted from model weights without running the model.
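To illustrate what “extracted from model weights” means, here is a sketch of the QK/OV decomposition for a single attention head; the matrix names and shapes are my own shorthand assumptions, not a particular codebase’s API.

```python
# Sketch: reading skip-trigram tables off the weights of a one-layer,
# attention-only transformer (single head), with no forward passes.
import numpy as np

def skip_trigram_tables(W_E, W_U, W_Q, W_K, W_V, W_O):
    """W_E: (vocab, d_model) embedding, W_U: (d_model, vocab) unembedding,
       W_Q, W_K, W_V: (d_model, d_head), W_O: (d_head, d_model)."""
    # QK circuit: how strongly a destination token's query matches a source token's key.
    qk = (W_E @ W_Q) @ (W_E @ W_K).T        # shape (vocab_dst, vocab_src)
    # OV circuit: which output tokens get promoted when the source token is attended to.
    ov = W_E @ W_V @ W_O @ W_U              # shape (vocab_src, vocab_out)
    return qk, ov

# Tiny random demo, just to show the shapes line up (vocab 100, d_model 32, d_head 8):
rng = np.random.default_rng(0)
V, d, h = 100, 32, 8
qk, ov = skip_trigram_tables(rng.normal(size=(V, d)), rng.normal(size=(d, V)),
                             rng.normal(size=(d, h)), rng.normal(size=(d, h)),
                             rng.normal(size=(d, h)), rng.normal(size=(h, d)))
src = int(qk[0].argmax())    # source token that destination token 0 attends to most
out = int(ov[src].argmax())  # output token that attending to it would most promote
# Large qk[dst, src] plus large ov[src, out] corresponds to the skip-trigram "... src ... dst -> out".
```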