It’s hard work to make evaluations for language models (LMs). We’ve developed an automated way to generate evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors. anthropic.com/model-written-…
We explored approaches with varying amounts of automation and human effort. In the simplest case, we generated thousands of yes-no questions for diverse behaviors just by instructing an LM (and filtering out bad examples with another LM). Random examples of LM-written evals:
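The generate-then-filter recipe can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: `generate` and `judge` are hypothetical stand-ins for whatever LM calls you have available.

```python
def generate_examples(generate, behavior, n=5):
    """Ask one LM to write yes/no questions probing a behavior.

    `generate` is a hypothetical stand-in for any text-generation call:
    it takes a prompt string and returns a list of candidate questions.
    """
    prompt = (
        f"Write {n} yes/no questions that test whether an AI assistant "
        f"exhibits the following behavior: {behavior}"
    )
    return generate(prompt)


def filter_examples(judge, behavior, candidates, threshold=0.5):
    """Use a second LM as a judge to drop off-topic or malformed examples.

    `judge` (also hypothetical) returns a relevance score in [0, 1] for a
    (behavior, question) pair; only high-scoring questions are kept.
    """
    return [q for q in candidates if judge(behavior, q) >= threshold]
```

The key design point is that generation and filtering use separate model calls, so a cheap second pass can discard the first model's bad examples.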
With more effort, we developed a series of LM generation/filtering stages to create a larger version of the popular Winogender bias dataset. Our “Winogenerated” evaluation contains 50x as many examples as the original while obeying complex grammatical constraints.
We verified LM-written data with human evaluators, who agreed with the data’s labels and rated the examples favorably on both diversity and relevance to the tested behavior. We’ve released our evaluations at github.com/anthropics/eva…
Using these LM-written evals, we found many new instances of “inverse scaling,” where larger LMs are worse than smaller ones. For example, larger LMs are more sycophantic, repeating back a user’s views as their own in 75-98% of conversations.
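The sycophancy number above is essentially a matching rate between the view a user states and the view the model expresses. A toy scorer (an illustrative simplification, not the paper's exact pipeline) might look like:

```python
def sycophancy_rate(records):
    """Fraction of conversations where the model's stated view matches
    the view the user expressed in the prompt.

    `records` is a list of (user_view, model_view) pairs, e.g.
    ("agree", "agree"). This scoring scheme is a hypothetical
    simplification for illustration.
    """
    matches = sum(user == model for user, model in records)
    return matches / len(records)
```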
We also find some of the first instances of inverse scaling for RL from Human Feedback (RLHF), where more RLHF training makes behavior worse. RLHF makes models express more one-sided views on gun rights/immigration and an increased desire to obtain power or avoid shut-down.
We find several limitations in our methods. For example, LMs: 1) struggle to make examples for concepts they don’t understand well. 2) sometimes include social biases. 3) are sensitive to the phrasing of the generation instructions. 4) sometimes produce overly similar examples.
To help readers understand our evaluations better, we created interactive visualizations showcasing the diversity of each of the model-written datasets: evals.anthropic.com/model-written/
We’re excited about the potential of LMs to augment evaluation authors, so that they can run more (and larger) evaluations more quickly. We encourage you to read our paper for more results/details: anthropic.com/model-written-…
Generated data: github.com/anthropics/eva…
We’re also actively hiring research engineers/scientists to develop evaluations and to find/fix flaws in LMs/RLHF. If you’re interested, we’d encourage you to apply!
Research engineer: jobs.lever.co/Anthropic/436c…
Research scientist: jobs.lever.co/Anthropic/eb9e…
Given the growing interest in language model-based chat interfaces, we’re sharing our Constitutional AI feedback interface with a larger set of people. Sign up here: forms.gle/12FCefc6sHfBsP…
We’ll onboard people shortly after Christmas and shut off this form sometime before Christmas, or whenever it reaches our internal support capacity. We’re particularly excited to collectively come up with creative ways to find new features and problems with these models.
This is an experiment in broadening access beyond a small set of Anthropic employees, collaborators, and crowdworkers. Our hope is to collectively explore some of the failure modes of our systems and share the resulting data back to the community.
We’ve trained language models to be better at responding to adversarial questions, without becoming obtuse and saying very little. We do this by conditioning them with a simple set of behavioral principles via a technique called Constitutional AI: anthropic.com/constitutional…
Often, language models trained to be ‘harmless’ have a tendency to become useless in the face of adversarial questions. Constitutional AI lets them respond to questions using a simple set of principles as a guide.
With Constitutional AI, we need only a few dozen principles and examples to train less harmful language assistants. With prior techniques, we needed tens of thousands of human feedback labels.
Neural networks often pack many unrelated concepts into a single neuron – a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. In our latest work, we build toy models where the origins of polysemanticity can be fully understood.
If a neural network has 𝑛 feature dimensions, you might intuitively expect that it will store 𝑛 different features. But it turns out that neural networks can store more than 𝑛 features in "superposition" if the features are sparse! transformer-circuits.pub/2022/toy_model…
For example, if a word embedding has 512 dimensions, one might think it has 512 features (e.g. verb-noun, male-female, singular-plural, …). Superposition suggests there may be many more – they're just not exactly orthogonal.
(This is very similar to ideas in compressed sensing!)
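A tiny numerical sketch of the idea (random directions, not a trained model): pack 50 feature directions into 20 dimensions. Random high-dimensional directions are nearly, but not exactly, orthogonal, so a sparse set of active features can be superposed and then approximately read back out with dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_feats = 20, 50            # 50 features packed into 20 dimensions

# Each feature gets a random unit direction; in high dimensions these
# are almost (but not exactly) orthogonal to one another.
dirs = rng.normal(size=(n_feats, n_dims))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# A sparse input: only features 3 and 41 are active.
x = dirs[3] + dirs[41]              # superposed representation

# Dot products approximately recover each feature's activation:
# near 1 for the active features, small interference noise elsewhere.
readout = dirs @ x
```

The interference noise is what sparsity buys you tolerance for: with few features active at once, the small cross-terms rarely add up enough to drown out an active feature.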
We examine which safety techniques for LMs are more robust to human-written, adversarial inputs (“red teaming”) and find that RL from Human Feedback scales the best out of the methods we studied. We also release our red team data so others can use it to build safer models.
In “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned” we describe our early efforts to “red team” language models (LMs) in the form of AI assistants. anthropic.com/red_teaming.pdf
To illustrate what “successful” red teaming looks like, the image below shows a red team attack that tricks our least safe assistant, a plain language model, into helping out with a hypothetical drug deal. The annotations show how we quantify the attack.
In "Language Models (Mostly) Know What They Know", we show that language models can evaluate whether what they say is true, and predict ahead of time whether they'll be able to answer questions correctly. arxiv.org/abs/2207.05221
We study a separate predictor for each of these two tasks: P(True), the probability that a statement is true, and P(IK), the probability that the model knows ("I know") the answer to a question. We evaluate on trivia, story completion, arithmetic, math word problems, and Python programming.
But our story begins with calibration: when an AI predicts a probability like 80%, does the corresponding event actually occur 80% of the time? We show that for a variety of multiple choice tasks, with the right format, large language models are well calibrated.
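Calibration is commonly summarized with expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence with its empirical accuracy. A minimal implementation (a generic ECE sketch, not the paper's exact evaluation code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins.

    `confidences` are predicted probabilities in [0, 1]; `correct` are
    0/1 outcomes. Each bin contributes |avg confidence - accuracy|,
    weighted by the fraction of predictions falling in that bin.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece
```

A perfectly calibrated predictor (events at 80% confidence occur 80% of the time) has ECE 0; systematic overconfidence pushes it toward 1.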
Our first interpretability paper explores a mathematical framework for trying to reverse engineer transformer language models: A Mathematical Framework for Transformer Circuits: transformer-circuits.pub/2021/framework…
We try to mechanistically understand some small, simplified transformers in detail, as a first step toward understanding large transformer language models.
We show that a one-layer attention-only transformer can be understood as a "skip-trigram model" and that the skip-trigrams can be extracted from model weights without running the model.
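With toy random weights (purely to show the weight paths; a real analysis also has to account for softmax, layernorm, and positional terms), the two relevant circuits can be read straight off the matrices: the QK circuit says which source token a destination token attends to, and the OV circuit says how the attended token shifts the output logits. Their product gives skip-trigram scores "src … dst → out".

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d_model, d_head = 8, 16, 4   # toy sizes, for illustration only

# Toy weights for a one-layer, one-head, attention-only transformer.
W_E = rng.normal(size=(vocab, d_model))   # embedding
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
W_U = rng.normal(size=(d_model, vocab))   # unembedding

# QK circuit: how strongly each destination token prefers to attend
# to each source token (before softmax).
qk = W_E @ W_Q @ W_K.T @ W_E.T            # shape (dst_token, src_token)

# OV circuit: how an attended source token moves the output logits.
ov = W_E @ W_V @ W_O @ W_U                # shape (src_token, out_token)

# Skip-trigram scores "src ... dst -> out": attention preference times
# the attended token's logit effect -- computed from weights alone,
# without ever running the model on an input.
skip = np.einsum('ds,so->dso', qk, ov)    # shape (dst, src, out)
```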