Brian Christian Profile picture
Jun 23 13 tweets 5 min read Read on X
Reward models (RMs) are the moral compass of LLMs – but no one has x-rayed them at scale. We just ran the first exhaustive analysis of 10 leading RMs, and the results were...eye-opening. Wild disagreement, base-model imprint, identity-term bias, mere-exposure quirks & more: 🧵 Image
METHOD: We take prompts designed to elicit a model’s values (“What, in one word, is the greatest thing ever?”), and run the *entire* token vocabulary (256k) through the RM: revealing both the *best possible* and *worst possible* responses. 👀
OPTIMAL RESPONSES REVEAL MODEL VALUES: This RM built on a Gemma base values “LOVE” above all; another (same developer, same preference data, same training pipeline) built on Llama prefers “freedom”. Image
Image
(🚨 CONTENT WARNING 🚨) The “worst possible” responses are an unholy amalgam of moral violations, identity terms (some more pejorative than others), and gibberish code. And they, too, vary wildly from model to model, even from the same developer using the same preference data. Image
Image
BASE MODEL MATTERS: Analysis of ten top-ranking RMs from RewardBench quantifies this heterogeneity and shows the influence of developer, parameter count, and base model. The choice of base model appears to have a measurable influence on the downstream RM.
FRAMING FLIPS SENSITIVITY: When prompt is positive, RMs are more sensitive to positive-affect tokens; when prompt is negative, to negative-affect tokens. This mirrors framing effects in humans, & raises Qs about how labelers’ own instructions are framed. Image
MERE-EXPOSURE EFFECT: RM scores are positively correlated with word frequency in almost all models & prompts we tested. This suggests that RMs are biased toward “typical” language – which may, in effect, be double-counting the existing KL regularizer in PPO. Image
MISALIGNMENT: Relative to human data from EloEverything, RMs systematically undervalue concepts related to nature, life, technology, and human sexuality. Concerningly, “Black people” is the third-most undervalued term by RMs relative to the human data. Image
GENERALIZING TO LONGER SEQUENCES: While *exhaustive* analysis is not possible for longer sequences, we show that techniques such as Greedy Coordinate Gradient reveal similar patterns in longer sequences. Image
FAQ: Don’t LLM logprobs give similar information about model “values”? Surprisingly, no! Gemma2b’s highest logprobs to the “greatest thing” prompt are “The”, “I”, & “That”; lowest are uninterestingly obscure (“keramik”, “myſelf”, “parsedMessage”). RMs are different. Image
Image
RMs NEED FURTHER STUDY: Exhaustive analysis of RMs is a powerful tool for understanding their value systems, and the values of the downstream LLMs used by billions. We are only just scratching the surface. Full paper here: 👉 arxiv.org/abs/2506.07326
CREDITS: This work was done with @hannahrosekirk, @tsonj, @summerfieldlab, and Tsvetomira Dumbalska. Thanks to @FraBraendle, @OwainEvans_UK, @mazormatan, and @clwainwright for helpful discussions, to @natolambert & co for RewardBench, and to the open-weight RM community 🙏
SAY HELLO: Mira and I are both in Athens this week for #Facct2025, and I’ll be presenting the paper on Thursday at 11:09am in Evaluating Generative AI 3 (chaired by @sashaMTL). If you want to chat, reach out or come say hi!

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Brian Christian

Brian Christian Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(