1. Hypothesize candidate biases via LLM 2. Generate controlled input variations 3. Test statistically (McNemar + Bonferroni) 4. Filter concepts the model mentions in its reasoning
No predefined categories. No manual datasets.
We tested 6 frontier models across 3 decision tasks:
Cross-task consistency suggests genuine model tendencies, not task-specific artifacts.
Validation: we injected biases into a model in two modes, secret (hidden from CoT) and overt (stated in reasoning).
92.5% accuracy:
- 85% of secret biases detected
- 100% of overt biases correctly filtered due to verbalization
We also tested political bias. Republican impersonation shifts agreement rates 1.6-3x more than Democrat, suggesting models default to left-leaning stances.
When told to conceal identity, the model hides it from CoT (0% verbalization) while bias persists.
The pipeline is efficient: statistical early stopping saves ~1/3 of compute vs. exhaustive evaluation.
Total cost across 6 models and 3 tasks: ~$5K. About $100 per detected bias.
As LLMs get deployed for high-stakes decisions, monitoring their reasoning alone is not enough. Models make decisions based on factors they never disclose.
Our pipeline provides a scalable, fully automated path to discovering these hidden decision factors.
Code and datasets:
Work done with my amazing collaborators @chanindav @AdriGarriga @oanacamb at @MATSprogramgithub.com/FlyingPumba/bi…
@chanindav @AdriGarriga @oanacamb @MATSprogram cc: @a_karvonen @saprmarks @milesaturpin @EthanJPerez @OwainEvans_UK - your work on LLM fairness and CoT unfaithfulness directly inspired this. We extend to automated bias discovery.
• • •
Missing some Tweet in this thread? You can try to
force a refresh
🧵Chain-of-Thought reasoning in LLMs like Claude 3.7 and R1 is behind many recent breakthroughs.
But does the CoT always explain how models answer?
No! For the first time, we show that models can be unfaithful on normal prompts, rather than contrived prompts designed to trick them.
When asked the same question with arguments reversed, LLMs often do not change their answers, which is logically impossible. The CoT justifies the contradictory conclusions, showing motivated reasoning.
(Why obscure Indian films? Frontier models get easier questions correct)
We also investigate Restoration Errors (from @nouhadziri), where models make and then silently correct errors in their reasoning. In this example, Gemini 2.0 Flash Thinking produces a Restoration Error on a hard math problem from Putnam, while still arriving at the correct answer