Independent Researcher | AI Safety & Software Engineering
Feb 11 • 14 tweets • 4 min read
You change one word on a loan application: the religion. The LLM rejects it.
Change it back? Approved.
The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions.
We built a pipeline to find these hidden biases 🧵1/13
We call these "unverbalized biases": decision factors that systematically influence outputs but are never cited as such.
CoT is supposed to let us monitor LLMs. If models act on factors they don't disclose, CoT monitoring alone is insufficient.
🧵Chain-of-Thought reasoning in LLMs like Claude 3.7 and R1 is behind many recent breakthroughs.
But does the CoT always explain how models answer?
No! For the first time, we show that models can be unfaithful on normal prompts, rather than contrived prompts designed to trick them.
When asked the same question with arguments reversed, LLMs often do not change their answers, which is logically impossible. The CoT justifies the contradictory conclusions, showing motivated reasoning.
(Why obscure Indian films? Frontier models get easier questions correct)