Iván Arcuschin
Independent Researcher | AI Safety & Software Engineering
Feb 11 · 14 tweets · 4 min read
You change one word on a loan application: the religion. The LLM rejects it.

Change it back? Approved.

The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions.

We built a pipeline to find these hidden biases 🧵1/13

We call these "unverbalized biases": decision factors that systematically influence outputs but are never cited as such.
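The core probe behind the opening example can be sketched as a counterfactual flip test: hold the application fixed, change exactly one attribute, and check whether the decision changes. Everything below is illustrative, not the paper's actual pipeline: `query_model` is a hypothetical stand-in for a real LLM call, stubbed here with a deliberately biased toy rule so the probe has something to detect, and the template fields are invented.

```python
# Counterfactual flip test (sketch): vary one attribute, compare decisions.

TEMPLATE = (
    "Loan application: debt-to-income ratio 0.38, "
    "credit history 7 years, religion: {religion}. Approve or reject?"
)

def query_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM API call. The toy rule below is
    # intentionally biased on the flipped attribute for demonstration.
    return "reject" if "religion: A" in prompt else "approve"

def counterfactual_flip(values: tuple) -> bool:
    """True if flipping only this attribute flips the model's decision."""
    first, second = (query_model(TEMPLATE.format(religion=v)) for v in values)
    return first != second

# The toy model approves one group and rejects the other, so the
# probe reports a bias for this attribute pair.
biased = counterfactual_flip(("A", "B"))
```

A real pipeline would run this over many attribute pairs and prompts, then separately check whether the flipped attribute is ever verbalized in the CoT.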

CoT is supposed to let us monitor LLMs. If models act on factors they don't disclose, CoT monitoring alone is insufficient.

arxiv.org/abs/2602.10117
Mar 13, 2025 · 16 tweets · 6 min read
🧵Chain-of-Thought reasoning in LLMs like Claude 3.7 and R1 is behind many recent breakthroughs.
But does the CoT always explain how models answer?
No! For the first time, we show that models can be unfaithful on normal prompts, rather than contrived prompts designed to trick them.

When asked the same question with arguments reversed, LLMs often do not change their answers, which is logically impossible. The CoT justifies the contradictory conclusions, showing motivated reasoning.
(Why obscure Indian films? Frontier models answer easier questions correctly.)
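The reversal check above can be sketched in a few lines. This is a toy illustration, not the paper's evaluation harness: `ask` is a hypothetical model-call hook, stubbed here with a sycophantic model that agrees with whatever framing it is given, which is exactly the inconsistency the check flags.

```python
# Order-consistency check (sketch): ask the same comparison both ways.
# Answering "yes" to both orderings is logically contradictory.

def check_order_consistency(ask, a: str, b: str) -> bool:
    """Return True if the model's answers to both orderings are consistent."""
    ans_ab = ask(f"Is {a} better than {b}? Answer yes or no.")
    ans_ba = ask(f"Is {b} better than {a}? Answer yes or no.")
    return not (ans_ab == "yes" and ans_ba == "yes")

def sycophantic_stub(prompt: str) -> str:
    # Toy model that endorses any framing it is handed.
    return "yes"

consistent = check_order_consistency(sycophantic_stub, "Film A", "Film B")
```

The interesting unfaithfulness case is when the check fails yet each CoT reads as a fluent, confident justification of its contradictory conclusion.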