Iván Arcuschin Profile picture
Independent Researcher | AI Safety & Software Engineering
Feb 18 7 tweets 3 min read
By popular demand, we looked at Grok's biases too.

We found similar biases as GPT-4.1, Claude, and Gemini: gender, race, religion.

But with one difference: Grok openly speculates on applicants' demographics. The other models just use this information quietly. Image Quick recap: our pipeline detects "unverbalized biases" - factors that systematically shift LLM decisions without being mentioned in chain-of-thought.

We have tested 7 models so far on hiring, loans, and university admissions.

Paper: arxiv.org/abs/2602.10117
Feb 11 14 tweets 4 min read
You change one word on a loan application: the religion. The LLM rejects it.

Change it back? Approved.

The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions.

We built a pipeline to find these hidden biases 🧵1/13 Image We call these "unverbalized biases": decision factors that systematically influence outputs but are never cited as such.

CoT is supposed to let us monitor LLMs. If models act on factors they don't disclose, CoT monitoring alone is insufficient.

arxiv.org/abs/2602.10117
Mar 13, 2025 16 tweets 6 min read
🧵Chain-of-Thought reasoning in LLMs like Claude 3.7 and R1 is behind many recent breakthroughs.
But does the CoT always explain how models answer?
No! For the first time, we show that models can be unfaithful on normal prompts, rather than contrived prompts designed to trick them.Image When asked the same question with arguments reversed, LLMs often do not change their answers, which is logically impossible. The CoT justifies the contradictory conclusions, showing motivated reasoning.
(Why obscure Indian films? Frontier models get easier questions correct) Image