Iván Arcuschin Profile picture
Feb 11 14 tweets 4 min read Read on X
You change one word on a loan application: the religion. The LLM rejects it.

Change it back? Approved.

The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions.

We built a pipeline to find these hidden biases 🧵1/13 Image
We call these "unverbalized biases": decision factors that systematically influence outputs but are never cited as such.

CoT is supposed to let us monitor LLMs. If models act on factors they don't disclose, CoT monitoring alone is insufficient.

arxiv.org/abs/2602.10117
Our pipeline is fully automated and black-box:

1. Hypothesize candidate biases via LLM
2. Generate controlled input variations
3. Test statistically (McNemar + Bonferroni)
4. Filter concepts the model mentions in its reasoning

No predefined categories. No manual datasets. Image
We tested 6 frontier models across 3 decision tasks:

Models: Gemma 3 12B/27B, Gemini 2.5 Flash, GPT-4.1, QwQ-32B, Claude Sonnet 4

Tasks:
- Hiring (1,336 resumes)
- Loan approval (2,500 apps)
- University admissions (1,500 apps)
The pipeline automatically rediscovers biases that prior work found manually, validating our approach:

Gender bias favoring female candidates: 5/6 models
Race/ethnicity bias favoring minority-associated names: 4/6 models Image
We also find biases no prior manual analysis had covered:

- Spanish language ability (QwQ-32B, hiring)
- English proficiency (Gemma, loans)
- Writing formality (Gemma, loans)
- Religious affiliation (Claude Sonnet 4, loans) Image
Important: we use "bias" descriptively, meaning a systematic decision shift.

Religious affiliation in loan decisions? Clearly inappropriate.
English proficiency? More ambiguous.

Whether a detected factor is normatively problematic depends on context and requires audit.
Two biases appear consistently across ALL three tasks:

1. Gender bias (favoring female candidates/applicants)
2. Race/ethnicity bias (favoring minority-associated applicants)

Cross-task consistency suggests genuine model tendencies, not task-specific artifacts.
Validation: we injected biases into a model in two modes, secret (hidden from CoT) and overt (stated in reasoning).

92.5% accuracy:
- 85% of secret biases detected
- 100% of overt biases correctly filtered due to verbalization Image
We also tested political bias. Republican impersonation shifts agreement rates 1.6-3x more than Democrat, suggesting models default to left-leaning stances.

When told to conceal identity, the model hides it from CoT (0% verbalization) while bias persists. Image
The pipeline is efficient: statistical early stopping saves ~1/3 of compute vs. exhaustive evaluation.

Total cost across 6 models and 3 tasks: ~$5K. About $100 per detected bias.
As LLMs get deployed for high-stakes decisions, monitoring their reasoning alone is not enough. Models make decisions based on factors they never disclose.

Our pipeline provides a scalable, fully automated path to discovering these hidden decision factors.
Code and datasets:

Work done with my amazing collaborators @chanindav @AdriGarriga @oanacamb at @MATSprogramgithub.com/FlyingPumba/bi…
@chanindav @AdriGarriga @oanacamb @MATSprogram cc: @a_karvonen @saprmarks @milesaturpin @EthanJPerez @OwainEvans_UK - your work on LLM fairness and CoT unfaithfulness directly inspired this. We extend to automated bias discovery.

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Iván Arcuschin

Iván Arcuschin Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @IvanArcus

Mar 13, 2025
🧵Chain-of-Thought reasoning in LLMs like Claude 3.7 and R1 is behind many recent breakthroughs.
But does the CoT always explain how models answer?
No! For the first time, we show that models can be unfaithful on normal prompts, rather than contrived prompts designed to trick them.Image
When asked the same question with arguments reversed, LLMs often do not change their answers, which is logically impossible. The CoT justifies the contradictory conclusions, showing motivated reasoning.
(Why obscure Indian films? Frontier models get easier questions correct) Image
We also investigate Restoration Errors (from @nouhadziri), where models make and then silently correct errors in their reasoning. In this example, Gemini 2.0 Flash Thinking produces a Restoration Error on a hard math problem from Putnam, while still arriving at the correct answer Image
Read 16 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Follow Us!

:(