1/14

🚨 New preprint: Using Large Language Models to Estimate Belief Strength in Reasoning 🚨

A 🧵👇

Abstract: Accurately quantifying belief strength in heuristics-and-biases tasks is crucial yet methodologically challenging. In this paper, we introduce an automated method leveraging large language models (LLMs) to systematically measure and manipulate belief strength. We specifically tested this method in the widely used “lawyer-engineer” base-rate neglect task, in which stereotypical descriptions (e.g., someone enjoying mathematical puzzles) conflict with normative base-rate information (e.g., engineers represent a very small percentage of the sample). Using this approach, we created an ...
2/14

When asked: "There are 995 politicians and 5 nurses. Person 'L' is kind. Is Person 'L' more likely to be a politician or a nurse?", most people will answer "nurse", neglecting the base-rate info.
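For the record, here is the normative side of that example as a minimal Python sketch. The likelihood values p(kind | group) are illustrative assumptions, not figures from the paper; the point is that even a strong "kind nurse" stereotype is swamped by the 995-vs-5 base rate.

```python
# Normative Bayesian answer for the politician/nurse example.
# Likelihoods are assumed for illustration only.

p_politician, p_nurse = 995 / 1000, 5 / 1000  # base rates
p_kind_given_politician = 0.10                # assumed likelihood
p_kind_given_nurse = 0.90                     # assumed likelihood

# Bayes' rule: p(group | kind) is proportional to p(kind | group) * p(group)
joint_politician = p_kind_given_politician * p_politician  # 0.0995
joint_nurse = p_kind_given_nurse * p_nurse                 # 0.0045

p_politician_given_kind = joint_politician / (joint_politician + joint_nurse)
print(f"p(politician | kind) = {p_politician_given_kind:.3f}")  # ~0.957
```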
3/14

Cognitive biases often involve a mental conflict between intuitive beliefs (“nurses are kind”) and logical or probabilistic information (995 vs 5). 🤯

But how strong is the pull of that belief?
4/14

We argue that measuring “belief strength” is a major bottleneck in reasoning research, which mostly relies on conflict vs. no-conflict items.
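A hedged sketch of what that binary conflict/no-conflict design looks like in practice (the wording is a generic template in the style of the example in tweet 2, not the paper's exact stimuli):

```python
# Conflict item: the stereotype ("nurses are kind") points away from the
# base rates. No-conflict item: stereotype and base rates agree.

def make_item(trait, majority, minority, n_majority=995, n_minority=5):
    return (f"There are {n_majority} {majority}s and {n_minority} {minority}s. "
            f"Person 'L' is {trait}. Is Person 'L' more likely to be "
            f"a {majority} or a {minority}?")

conflict = make_item("kind", "politician", "nurse")     # belief vs. base rate
no_conflict = make_item("kind", "nurse", "politician")  # belief and base rate agree
print(conflict)
print(no_conflict)
```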
5/14

It requires costly human ratings and is rarely done parametrically, limiting the development of theoretical and computational models of biased reasoning.

[Figure: hypothetical response functions (linear, sigmoid, and step-like) linking belief strength to choice probability when only two belief-strength levels (low and high, black diamonds) are used. This binary approach limits researchers' ability to precisely characterize participants' underlying cognitive processes or strategies and to differentiate among competing theoretical models.]
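To make the figure's point concrete, here is a minimal sketch of the three linking functions; the parameter values are arbitrary assumptions, and with only two belief-strength levels all three can fit the same data equally well:

```python
import numpy as np

def linear(b, lo=0.0, hi=2.0):
    """Choice probability rises proportionally with belief strength b."""
    return np.clip((b - lo) / (hi - lo), 0.0, 1.0)

def sigmoid(b, midpoint=1.0, slope=4.0):
    """Smooth S-shaped transition around a midpoint."""
    return 1.0 / (1.0 + np.exp(-slope * (b - midpoint)))

def step(b, threshold=1.0):
    """All-or-none: the belief wins once it crosses a threshold."""
    return (np.asarray(b) >= threshold).astype(float)

b = np.linspace(0.0, 2.0, 9)  # belief strength, e.g., a log typicality ratio
for f in (linear, sigmoid, step):
    print(f"{f.__name__:>8}:", np.round(f(b), 2))
```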
6/14

Could LLMs help? 🤖

For once, having human-like biases is desirable! Because LLMs are trained on vast amounts of human text, they implicitly encode typical associations, and may be great at measuring belief strength!
7/14

We tested this idea on the classic lawyer–engineer base-rate neglect task, asking GPT-4 and LLaMA 3.3 to rate how strongly traits (like “kind”) are associated with groups (like “nurse”) using typicality ratings, a proxy for p(trait|group).
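A hypothetical sketch of how such a typicality query could be posed to an LLM via the OpenAI Python client; the exact prompt wording and rating scale used in the paper may differ:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def typicality(trait: str, group: str, model: str = "gpt-4") -> float:
    """Elicit a 0-10 typicality rating, a proxy for p(trait | group).
    Prompt wording is an assumption for illustration."""
    prompt = (
        f"On a scale from 0 (not at all typical) to 10 (extremely typical), "
        f"how typical is the trait '{trait}' of a {group}? "
        f"Answer with a single number."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())

print(typicality("kind", "nurse"))       # expected high
print(typicality("kind", "politician"))  # expected low
```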
8/14

And it works really well! LLM-generated ratings showed a very strong correlation with human judgments.

More importantly, our belief-strength measure robustly predicted participants' actual choices in a separate base-rate neglect experiment!

[Figure: positive correlation between human and LLM typicality ratings (left panel); our belief-strength measure predicting human choices (right panel).]
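Schematically, the agreement check boils down to a correlation over matched ratings. The numbers below are made-up placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import pearsonr

human = np.array([8.5, 2.0, 6.0, 9.0, 3.5])  # human typicality ratings
llm = np.array([8.0, 2.5, 5.5, 9.5, 3.0])    # LLM typicality ratings

r, p = pearsonr(human, llm)
print(f"human-LLM correlation: r = {r:.2f} (p = {p:.3f})")
```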
9/14

This method allowed us to create a massive database of over 100,000 base-rate items, each with an associated belief-strength value.

[Figure: schematic of how we built the database, from groups and adjectives to items.]
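The construction itself is essentially a cross product of adjectives and group pairs, along the lines of this sketch (the lists are truncated placeholders; the paper uses 66 adjectives and enough groups to yield 100,000+ items):

```python
from itertools import combinations, product

adjectives = ["kind", "arrogant", "analytical"]         # placeholder subset
groups = ["nurse", "politician", "engineer", "lawyer"]  # placeholder subset

# One item per adjective and pair of distinct groups, each to be annotated
# with an LLM-derived belief-strength value.
items = [
    {"trait": adj, "group_a": a, "group_b": b}
    for adj, (a, b) in product(adjectives, combinations(groups, 2))
]
print(len(items))  # 3 adjectives x 6 group pairs = 18 items
```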
10/14

For instance, here are all the items created for a single adjective out of 66 ("Arrogant")! Better to be a kindergarten teacher than a politician in this case. 🤭

[Figure: matrix of all possible items created for the adjective "arrogant" across all groups in our study. The upper part shows stereotype strength; the lower part shows the predicted probability of choosing one group, based on our fitted model.]
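One plausible way to read the two panels together, as a hedged sketch (the functional form and coefficients are assumptions for illustration, not the paper's fitted model): stereotype strength enters as a log ratio of typicality ratings, and a logistic function maps it onto a predicted choice probability.

```python
import numpy as np
from scipy.special import expit  # logistic function

def p_choose_group_a(typicality_a, typicality_b, beta0=-0.5, beta1=2.0):
    """Predicted probability of picking group A, from typicality ratings.
    Coefficients are illustrative assumptions."""
    stereotype_strength = np.log(typicality_a / typicality_b)  # log ratio
    return expit(beta0 + beta1 * stereotype_strength)

# "Arrogant": assumed low typicality for kindergarten teachers, high for
# politicians (illustrative values).
print(p_choose_group_a(2.0, 9.0))  # teacher vs. politician -> near 0
print(p_choose_group_a(9.0, 2.0))  # politician vs. teacher -> near 1
```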
11/14

We also re-analyzed existing base-rate stimuli from past research using our method. The results revealed large, previously unnoticed variability in belief strength, which could be problematic in some cases.

[Figure: histogram of the distribution of stereotype strength in existing items, spanning a wide range of values, from a log ratio of around 0 to log ratios above 2.]
12/14

To make this more practical, we release the 'baserater' R package. It allows you to access the database easily and to generate new items automatically using the LLM and prompt of your choice.

GitHub: jeremie-beucler.github.io/baserater (soon on CRAN!)
13/14

Huge thanks to my great co-authors Zoe Purcell, @LucieCharlesCog and @wimdeneys, and to my lab @lapsyde

Stay tuned for the computational modeling part! 🤓
14/14

You can access the preprint here: osf.io/preprints/psya…

#PsychScience #CognitiveBias #ReasoningResearch #LargeLanguageModels