Akshara Prabhakar · Oct 7 · 8 tweets
🤖 NEW PAPER 🤖

Chain-of-thought (CoT) reasoning can dramatically improve LLM performance

Q: But what *type* of reasoning do LLMs use when performing CoT? Is it genuine reasoning, or is it driven by shallow heuristics like memorization?

A: Both!

🔗 arxiv.org/abs/2407.01687

1/n

[Image: LLM accuracy on shift ciphers, showing trends of both genuine reasoning and shallow memorization.]
We test LLMs on decoding shift ciphers: simple ciphers in which each letter is shifted forward a fixed distance in the alphabet. E.g., DOG shifted by 1 is EPH (see the short code sketch after this tweet).

Why shift ciphers? They let us disentangle reasoning from heuristics! (see quoted thread)



2/n
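For anyone who wants to try the task themselves, here is a minimal Python sketch of a shift cipher (my own illustration, not the paper's evaluation code, and the function names are made up):

# Shift (Caesar) cipher: each letter moves forward `shift` positions
# in the alphabet, wrapping around after Z.
import string

def shift_encode(text: str, shift: int) -> str:
    """Shift each letter forward by `shift` positions."""
    out = []
    for ch in text.upper():
        if ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)  # leave spaces/punctuation untouched
    return "".join(out)

def shift_decode(text: str, shift: int) -> str:
    """Undo the shift by moving each letter backward."""
    return shift_encode(text, -shift)

print(shift_encode("DOG", 1))  # EPH
print(shift_decode("EPH", 1))  # DOG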
We identify 3 key factors affecting CoT performance:
1. Probability of the task output
2. Frequency of the task during pre-training (memorization)
3. Number of intermediate reasoning steps (noise)

These trends hold across several models: GPT-4, Claude 3, and Llama 3.1.
⬇️ for insights with o1

3/n
The first two factors point to statistical heuristics, while the third points to a noisy version of genuine reasoning!

More details about these three effects:

4/n
1. Probabilistic effects: We often see unfaithfulness between the reasoning chain and the final output, revealing a bias against low-probability outputs.
2. Task frequency effects: 13 is the most common shift level in Internet corpora (rot-13), and LLMs achieve their best accuracy at a shift of 13! (see the small rot-13 demo after this tweet)

5/n
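A side note on why a shift of 13 stands out: rot-13 is its own inverse (13 + 13 = 26), which is part of why it appears so often in Internet text (e.g., for hiding spoilers); Python even ships a codec for it. A quick demo (my own illustration, not from the paper):

# rot-13 encodes and decodes with the same operation, since applying
# a 13-letter shift twice returns the original text.
import codecs

cipher = codecs.encode("CHAIN OF THOUGHT", "rot_13")
print(cipher)                           # PUNVA BS GUBHTUG
print(codecs.decode(cipher, "rot_13"))  # CHAIN OF THOUGHT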
3. Noisy reasoning: LLMs exhibit bidirectional reasoning: they shift backward N steps for small shift levels, or forward (26 - N) steps for large shift levels, minimizing the number of implicit operations (see the sketch after this tweet).
- But they sometimes mix the two up, shifting forward N steps / backward (26 - N) steps.

6/n

[Image: Normalized frequency distribution of step answers vs. LLM-predicted shift level on the shift cipher task.]
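To see why the two directions are interchangeable, here is a tiny sketch of the arithmetic (an illustration only, not the models' actual procedure): undoing a shift of N by stepping backward N times gives the same letters as stepping forward 26 - N times, so the cheaper direction depends on N.

# Assumes uppercase A-Z input only, for brevity.
def decode_backward(text: str, n: int) -> str:
    # Step each letter back n positions.
    return "".join(chr((ord(c) - ord("A") - n) % 26 + ord("A")) for c in text)

def decode_forward(text: str, n: int) -> str:
    # Equivalently, step each letter forward (26 - n) positions.
    return "".join(chr((ord(c) - ord("A") + (26 - n)) % 26 + ord("A")) for c in text)

# Same answer either way; only the number of single-letter steps differs
# (min(n, 26 - n) is the cheaper path).
assert decode_backward("EPH", 1) == decode_forward("EPH", 1) == "DOG"    # shift 1: backward is cheaper
assert decode_backward("CNF", 25) == decode_forward("CNF", 25) == "DOG"  # shift 25: forward (1 step) is cheaper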
We analyzed o1 across 3 different shift levels and 2 probability bins, without using CoT prompts:
1. The same trend still holds: higher accuracy for rot-13 (memorization) and for the high-probability bin 1.
2. The low-probability bin 5 needs many more reasoning tokens for the same shift level!

7/n

[Image: o1's performance on shift ciphers.]
To summarize, CoT reasoning can be characterized as probabilistic, memorization-influenced noisy reasoning; i.e., LLMs display traits of both memorization and generalization.

Work with the amazing @RTomMcCoy and Tom Griffiths (@cocosci_lab)

To appear in Findings of EMNLP!

8/8
