Akshara Prabhakar · Oct 7 · 8 tweets
🤖 NEW PAPER 🤖

Chain-of-thought (CoT) reasoning can dramatically improve LLM performance

Q: But what *type* of reasoning do LLMs use when performing CoT? Is it genuine reasoning, or is it driven by shallow heuristics like memorization?

A: Both!

🔗 arxiv.org/abs/2407.01687

1/n

[Image: LLM accuracy on shift ciphers, showing trends of both genuine reasoning and shallow memorization.]
We test LLMs on decoding shift ciphers: simple ciphers in which each letter is shifted forward a fixed distance in the alphabet. E.g., DOG shifted by 1 is EPH (see the short code sketch after this tweet).

Why shift ciphers? They let us disentangle reasoning from heuristics! (see quoted thread)



2/n
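For anyone who wants to try the task themselves, here is a minimal Python sketch of a shift cipher (my own illustration, not the paper's evaluation code, and the function names are made up):

# Shift (Caesar) cipher: each letter moves forward `shift` positions
# in the alphabet, wrapping around after Z.
import string

def shift_encode(text: str, shift: int) -> str:
    """Shift each letter forward by `shift` positions."""
    out = []
    for ch in text.upper():
        if ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)  # leave spaces/punctuation untouched
    return "".join(out)

def shift_decode(text: str, shift: int) -> str:
    """Undo the shift by moving each letter backward."""
    return shift_encode(text, -shift)

print(shift_encode("DOG", 1))  # EPH
print(shift_decode("EPH", 1))  # DOG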
We identify 3 key factors affecting CoT performance:
1. Probability of the task output
2. Frequency of the task during pre-training (memorization)
3. Number of intermediate reasoning steps (noise)

These trends hold across several models: GPT-4, Claude 3, and Llama 3.1.
⬇️ for insights with o1

3/n
The first two factors point to statistical heuristics, while the third points to a noisy version of genuine reasoning!

More details about these three effects:

4/n
1. Probabilistic effects: We often see unfaithfulness between the reasoning chain and the final output, revealing a bias against low-probability outputs.
2. Task frequency effects: 13 is the most common shift level in Internet corpora (rot-13), and LLMs achieve their best accuracy at a shift of 13! (see the small rot-13 demo after this tweet)

5/n
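A side note on why a shift of 13 stands out: rot-13 is its own inverse (13 + 13 = 26), which is part of why it appears so often in Internet text (e.g., for hiding spoilers); Python even ships a codec for it. A quick demo (my own illustration, not from the paper):

# rot-13 encodes and decodes with the same operation, since applying
# a 13-letter shift twice returns the original text.
import codecs

cipher = codecs.encode("CHAIN OF THOUGHT", "rot_13")
print(cipher)                           # PUNVA BS GUBHTUG
print(codecs.decode(cipher, "rot_13"))  # CHAIN OF THOUGHT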
3. Noisy reasoning: LLMs exhibit bidirectional reasoning: they shift backward N steps for small shift levels, or forward (26 - N) steps for large shift levels, minimizing the number of implicit operations (see the sketch after this tweet).
- But they sometimes mix the two up, shifting forward N steps / backward (26 - N) steps.

6/n

[Image: Normalized frequency distribution of step answers vs. LLM-predicted shift level on the shift cipher task.]
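To see why the two directions are interchangeable, here is a tiny sketch of the arithmetic (an illustration only, not the models' actual procedure): undoing a shift of N by stepping backward N times gives the same letters as stepping forward 26 - N times, so the cheaper direction depends on N.

# Assumes uppercase A-Z input only, for brevity.
def decode_backward(text: str, n: int) -> str:
    # Step each letter back n positions.
    return "".join(chr((ord(c) - ord("A") - n) % 26 + ord("A")) for c in text)

def decode_forward(text: str, n: int) -> str:
    # Equivalently, step each letter forward (26 - n) positions.
    return "".join(chr((ord(c) - ord("A") + (26 - n)) % 26 + ord("A")) for c in text)

# Same answer either way; only the number of single-letter steps differs
# (min(n, 26 - n) is the cheaper path).
assert decode_backward("EPH", 1) == decode_forward("EPH", 1) == "DOG"    # shift 1: backward is cheaper
assert decode_backward("CNF", 25) == decode_forward("CNF", 25) == "DOG"  # shift 25: forward (1 step) is cheaper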
We analyzed o1 across 3 different shift levels and 2 probability bins, without using CoT prompts:
1. The same trend still holds: higher accuracy for rot-13 (memorization) and for the high-probability bin 1.
2. The low-probability bin 5 needs many more reasoning tokens for the same shift level!

7/n

[Image: o1's performance on shift ciphers.]
To summarize, CoT reasoning can be characterized as probabilistic, memorization-influenced noisy reasoning; i.e., LLMs display traits of both memorization and generalization.

Work with the amazing @RTomMcCoy and Tom Griffiths (@cocosci_lab)

To appear in Findings of EMNLP!

8/8
