Zayne Sprague
Sep 19, 8 tweets, 4 min read
To CoT or not to CoT?🤔

300+ experiments with 14 LLMs & systematic meta-analysis of 100+ recent papers

🤯Direct answering is as good as CoT except for math and symbolic reasoning
🤯You don’t need CoT for 95% of MMLU!

CoT mainly helps LLMs track and execute symbolic computation

CoT’s effectiveness in the literature is often based on datasets like MATH and GSM8k. Does it work more broadly?

We went through *all* the papers using CoT from ICLR ’24 and NAACL/EACL ’24 and collected experiments from over 100 of them.

Except for math, logical reasoning, and symbolic/algorithmic reasoning, CoT’s benefits are minor. The outlier tasks usually *do* involve some symbolic reasoning.
But the literature doesn’t include recent models like Llama 3.1. So we tested 14 LLMs 🤖 across 20 datasets in 5 categories: Math, Symbolic, Soft Reasoning, Commonsense, and Knowledge.

CoT helps a lot with symbolic and math tasks, but not with the other categories.
Fun result: for some models, as much as 95% of the performance gain from CoT on MMLU and MMLU Pro comes from math-related questions alone, which are sometimes as little as 10% of the data in MMLU!
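
To make the arithmetic behind a figure like this concrete, here is a minimal sketch of how such a share could be computed. The function name `gain_share` and every number in it are hypothetical, picked only for illustration; they are not results from the paper.

```python
def gain_share(n_subset, n_total, delta_subset, delta_rest):
    """Fraction of the total CoT accuracy gain that comes from one subset.

    n_subset, n_total: question counts (subset vs. whole benchmark).
    delta_subset, delta_rest: CoT-minus-direct accuracy on the subset and
    on the remaining questions, as fractions in [0, 1].
    """
    n_rest = n_total - n_subset
    # Extra questions answered correctly thanks to CoT, per group.
    extra_subset = n_subset * delta_subset
    extra_rest = n_rest * delta_rest
    total = extra_subset + extra_rest
    if total == 0:
        raise ValueError("no overall gain to attribute")
    return extra_subset / total


# Hypothetical benchmark: math is 10% of 1000 questions; CoT adds
# 19 points on math and 0.1 point on everything else.
share = gain_share(100, 1000, 0.19, 0.001)
print(f"{share:.1%} of the CoT gain comes from the math subset")
```

With these made-up inputs, a small subset with a large per-question gain accounts for almost all of the overall improvement, which is the pattern the result above describes.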

You can be selective about when to use CoT and save compute.
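
A minimal sketch of what selective use could look like in practice. The regex heuristic, the function names `needs_cot` and `build_prompt`, and the prompt wording are all assumptions made for illustration; this is not the selection method used in the paper.

```python
import re

# Hypothetical heuristic: treat a question as math/symbolic if it contains
# an equals sign, or two digits joined by an arithmetic operator.
_MATH_PATTERN = re.compile(r"=|\d\s*[-+*/^]\s*\d")


def needs_cot(question: str) -> bool:
    """Return True if the question looks like it involves symbolic computation."""
    return bool(_MATH_PATTERN.search(question))


def build_prompt(question: str) -> str:
    """Ask for step-by-step reasoning only when the heuristic fires."""
    if needs_cot(question):
        return f"{question}\nLet's think step by step."
    return f"{question}\nAnswer directly with the final answer only."


print(build_prompt("What is 12 * 7?"))
print(build_prompt("Who wrote Hamlet?"))
```

The heuristic is deliberately crude (a year range like 1990-1995 would trigger it), so a real router would need to be validated against held-out questions before it is trusted to save compute.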
How do we explain CoT's behavior?

CoT helps by tracking the steps of solving a problem. We compare variants that create explicit plans and then solve them directly, with CoT, or with programmatic execution (Python). CoT improves over direct solving but not over using Python.
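
To illustrate the "plan, then execute with Python" variant, here is a minimal sketch. The plan text, the word problem, and the helper `execute_plan` are hypothetical; the paper's actual prompts and execution setup may differ.

```python
# Hypothetical plan, of the kind a model might write for the question:
# "A shop sells pens at $3 each. How much do 4 pens and a $5 notebook cost?"
PLAN = """
pen_price = 3
num_pens = 4
notebook_price = 5
answer = pen_price * num_pens + notebook_price
"""


def execute_plan(plan_source: str):
    """Run a plan written as Python assignments and return its `answer`."""
    # Empty builtins keep the plan to plain arithmetic. This is NOT a real
    # sandbox; never exec untrusted model output like this in production.
    namespace = {"__builtins__": {}}
    exec(plan_source, namespace)
    return namespace["answer"]


result = execute_plan(PLAN)
print(result)
```

Here the interpreter, not the model, carries out the computation, so the intermediate steps cannot be mis-tracked or mis-computed along the way.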
What does this mean for CoT, reasoning, strawberries…? 🍓

Our results show that prompt-based CoT doesn’t help widely. But we emphasize that this doesn’t rule out fine-tuning for better CoT, search, or multi-agent approaches. More to explore!

Bonus: all these graphs [image]
Check out our paper!

📄 arxiv.org/abs/2409.12183

We give all prompts and model outputs on Huggingface!

huggingface.co/collections/TA…
