300+ experiments with 14 LLMs & systematic meta-analysis of 100+ recent papers
🤯Direct answering is as good as CoT except for math and symbolic reasoning
🤯You don’t need CoT for 95% of MMLU!
CoT mainly helps LLMs track and execute symbolic computation
CoT’s effectiveness in the literature is usually demonstrated on math datasets like MATH and GSM8K. Does it work more broadly?
We went through *all* the papers using CoT from ICLR ‘24 and NAACL/EACL ‘24 and collected experiments from over 100 of them.
Except for math, logical reasoning, and symbolic/algorithmic reasoning, CoT’s benefits are minor. The outlier tasks usually *do* involve some symbolic reasoning.
But the literature doesn’t include recent models like Llama 3.1. So we tested 14 LLMs 🤖 across 20 datasets in 5 categories: Math, Symbolic, Soft Reasoning, Commonsense, and Knowledge.
CoT helps a lot with math and symbolic reasoning, but not with the other categories.
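A minimal sketch of the kind of comparison behind this result, assuming a generic `query_model(prompt)` wrapper and an `extract_answer` parser (both placeholders, not our actual harness):

```python
# Zero-shot direct vs. CoT comparison on one dataset (illustrative sketch).
DIRECT_TEMPLATE = "Question: {q}\nGive only the final answer."
COT_TEMPLATE = "Question: {q}\nThink step by step, then give the final answer."

def accuracy(examples, template, query_model, extract_answer):
    """examples: list of {'question': str, 'answer': str} dicts."""
    correct = 0
    for ex in examples:
        response = query_model(template.format(q=ex["question"]))
        correct += extract_answer(response) == ex["answer"]
    return correct / len(examples)

def cot_gain(examples, query_model, extract_answer):
    """Return (direct accuracy, CoT accuracy, delta) for one dataset."""
    direct = accuracy(examples, DIRECT_TEMPLATE, query_model, extract_answer)
    cot = accuracy(examples, COT_TEMPLATE, query_model, extract_answer)
    return direct, cot, cot - direct
```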
Fun result: for some models, as much as 95% of the performance gain from using CoT on MMLU and MMLU Pro comes from math-related questions alone (which make up as little as 10% of the data in MMLU!).
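Sketch of that decomposition, assuming per-question records with 0/1 correctness under both prompting strategies and an `is_math` flag (the flag could come from a simple heuristic; this is illustrative, not our exact rule):

```python
def math_share_of_gain(records):
    """records: list of {'correct_direct': int, 'correct_cot': int, 'is_math': bool}."""
    total_gain = sum(r["correct_cot"] - r["correct_direct"] for r in records)
    math_gain = sum(r["correct_cot"] - r["correct_direct"] for r in records if r["is_math"])
    share = math_gain / total_gain if total_gain else 0.0
    math_fraction = sum(r["is_math"] for r in records) / len(records)
    return share, math_fraction  # e.g. ~0.95 of the gain from ~0.10 of the questions
```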
You can be selective about when to use CoT and save compute.
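One way to act on that, sketched below: only pay for CoT tokens when a question looks mathematical/symbolic, and answer everything else directly. The keyword heuristic here is just an illustration, not a method from the paper.

```python
import re

# Very rough "does this look like math/symbolic reasoning?" check (illustrative only).
MATH_HINTS = re.compile(r"[=+\-*/^<>]|\d+\s*%|\bsolve\b|\bcompute\b|\bequation\b", re.IGNORECASE)

def answer(question, query_model):
    """Use CoT only when it is likely to pay off; answer directly otherwise."""
    if MATH_HINTS.search(question):
        prompt = f"Question: {question}\nThink step by step, then give the final answer."
    else:
        prompt = f"Question: {question}\nGive only the final answer."
    return query_model(prompt)
```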
How do we explain CoT's behavior?
CoT helps by tracking the intermediate steps of a solution. We compare variants that first generate an explicit plan and then either solve it directly, solve it with CoT, or execute it programmatically (Python). CoT improves over direct solving, but not over executing the plan with Python.
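A rough sketch of the three variants; the prompt wording and the `run_python` sandbox are placeholders, not our exact setup:

```python
# Each variant first asks for an explicit plan, then differs in how the plan is executed.
PLAN = "Write a numbered plan for solving this problem, without solving it:\n{q}"
DIRECT = "Plan:\n{plan}\n\nGive only the final answer to: {q}"
COT = "Plan:\n{plan}\n\nCarry out the plan step by step in words, then give the final answer to: {q}"
PYTHON = "Plan:\n{plan}\n\nWrite Python code that implements the plan and prints the final answer to: {q}"

def solve(question, variant, query_model, run_python):
    plan = query_model(PLAN.format(q=question))
    if variant == "direct":
        return query_model(DIRECT.format(plan=plan, q=question))
    if variant == "cot":
        return query_model(COT.format(plan=plan, q=question))
    # variant == "python": execute the generated program (sandbox it in practice!)
    code = query_model(PYTHON.format(plan=plan, q=question))
    return run_python(code)
```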
What does this mean for CoT, reasoning, strawberries…? 🍓
Our results show that prompt-based CoT doesn’t help widely. But we emphasize that this doesn’t rule out fine-tuning for better CoT, search, or multi-agent approaches. More to explore!
Bonus: all these graphs
Check out our paper! 📄
We release all prompts and model outputs on Hugging Face!