Research Scientist at @Apple, prev @DeepMind, prev @GeorgiaTech
Jun 5 • 8 tweets • 4 min read
🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute at pattern matching?
The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks, but we found their fundamental limitations are more severe than expected.
In our latest work, we compared each "thinking" LRM with its "non-thinking" LLM twin. Unlike most prior work, which only measures final-answer performance, we analyzed their actual reasoning traces, looking inside their long "thoughts" (a toy sketch of this comparison setup appears after this tweet). Our analysis reveals several interesting results ⬇️
📄 machinelearning.apple.com/research/illus…
Work led by @ParshinShojaee and @i_mirzadeh, and with @KeivanAlizadeh2, @mchorton1991, Samy Bengio.
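To make the pairing concrete, here is a minimal sketch of the kind of comparison described above: the same problems and token budget go to a "thinking" model and its "non-thinking" twin, and both the final answer and the full reasoning trace are kept for later analysis. The `query_model` helper and the model names are placeholders for illustration, not the paper's actual evaluation harness.

```python
# Illustrative sketch only: `query_model` is a hypothetical helper and the
# model identifiers are placeholders, not the exact setup from the paper.
def query_model(model_name: str, prompt: str, max_tokens: int) -> dict:
    """Return {'trace': <chain-of-thought text>, 'answer': <final answer>}."""
    raise NotImplementedError("plug in your own inference API here")

MODEL_PAIRS = [
    # ("thinking" LRM, "non-thinking" LLM twin)
    ("claude-3.7-sonnet-thinking", "claude-3.7-sonnet"),
    ("deepseek-r1", "deepseek-v3"),
]

def compare_pair(problems, pair, token_budget=32_000):
    """Run both variants on the same problems with the same token budget,
    keeping full reasoning traces rather than only final-answer accuracy."""
    thinking, twin = pair
    records = []
    for prob in problems:
        out_think = query_model(thinking, prob["prompt"], token_budget)
        out_plain = query_model(twin, prob["prompt"], token_budget)
        records.append({
            "problem": prob["id"],
            "thinking_correct": out_think["answer"] == prob["gold"],
            "plain_correct": out_plain["answer"] == prob["gold"],
            # kept for trace analysis, e.g. where a correct solution first appears
            "thinking_trace": out_think["trace"],
        })
    return records
```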
🧵 2/8 The evaluation problem for reasoning 📊
Most studies judge reasoning by final-answer accuracy on popular math & coding benchmarks. Those benchmarks are:
- sometimes leaked into training data,
- fixed in difficulty and hard to control,
- blind to what happens inside the chain-of-thought.
So we can NOT easily tell whether improvements come from better reasoning, from more compute plus memorization, or from synthesized self-reflection. We needed a cleaner experimental and evaluation setup.
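One ingredient of such a setup is an environment whose difficulty can be dialed precisely and whose answers can be checked exactly, with no leakage risk. As a toy illustration (not the paper's actual code), here is a Tower of Hanoi instance generator and move checker where the number of disks is the difficulty knob.

```python
# Illustrative sketch: a controllable puzzle whose difficulty is a single knob
# (number of disks) and whose solutions can be verified exactly.
def hanoi_prompt(n: int) -> str:
    """Build a Tower of Hanoi instance with n disks (difficulty knob)."""
    return (
        f"Solve Tower of Hanoi with {n} disks on pegs A, B, C. "
        "All disks start on A; move them all to C. "
        "Answer as a list of moves like ['A->C', 'A->B', ...]."
    )

def check_solution(n: int, moves: list[str]) -> bool:
    """Simulate the proposed moves and verify the rules and the goal state."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}   # bottom to top
    for move in moves:
        src, dst = (s.strip() for s in move.split("->"))
        if not pegs[src]:
            return False                       # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks on C, in order

# Example usage: sweep difficulty, prompt a model, score its answers exactly.
# for n in range(3, 12):
#     prompt = hanoi_prompt(n)
#     ...
```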
Oct 10, 2024 • 13 tweets • 6 min read
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models (Llama, Phi, Gemma, and Mistral) and leading closed models, including the recent OpenAI GPT-4o and o1 series. arxiv.org/pdf/2410.05229
Work done with @i_mirzadeh, @KeivanAlizadeh2, Hooman Shahrokhi, Samy Bengio, @OncelTuzel.
#LLM #Reasoning #Mathematics #AGI #Research #Apple

2/ When OpenAI released GSM8K ~3 years ago, GPT-3 (175B) scored 35% on the GSM8K test set. Today, models with ~3B parameters are surpassing 85%, and larger ones are hitting >95%. But has model 'reasoning' really improved? How much of this is genuine #logical/#symbolic reasoning, vs. #pattern_recognition, inadvertent data #contamination, or #overfitting?
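One way to probe this, in the spirit of the preprint, is to turn a fixed benchmark question into a symbolic template so that names and numbers can be resampled and the gold answer recomputed per instance; if accuracy drops on such variants, that points to pattern matching rather than robust reasoning. Below is a toy illustration of the templating idea with a made-up question, not the actual GSM-Symbolic templates or pipeline.

```python
import random

# Toy illustration of symbolic templating: names and numbers are variables,
# and the gold answer is recomputed for each sampled instance. The question
# and value ranges here are made up for illustration.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{name} then gives away {c} apples. How many apples are left?"
)
NAMES = ["Sophie", "Liam", "Ava", "Noah"]

def sample_instance(seed: int) -> dict:
    rng = random.Random(seed)
    a, b = rng.randint(5, 40), rng.randint(5, 40)
    c = rng.randint(1, a + b)                  # keep the answer non-negative
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b, c=c)
    return {"question": question, "gold": a + b - c}

# Many instances of the "same" problem let us measure how much accuracy moves
# when only surface details (names) or numeric values change.
variants = [sample_instance(seed) for seed in range(50)]
```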