Research Scientist @allen_ai, PhD in NLP 🤖 UofA. Ex @GoogleDeepMind @MSFTResearch @MilaQuebec 🚨🚨 NEW BLOG about LLM reasoning: https://t.co/Ox0iOaqY7e
Jun 24 • 11 tweets • 6 min read
📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies?
Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math, yet still failed at simple arithmetic 😬
We built a benchmark to find out → OMEGA Ω 📐
💥 We found that, although very powerful, RL-trained models struggle to compose skills and to invent new strategies not seen during training. 👇
work w. @UCBerkeley @allen_ai
A thread on what we learned 🧵
🧠 Inspired by Boden’s creativity framework (1998), OMEGA tests:
🧪 Exploratory: Can the model adapt a known algorithm to solve harder variants within the same problem family?
🧩 Compositional: Can the model compose familiar skills to solve a novel problem that requires the synergy of those skills?
💡 Transformative: Can the model invent a new, unconventional strategy by moving beyond familiar approaches to solve problems more effectively? (toy eval sketch below 👇)
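A toy sketch of how an eval could be organized along these three axes. This is not the released OMEGA code; the `Problem` class, its field names, and the split structure are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    answer: str
    family: str   # problem family, e.g. "matrix_rank" (illustrative name)
    axis: str     # "exploratory" | "compositional" | "transformative"

def accuracy(answer_fn, problems):
    """Exact-match accuracy of a model's answers on a list of problems."""
    return sum(answer_fn(p.question).strip() == p.answer for p in problems) / len(problems)

def evaluate_by_axis(answer_fn, problems):
    """Score the model separately on each generalization axis."""
    return {
        axis: accuracy(answer_fn, [p for p in problems if p.axis == axis])
        for axis in ("exploratory", "compositional", "transformative")
        if any(p.axis == axis for p in problems)
    }
```

Splitting the score this way is what lets a benchmark separate "harder version of a familiar problem" from "needs two familiar skills combined" from "needs a genuinely new strategy".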
Feb 3 • 8 tweets • 4 min read
📢 DeepSeek R1 still cannot solve multiplication with 100% accuracy🫠😬
Though it achieves high scores on hard math benchmarks (AIME, MATH-500), extremely difficult physics, biology, and chemistry problems (GPQA Diamond), and coding challenges (LiveCode, CodeForces), all of which require advanced problem-solving skills, it still struggles to carry out the simple multiplication algorithm [1/8].
It's impressive that the model can solve, e.g., 15-digit × 5-digit or 17-digit × 4-digit multiplications with 100% accuracy. I expected this improvement, since the model can now backtrack and correct its reasoning, but it still seems insufficient.
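For reference, the "simple multiplication algorithm" in question is grade-school long multiplication: digit-by-digit partial products with carries. A minimal sketch in Python (my own illustration, not code from the thread):

```python
def long_multiply(a: int, b: int) -> int:
    """Grade-school long multiplication: partial products with carries."""
    digits_a = [int(d) for d in str(a)][::-1]   # least-significant digit first
    digits_b = [int(d) for d in str(b)][::-1]
    result = [0] * (len(digits_a) + len(digits_b))
    for i, da in enumerate(digits_a):
        carry = 0
        for j, db in enumerate(digits_b):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10           # keep one digit in place
            carry = total // 10                  # push the rest up one position
        result[i + len(digits_b)] += carry
    # Drop leading zeros and reassemble the number.
    return int("".join(map(str, result[::-1])).lstrip("0") or "0")

# 15-digit × 5-digit, the size range mentioned above
assert long_multiply(123456789012345, 54321) == 123456789012345 * 54321
```

Every step is a bounded, local operation, which is exactly why failures at scale point to execution of the procedure rather than missing knowledge.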
DeepSeek-R1-Distill-Llama-70B, on the other hand, performs poorly on the same examples, despite excelling on extremely hard math and coding problems (as shown in Table 5 of the DeepSeek paper).
I used zero-shot prompting with the prompt: "What's x times y? Think step by step before giving the answer." I sampled 10 examples per problem size.
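A hedged sketch of that evaluation loop. Assumptions: `query_model` stands in for whatever model/API call is used, and the last-number answer extraction is an illustrative heuristic, not the exact procedure from the thread:

```python
import random
import re

def make_problem(n_digits: int, m_digits: int):
    """Sample one random n-digit x m-digit multiplication problem."""
    x = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    y = random.randint(10 ** (m_digits - 1), 10 ** m_digits - 1)
    prompt = f"What's {x} times {y}? Think step by step before giving the answer."
    return prompt, x * y

def multiplication_accuracy(query_model, n_digits, m_digits, samples=10):
    """Zero-shot accuracy over `samples` random problems of one size."""
    correct = 0
    for _ in range(samples):
        prompt, answer = make_problem(n_digits, m_digits)
        reply = query_model(prompt)                       # placeholder for the LLM call
        numbers = re.findall(r"\d+", reply.replace(",", ""))
        if numbers and int(numbers[-1]) == answer:        # treat the last number as the final answer
            correct += 1
    return correct / samples
```

Sweeping `n_digits` and `m_digits` over a grid is what produces the accuracy-by-size picture described above.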
May 31, 2023 • 7 tweets • 5 min read
🚀📢 GPT models have blown our minds with their astonishing capabilities. But do they truly acquire the ability to perform reasoning tasks that humans find easy to execute? NO⛔️
We investigate the limits of Transformers *empirically* and *theoretically* on compositional tasks🔥
We find that GPT-3, ChatGPT, and GPT-4 cannot fully solve compositional tasks, even with in-context learning, fine-tuning, or scratchpads. To understand when models succeed, and the nature of their failures, we represent a model's reasoning as a computation graph.
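A rough illustration (my own sketch, not the paper's code) of what representing reasoning as a computation graph can look like, using multi-digit multiplication as the compositional task: nodes are primitive steps, edges carry intermediate results into later steps, and the graph's depth and width can serve as proxies for task complexity.

```python
def multiplication_graph(x: int, y: int):
    """Build a tiny computation DAG for x * y:
    digit-level partial products feed a final summation node."""
    nodes, edges = {}, []
    partials = []
    for i, dy in enumerate(str(y)[::-1]):            # one partial product per digit of y
        name = f"partial_{i}"
        nodes[name] = {"op": "multiply_by_digit", "value": x * int(dy) * 10 ** i}
        partials.append(name)
    nodes["sum"] = {"op": "add_partials", "value": x * y}
    edges += [(p, "sum") for p in partials]          # intermediate results flow into the sum
    return nodes, edges

nodes, edges = multiplication_graph(47, 83)
# Larger inputs -> more nodes and longer dependency chains,
# i.e. more opportunities for a single sub-step error to derail the answer.
```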