Research Scientist @allen_ai, PhD in NLP 🤖 UofA. Ex @GoogleDeepMind @MSFTResearch @MilaQuebec 🚨🚨 NEW BLOG about LLM reasoning: https://t.co/Ox0iOaqY7e
Jun 24 • 11 tweets • 6 min read
📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies?
Remember how DeepSeek R1 and o1 impressed us on Olympiad-level math, yet still failed at simple arithmetic 😬
We built a benchmark to find out → OMEGA Ω 📐
💥 We found that, although very powerful, RL-trained models struggle to compose skills and to invent new strategies not seen during training. 👇
work w. @UCBerkeley @allen_ai
A thread on what we learned 🧵
🧠 Inspired by Boden’s creativity framework (1998), OMEGA tests:
🧪 Exploratory: Can the model adapt a known algorithm to solve harder variants within the same problem family?
🧩 Compositional: Can the model compose familiar skills to solve a novel problem that requires the synergy of those skills?
💡 Transformative: Can the model invent a new, unconventional strategy by moving beyond familiar approaches to solve problems more effectively? (toy eval sketch below 👇)
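A toy sketch of how an eval could be organized along these three axes. This is not the released OMEGA code; the `Problem` class, its field names, and the split structure are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    answer: str
    family: str   # problem family, e.g. "matrix_rank" (illustrative name)
    axis: str     # "exploratory" | "compositional" | "transformative"

def accuracy(answer_fn, problems):
    """Exact-match accuracy of a model's answers on a list of problems."""
    return sum(answer_fn(p.question).strip() == p.answer for p in problems) / len(problems)

def evaluate_by_axis(answer_fn, problems):
    """Score the model separately on each generalization axis."""
    return {
        axis: accuracy(answer_fn, [p for p in problems if p.axis == axis])
        for axis in ("exploratory", "compositional", "transformative")
        if any(p.axis == axis for p in problems)
    }
```

Splitting the score this way is what lets a benchmark separate "harder version of a familiar problem" from "needs two familiar skills combined" from "needs a genuinely new strategy".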
Feb 3 • 8 tweets • 4 min read
📢 DeepSeek R1 still cannot solve multiplication with 100% accuracy🫠😬
Though it achieves high scores on hard math benchmarks (AIME, MATH-500), extremely difficult physics, biology, and chemistry problems (GPQA Diamond), and coding challenges (LiveCode, CodeForces), all of which require advanced problem-solving skills, it still struggles to carry out the simple multiplication algorithm [1/8].
It's impressive that the model can solve, e.g., 15-digit × 5-digit or 17-digit × 4-digit multiplications with 100% accuracy. I expected this improvement, since the model can now backtrack and correct its reasoning, but it still seems insufficient.
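For reference, the "simple multiplication algorithm" in question is grade-school long multiplication: digit-by-digit partial products with carries. A minimal sketch in Python (my own illustration, not code from the thread):

```python
def long_multiply(a: int, b: int) -> int:
    """Grade-school long multiplication: partial products with carries."""
    digits_a = [int(d) for d in str(a)][::-1]   # least-significant digit first
    digits_b = [int(d) for d in str(b)][::-1]
    result = [0] * (len(digits_a) + len(digits_b))
    for i, da in enumerate(digits_a):
        carry = 0
        for j, db in enumerate(digits_b):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10           # keep one digit in place
            carry = total // 10                  # push the rest up one position
        result[i + len(digits_b)] += carry
    # Drop leading zeros and reassemble the number.
    return int("".join(map(str, result[::-1])).lstrip("0") or "0")

# 15-digit × 5-digit, the size range mentioned above
assert long_multiply(123456789012345, 54321) == 123456789012345 * 54321
```

Every step is a bounded, local operation, which is exactly why failures at scale point to execution of the procedure rather than missing knowledge.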
DeepSeek-R1-Distill-Llama-70B, on the other hand, performs poorly on the same examples, despite excelling on extremely hard math and coding problems (as shown in Table 5 of the DeepSeek paper).
I used zero-shot prompting with the prompt: "What's x times y? Think step by step before giving the answer." I sampled 10 examples per problem size.
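A hedged sketch of that evaluation loop. Assumptions: `query_model` stands in for whatever model/API call is used, and the last-number answer extraction is an illustrative heuristic, not the exact procedure from the thread:

```python
import random
import re

def make_problem(n_digits: int, m_digits: int):
    """Sample one random n-digit x m-digit multiplication problem."""
    x = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    y = random.randint(10 ** (m_digits - 1), 10 ** m_digits - 1)
    prompt = f"What's {x} times {y}? Think step by step before giving the answer."
    return prompt, x * y

def multiplication_accuracy(query_model, n_digits, m_digits, samples=10):
    """Zero-shot accuracy over `samples` random problems of one size."""
    correct = 0
    for _ in range(samples):
        prompt, answer = make_problem(n_digits, m_digits)
        reply = query_model(prompt)                       # placeholder for the LLM call
        numbers = re.findall(r"\d+", reply.replace(",", ""))
        if numbers and int(numbers[-1]) == answer:        # treat the last number as the final answer
            correct += 1
    return correct / samples
```

Sweeping `n_digits` and `m_digits` over a grid is what produces the accuracy-by-size picture described above.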
May 31, 2023 • 7 tweets • 5 min read
🚀📢 GPT models have blown our minds with their astonishing capabilities. But do they truly acquire the ability to perform reasoning tasks that humans find easy to execute? NO⛔️
We investigate the limits of Transformers *empirically* and *theoretically* on compositional tasks🔥
We find that GPT-3, ChatGPT, and GPT-4 cannot fully solve compositional tasks, even with in-context learning, fine-tuning, or scratchpads. To understand when models succeed, and the nature of their failures, we represent a model's reasoning as a computation graph.
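A rough illustration (my own sketch, not the paper's code) of what representing reasoning as a computation graph can look like, using multi-digit multiplication as the compositional task: nodes are primitive steps, edges carry intermediate results into later steps, and the graph's depth and width can serve as proxies for task complexity.

```python
def multiplication_graph(x: int, y: int):
    """Build a tiny computation DAG for x * y:
    digit-level partial products feed a final summation node."""
    nodes, edges = {}, []
    partials = []
    for i, dy in enumerate(str(y)[::-1]):            # one partial product per digit of y
        name = f"partial_{i}"
        nodes[name] = {"op": "multiply_by_digit", "value": x * int(dy) * 10 ** i}
        partials.append(name)
    nodes["sum"] = {"op": "add_partials", "value": x * y}
    edges += [(p, "sum") for p in partials]          # intermediate results flow into the sum
    return nodes, edges

nodes, edges = multiplication_graph(47, 83)
# Larger inputs -> more nodes and longer dependency chains,
# i.e. more opportunities for a single sub-step error to derail the answer.
```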