Nouha Dziri
Research Scientist @allen_ai, PhD in NLP 🤖 UofA. Ex @GoogleDeepMind @MSFTResearch @MilaQuebec 🚨🚨 NEW BLOG about o1 models: https://t.co/PPVoY25Ofe
Feb 3 8 tweets 4 min read
📢 DeepSeek R1 still cannot solve multiplication with 100% accuracy🫠😬

Though it achieves high scores on hard math questions (AIME, MATH-500), extremely difficult physics, biology, and chemistry problems (GPQA Diamond), and coding challenges (LiveCode, CodeForces), all of which require advanced problem-solving skills, it struggles with a simple multiplication algorithm [1/8].

It's impressive that the model can solve, e.g., 15-digit × 5-digit or 17-digit × 4-digit multiplication with 100% accuracy. I expected this improvement since the model can now backtrack and correct its reasoning, but that still seems insufficient.

DeepSeek-R1-Distill-Llama-70B, on the other hand, performs poorly on the same examples, despite excelling on extremely hard math and coding problems (as shown in Table 5 of the DeepSeek paper).

I evaluated zero-shot with the prompt: "What’s x times y? Think step by step before giving the answer." I sampled 10 examples per problem size.
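A minimal sketch of what such an evaluation loop might look like, using the prompt and the 10-samples-per-size setting above. Everything else is an assumption for illustration: `ask_model` is a hypothetical callable that sends a prompt to a model and returns its text response, the digit range is arbitrary, and taking the last integer in the response as the final answer is a simplification, not necessarily how the results in this thread were scored.

```python
import random
import re

# Prompt wording taken from the thread; x and y are filled in per problem.
PROMPT = "What’s {x} times {y}? Think step by step before giving the answer."


def random_n_digit(n, rng):
    """Return a random integer with exactly n digits (no leading zero)."""
    if n == 1:
        return rng.randint(1, 9)
    return rng.randint(10 ** (n - 1), 10 ** n - 1)


def make_problems(max_digits, per_size=10, seed=0):
    """Build per_size problems for every (a-digit x b-digit) size."""
    rng = random.Random(seed)
    problems = []
    for a in range(1, max_digits + 1):
        for b in range(1, max_digits + 1):
            for _ in range(per_size):
                x, y = random_n_digit(a, rng), random_n_digit(b, rng)
                problems.append({
                    "size": (a, b),
                    "prompt": PROMPT.format(x=x, y=y),
                    "answer": x * y,
                })
    return problems


def extract_final_number(text):
    """Assumed scoring rule: the last integer in the response is the answer."""
    matches = re.findall(r"\d[\d,]*", text)
    return int(matches[-1].replace(",", "")) if matches else None


def accuracy_by_size(problems, ask_model):
    """Exact-match accuracy per problem size; ask_model is hypothetical."""
    totals, correct = {}, {}
    for p in problems:
        size = p["size"]
        totals[size] = totals.get(size, 0) + 1
        if extract_final_number(ask_model(p["prompt"])) == p["answer"]:
            correct[size] = correct.get(size, 0) + 1
    return {size: correct.get(size, 0) / n for size, n in totals.items()}
```

Scoring by exact match on the full product is deliberately strict: a single wrong digit counts as a failure, which is what "100% accuracy" on multiplication requires.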
May 31, 2023 7 tweets 5 min read
🚀📢 GPT models have blown our minds with their astonishing capabilities. But, do they truly acquire the ability to perform reasoning tasks that humans find easy to execute? NO⛔️

We investigate the limits of Transformers *empirically* and *theoretically* on compositional tasks🔥

We find that GPT-3, ChatGPT, and GPT-4 cannot fully solve compositional tasks even with in-context learning, fine-tuning, or using scratchpads. To understand when models succeed, and the nature of the failures, we represent a model’s reasoning through computation graphs.