Ruben Hassid
Jun 7 · 14 tweets · 3 min read
BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all.

They just memorize patterns really well.

Here's what Apple discovered:

(hint: we're not as close to AGI as the hype suggests)
Instead of using the same old math tests that AI companies love to brag about, Apple created fresh puzzle games.

They tested Claude Thinking, DeepSeek-R1, and o3-mini on problems these models had never seen before.

The result ↓
All "reasoning" models hit a complexity wall where they completely collapse to 0% accuracy.

No matter how much computing power you give them, they can't solve harder problems.
As problems got harder, these "thinking" models actually started thinking less.

They used fewer tokens and gave up faster, despite having unlimited budget.
Apple researchers even tried giving the models the exact solution algorithm.

Like handing someone step-by-step instructions to bake a cake.

The models still failed at the same complexity points.

They can't even follow directions consistently.
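For context, the algorithm the researchers handed over for Tower of Hanoi is genuinely simple. A minimal Python sketch (my own illustration of the classic recursive procedure, not Apple's exact prompt):

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the move list that solves an n-disk Tower of Hanoi."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # move n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top
    return moves

print(len(hanoi(3)))  # 7 — the minimum is always 2**n - 1
```

Following this recipe is pure bookkeeping, which is what makes the models' failure at it notable.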
The research revealed three regimes:

• Low complexity: Regular models actually win
• Medium complexity: "Thinking" models show some advantage
• High complexity: Everything breaks down completely

Most problems fall into that third category.
Apple discovered that these models are not reasoning at all, but instead doing sophisticated pattern matching that works great until patterns become too complex.

Then they fall apart like a house of cards.
If these models were truly "reasoning," they should get better with more compute and clearer instructions.

Instead, they hit hard walls and start giving up.

Is that intelligence or memorization hitting its limits?
This research suggests we're not as close to AGI as the hype implies.

Current "reasoning" breakthroughs may be hitting fundamental walls that can't be solved by just adding more data or compute.
Models could handle 100+ moves in Tower of Hanoi puzzles but failed after just 4 moves in River Crossing puzzles.

This suggests they memorized Tower of Hanoi solutions during training but can't actually reason.
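For a sense of scale: the classic river-crossing puzzle (missionaries and cannibals, boat holds two — an illustrative variant, not necessarily Apple's exact setup) is fully solvable by a few lines of brute-force search:

```python
from collections import deque

def solve():
    """BFS over (missionaries, cannibals, boat) counts on the left bank."""
    start, goal = (3, 3, 1), (0, 0, 0)
    valid = lambda m, c: (0 <= m <= 3 and 0 <= c <= 3 and
                          (m == 0 or m >= c) and           # left bank safe
                          (3 - m == 0 or 3 - m >= 3 - c))  # right bank safe
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        (m, c, b), path = frontier.popleft()
        if (m, c, b) == goal:
            return path
        for dm, dc in [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]:
            nm, nc = (m - dm, c - dc) if b else (m + dm, c + dc)
            nxt = (nm, nc, 1 - b)
            if valid(nm, nc) and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))

print(len(solve()))  # 11 crossings end to end
```

A puzzle this small is exhaustively searchable, which makes a collapse after a handful of moves hard to square with genuine reasoning.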
While AI companies celebrate their models "thinking," Apple basically said "Everyone's celebrating fake reasoning."

The industry is chasing metrics that don't measure actual intelligence.
Apple's researchers used controllable puzzle environments specifically because:

• They avoid data contamination
• They require pure logical reasoning
• They can scale complexity precisely
• They reveal where models actually break

Smart experimental design if you ask me.
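Part of what makes these environments clean is that any proposed answer can be checked mechanically, with difficulty controlled by a single dial (the disk count). A hypothetical checker sketch (pegs labeled 0-2, moves as (source, target) pairs — my own notation, not the paper's):

```python
def verify_hanoi(n, moves):
    """Check that a move sequence legally solves n-disk Tower of Hanoi."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 starts with disks n..1
    for src, dst in moves:
        if not pegs[src]:
            return False                     # nothing to move from this peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                     # bigger disk placed on smaller
        pegs[dst].append(disk)
    return pegs[2] == list(range(n, 0, -1))  # all disks stacked on peg 2

print(verify_hanoi(2, [(0, 1), (0, 2), (1, 2)]))  # True
```

No human grading, no benchmark leakage: a model's output either solves the instance or it doesn't.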
What do you think?

Is Apple just "coping" because they've been outpaced in AI developments over the past two years?

Or is Apple correct?

Comment below and I'll respond to all.
If you found this thread valuable:

1. Follow me @RubenHssd for more threads on what's happening in AI and its implications.

2. RT the first tweet
