Dan Schwarz
Feb 27 · 11 tweets · 3 min read
Ok, this thread is long overdue. Now that everyone sees that GPT-4.5 is a disappointment -

How has OpenAI underperformed so much on the intelligence of their core models?

They've been faring worse and worse since the day GPT-4 came out. Buckle in for the tale:
First, ChatGPT in Nov 2022 and GPT-4 in Mar 2023 were amazing, world-changing innovations. Full credit there.

Shortly after GPT-4, forecasts of GPT-5's release began. The initial view was mid-2024. The expected date has slipped later and later ever since:
OpenAI did release GPT-4-Turbo in Dec 2023, and GPT-4o in May 2024.

Way faster and cheaper, but hardly better. Fine - for almost a full year, GPT-4 was king. And so OpenAI was still king.

But Anthropic released Claude-3-Opus on March 4, 2024 and took the crown.
Claude-3-Opus was immediately and clearly smarter than GPT-4 class models. I mean real intelligence here, not usability or whatnot.

(We at futuresearch.ai switched our most important operations over, despite the cost.)

As we know, @AnthropicAI was just getting started:
3 months later: Claude-3.5-sonnet, big intelligence jump
4 months later: Claude-3.6-sonnet, big intelligence jump
4 months later: Claude-3.7-sonnet, big intelligence jump

Even Google caught up and surpassed OpenAI in core LLMs during this time. Google!!
But - you say - what about thinking models? Isn't OpenAI still king there?

I don't think so!

In Sept 2024, OpenAI "released" o1-preview. It was expensive, slow, and not widely available.

And it was worse than Claude on average, at least in our evals: futuresearch.ai/llm-agent-eval
But that was a preview. o1 was the king of thinking models, right?

The huge cost and latency increase over Sonnet made it usable only in niche cases. Who actually used it in production? Nobody I know.

Claude-3.7-Sonnet-Thinking might have dethroned it anyway (too soon to say).
And o3? Well o3 hasn't been released!

o3 exists only in OpenAI Deep Research. Which is an impressive report writer, but not very intelligent, at least not on our evals:

Outside Deep Research, o3 is just marketing. futuresearch.ai/oaidr-feb-2025
But, but, but - you say - what about other modalities? What about OpenAI Whisper for audio, Dalle-3 for images, or SORA for video?

None of them are state of the art!

Take SORA for example, its own story of disappointment and delay:
SORA was teased in Feb 2024. In March, Mira Murati said release would come "possibly before summer".

More teasers. In Sept it was still in private beta.

Finally released in Dec (almost 10 months after the teaser!), and by then it was not a top-3 video model.
As I keep saying: OpenAI lost the mandate of heaven.

Why? A thread for another day. Maybe because most of their top researchers left, first to found Anthropic, then again after the board coup.

Disagree? I'll take bets on whether OpenAI will ever reclaim the top spot.
