Dan Schwarz
Feb 27 · 11 tweets · 3 min read
Ok, this thread is long overdue. Now that everyone sees that GPT-4.5 is a disappointment -

How has OpenAI underperformed so much on the intelligence of their core models?

They've been faring worse and worse since the day GPT-4 came out. Buckle in for the tale:
First, ChatGPT in Nov 2022 and GPT-4 in Mar 2023 were amazing, world-changing innovations. Full credit there.

Shortly after GPT-4, forecasts of GPT-5's release began. The initial view was mid-2024. The expected date has slipped later and later ever since:
OpenAI did release GPT-4-Turbo in Dec 2023, and GPT-4o in May 2024.

Way faster and cheaper, but hardly better. Fine - for almost a full year, GPT-4 was king. And so OpenAI was still king.

But Anthropic released Claude-3-Opus on March 4, 2024 and took the crown.
Claude-3-Opus was immediately and clearly smarter than GPT-4 class models. I mean real intelligence here, not usability or whatnot.

(We at futuresearch.ai switched our most important operations over, despite the cost.)

As we know, @AnthropicAI was just getting started:
3 months later: Claude-3.5-sonnet, big intelligence jump
4 months later: Claude-3.6-sonnet, big intelligence jump
4 months later: Claude-3.7-sonnet, big intelligence jump

Even Google caught up and surpassed OpenAI in core LLMs during this time. Google!!
But - you say - what about thinking models? Isn't OpenAI still king there?

I don't think so!

In Sept 2024, OpenAI "released" o1-preview. It was expensive, slow, and not widely available.

And it was worse than Claude on average, at least in our evals: futuresearch.ai/llm-agent-eval
But that was a preview. o1 was the king of thinking models, right?

The huge cost and latency increase over Sonnet made it usable only in niche cases. Who actually used it in production? Nobody I know.

Claude-3.7-Sonnet-Thinking might have dethroned it anyway (too soon to say).
And o3? Well o3 hasn't been released!

o3 exists only in OpenAI Deep Research. Which is an impressive report writer, but not very intelligent, at least not on our evals:

Outside Deep Research, o3 is just marketing. futuresearch.ai/oaidr-feb-2025
But, but, but - you say - what about other modalities? What about OpenAI Whisper for audio, Dalle-3 for images, or SORA for video?

None of them are state of the art!

Take SORA for example, its own story of disappointment and delay:
SORA was teased in Feb 2024. In March, Mira Murati said release would come "possibly before summer".

More teasers. In Sept it was still in private beta.

Finally released in Dec (almost 10 months after the teaser!), and by then it was not a top-3 video model.
As I keep saying: OpenAI lost the mandate of heaven.

Why? A thread for another day. Maybe because most of their top researchers left, first to found Anthropic, then again after the board coup.

Disagree? I'll take bets on whether OpenAI will ever reclaim the top spot.
