Elon claims that Grok 4 is smarter than almost all grad students in all disciplines simultaneously.
100x more training than Grok 2.
10x more compute on RL than any of the models out there.
Performance on Humanity's Last Exam
Elon: "Grok 4 is post-grad level in everything!"
Scaling HLE - Training
More compute, higher intelligence.
(no tools)
With native tool calling, Grok 4 increases the performance significantly.
Look at those curves!
It's important to give AI the right tools. The scaling is clear. Crazy!
Reliable signals are key to making RL work.
There is still the challenge of data.
Elon: "Ultimate reasoning test is AI operating in reality."
Scaling test-time compute
More than 50% of the text-only subset of the HLE problems are solved!
The curves keep getting more ridiculous.
Grok 4 is the single-agent version.
Grok 4 Heavy is the multi-agent version.
Multi-agent systems are no joke!
Grok 4 is being used to predict the World Series champions for this year.
These are the interesting tasks that reasoning models need to be tested on. On actual real-world events.
A visualization of two black holes colliding.
Grok 4 uses all kinds of references like papers, reads PDFs, reasons about the details of the simulation, and what data to use.
The example shows a summary of the timeline/changes and score announcements in the HLE.
That's pretty cool!
Multi-modal performance
Grok 4 Heavy performance is higher than Grok 4, but needs to be improved further. It's one of the weaknesses, according to the team.
Performance on Reasoning benchmarks.
Perfect score on AIME25!
Leaps are crazy compared to the last best model on these tasks.
Where to test the models.
Available as SuperGrok Heavy tier.
$30/m for Super Grok
$300/m for SuperGrok Heavy.
Voice updates included, too!
Grok feels snappier and is designed to be more natural.
- 2x faster
- 5 voices
- 10x daily user seconds
ARC-AGI
Grok 4 on ARC-AGI v2 (private subset)
It breaks the 10% barrier (15.9%).
2x the second place, which is the Claude Opus 4 model.
Grok 4 on Vending Bench
Grok 4 gets the #1 spot.
Double the net worth of Claude Opus 4.
Grok 4 models are available via the xAI API.
256K context window.
Real-time data search.
Grok 4 for Gaming!
Video understanding is an area the team is improving, so it will get better.
What is next?
Smart and fast will be the focus.
Coding models are also a big focus.
More capable multi-modal agents are coming too.
Video generation models are also on the horizon.
@elonmusk and the @xai team really cooked with Grok 4. All very exciting to see focus on AI for reality, truth-seeking, and unlocking multi-modal agents next.
• • •
Missing some Tweet in this thread? You can try to
force a refresh
The spec-init slash command prompt, if you want to try it:
"Your task is to first help me build a spec for my new project in ARGUMENT.
Use the AskUserQuestion Tool to help build the spec in ARGUMENT by interviewing me and gathering requirements and details about the project implementation, UI & UX, tech stack, concerns, tradeoffs, etc.
Make sure questions are not obvious and probe deeper into the underlying needs and constraints.
Interview me continually and systematically until the spec is complete. Document all responses and insights to create a comprehensive and well-structured specification that serves as the foundation for the project."
Just built a new skill in Claude Code using Opus 4.5.
The skill uses Gemini 3 Pro (via API) for designing web pages.
Look at what it generated from one simple prompt.
If you have been designing websites with Claude Code, you already know how generic they turn out.
So I built a skill that uses Gemini 3 Pro to lead creative direction and generate designs. It is extremely good at this.
Opus 4.5 then integrates all that into our app.
The prompt I used: "I want to design the landing page for a new AI game. We want it to be futuristic and all that, and use animations as much as possible."
I will test with some other prompts and see how far I can push this. But the results are very exciting already.
This is one of the most insane things Nano Banana Pro 🍌 can do.
It can reproduce figures with mind-blowing precision.
No competition in this regard!
Prompt: "Please reproduce this chart in high quality and fidelity and offer annotated labels to better understand it."
When I tried this for the first time, I didn't expect that this was possible.
The level of understanding this requires is what's remarkable about it all.
The levels of personalization this unlocks are also impressive.
"Can you convert it into a cartoonish version?"
Just look at this 🤯
"Can you create a delightful cartoonish version of this table. And please put cute colors and icons along with interesting annotations to make it more readable."