elvis

@omarsar0

Jul 10 • 21 tweets • 6 min read

BREAKING: xAI announces Grok 4

"It can reason at a superhuman level!"

Here is everything you need to know:

Elon claims that Grok 4 is smarter than almost all grad students in all disciplines simultaneously.

100x more training than Grok 2.

10x more compute on RL than any of the models out there.

Performance on Humanity's Last Exam

Elon: "Grok 4 is post-grad level in everything!"

Scaling HLE - Training

More compute, higher intelligence.

(no tools)

With native tool calling, Grok 4's performance increases significantly.

Look at those curves!

It's important to give AI the right tools. The scaling is clear. Crazy!

Reliable signals are key to making RL work.

There is still the challenge of data.

Elon: "Ultimate reasoning test is AI operating in reality."

Scaling test-time compute

More than 50% of the text-only subset of the HLE problems are solved!

The curves keep getting more ridiculous.

Grok 4 is the single-agent version.

Grok 4 Heavy is the multi-agent version.

Multi-agent systems are no joke!

Grok 4 is being used to predict this year's World Series champion.

These are the interesting tasks that reasoning models need to be tested on: actual real-world events.

A visualization of two black holes colliding.

Grok 4 draws on all kinds of references such as papers, reads PDFs, and reasons about the details of the simulation and what data to use.

The example shows a summary of the timeline/changes and score announcements in the HLE.

That's pretty cool!

Multi-modal performance

Grok 4 Heavy scores higher than Grok 4, but multi-modal performance still needs to improve. According to the team, it's one of the weaknesses.

Performance on Reasoning benchmarks.

Perfect score on AIME25!

Leaps are crazy compared to the last best model on these tasks.

Where to test the models.

Available in the SuperGrok Heavy tier.

$30/month for SuperGrok
$300/month for SuperGrok Heavy

Voice updates included, too!

Grok feels snappier and is designed to be more natural.

- 2x faster
- 5 voices
- 10x daily user seconds

ARC-AGI

Grok 4 on ARC-AGI v2 (private subset)

It breaks the 10% barrier (15.9%).

That's 2x the score of the second-place model, Claude Opus 4.

Grok 4 on Vending Bench

Grok 4 gets the #1 spot.

Double the net worth of Claude Opus 4.

Grok 4 models are available via the xAI API.

256K context window.

Real-time data search.

Grok 4 for Gaming!

Video understanding is an area the team is improving, so it will get better.

What is next?

Smart and fast will be the focus.

Coding models are also a big focus.

More capable multi-modal agents are coming too.

Video generation models are also on the horizon.

@elonmusk and the @xai team really cooked with Grok 4. Very exciting to see the focus on AI for reality, truth-seeking, and unlocking multi-modal agents next.

More from @omarsar0

Jul 8

MemAgent

MemAgent-14B is trained on 32K-length documents with an 8K context window.

Achieves >76% accuracy even at 3.5M tokens!

That consistency is crazy!

Here are my notes:

Overview

Introduces an RL-driven memory agent that enables transformer-based LLMs to handle documents up to 3.5 million tokens with near-lossless performance, linear complexity, and no architectural modifications.

RL-shaped fixed-length memory

MemAgent reads documents in segments and maintains a fixed-size memory updated via an overwrite mechanism.

This lets it process arbitrarily long inputs with O(N) inference cost while avoiding context window overflows.
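A minimal sketch of the segment-by-segment reading loop described above, under stated assumptions. It is not MemAgent's implementation: the names `read_with_fixed_memory`, `split_into_segments`, and `keep_tagged_tokens` are hypothetical, and the toy update rule stands in for the RL-trained model that decides what to write into memory. The point is only that one update per segment, with a memory that never exceeds its budget, gives cost linear in document length.

```python
from typing import Callable, List, Tuple


def split_into_segments(tokens: List[str], segment_len: int) -> List[List[str]]:
    """Cut a long token list into consecutive segments of at most segment_len tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def read_with_fixed_memory(
    tokens: List[str],
    segment_len: int,
    memory_len: int,
    update_memory: Callable[[List[str], List[str]], List[str]],
) -> Tuple[List[str], int]:
    """Read the document one segment at a time, overwriting a bounded memory.

    Returns the final memory and the number of update calls (one per segment).
    """
    if segment_len < 1 or memory_len < 1:
        raise ValueError("segment_len and memory_len must be positive")
    memory: List[str] = []
    calls = 0
    for segment in split_into_segments(tokens, segment_len):
        # Overwrite: the new memory replaces the old one and is cut to the budget.
        memory = update_memory(memory, segment)[-memory_len:]
        calls += 1
    return memory, calls


def keep_tagged_tokens(memory: List[str], segment: List[str]) -> List[str]:
    """Toy stand-in for the learned policy (assumption): remember tokens marked with '#'."""
    return memory + [tok for tok in segment if tok.startswith("#")]


document = ["a", "#x", "b", "c", "#y", "d", "e", "#z", "f", "g"]
final_memory, num_calls = read_with_fixed_memory(
    document, segment_len=3, memory_len=2, update_memory=keep_tagged_tokens
)
print(final_memory, num_calls)
```

With a 10-token document and 3-token segments, the loop makes four updates and the memory never holds more than two entries, however long the input grows.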

Jul 6

Agentic RAG for Personalized Recommendation

This is a really good example of integrating agentic reasoning into RAG.

Leads to better personalization and improved recommendations.

Here are my notes:

Overview

This work introduces a multi-agent framework, ARAG, that enhances traditional RAG systems with reasoning agents tailored to user modeling and contextual ranking.

It reframes recommendation as a structured coordination problem between LLM agents.

Instead of relying on static similarity-based retrieval, ARAG comprises four agents:

- User Understanding Agent synthesizes user preferences from long-term and session behavior.

- NLI Agent evaluates semantic alignment between candidate items and user intent.

- Context Summary Agent condenses relevant item metadata.

- Item Ranker Agent ranks final recommendations using all prior reasoning.
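The four-agent flow listed above can be sketched as a simple pipeline. This is an illustration under assumptions, not the ARAG implementation: every function name, the `Item` type, and the `threshold` parameter are hypothetical, and each "agent" is a plain-Python stub (word overlap instead of an LLM call) so the example runs on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    title: str
    description: str


def user_understanding_agent(long_term: List[str], session: List[str]) -> str:
    """Stub: merge long-term and session behavior into one lowercase preference summary."""
    return " ".join(long_term + session).lower()


def nli_agent(item: Item, user_summary: str) -> float:
    """Stub: score alignment as the share of item words that appear in the summary."""
    item_words = set(item.description.lower().split())
    user_words = set(user_summary.split())
    if not item_words:
        return 0.0
    return len(item_words & user_words) / len(item_words)


def context_summary_agent(
    items: List[Item], scores: List[float], threshold: float
) -> List[Tuple[str, float]]:
    """Stub: keep aligned items and condense their metadata to (title, score)."""
    return [(item.title, score) for item, score in zip(items, scores) if score >= threshold]


def item_ranker_agent(user_summary: str, context: List[Tuple[str, float]]) -> List[str]:
    """Stub: rank by alignment score; a real ranker would also reason over user_summary."""
    return [title for title, _ in sorted(context, key=lambda pair: pair[1], reverse=True)]


def recommend(
    long_term: List[str], session: List[str], candidates: List[Item], threshold: float = 0.2
) -> List[str]:
    """Run the four stages in order, each consuming the previous stage's output."""
    user_summary = user_understanding_agent(long_term, session)
    scores = [nli_agent(item, user_summary) for item in candidates]
    context = context_summary_agent(candidates, scores, threshold)
    return item_ranker_agent(user_summary, context)


candidates = [
    Item("Office Chair", "ergonomic chair for desks"),
    Item("Trail Socks", "socks for trail hiking"),
    Item("Rain Shell", "waterproof jacket for trail running"),
]
print(recommend(["trail running shoes"], ["waterproof jacket"], candidates))
```

The design point is the hand-off: each stage passes a structured result to the next instead of ranking directly on retrieval similarity.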

Jul 3

AI for Scientific Search

AI for Science is where I spend most of my time exploring with AI agents.

This 120+ page report does a good job of highlighting why all the big names like OpenAI and Google DeepMind are pursuing AI4Science.

Bookmark it!

My notes below:

There are five key areas:

(1) AI for Scientific Comprehension
(2) AI for Academic Survey
(3) AI for Scientific Discovery
(4) AI for Academic Writing
(5) AI for Academic Peer Review

Just look at the large body of work that's been happening in the space:

Jul 1

Small Language Models are the Future of Agentic AI

Lots to gain from building agentic systems with small language models.

Capabilities are increasing rapidly!

AI devs should be exploring SLMs.

Here are my notes:

Overview

This position paper argues that small language models (SLMs), defined pragmatically as those runnable on consumer-grade hardware, are not only sufficient but superior for many agentic AI applications, especially when tasks are narrow, repetitive, or tool-oriented.

The authors propose that shifting from LLM-first to SLM-first architectures will yield major gains in efficiency, modularity, and sustainability.

Jun 24

Ultra-Fast LLMs Based on Diffusion

> throughputs of 1109 tokens/sec and 737 tokens/sec
> outperforms speed-optimized frontier models by up to 10× on average

Diffusion LLMs are early, but could be huge.

More in my notes below:

✦ Overview

This paper introduces Mercury, a family of large-scale diffusion-based language models (dLLMs) optimized for ultra-fast inference.

Unlike standard autoregressive LLMs, Mercury models generate multiple tokens in parallel via a coarse-to-fine refinement process.

✦ Achieves higher throughput without sacrificing output quality

The release focuses on code generation, with Mercury Coder Mini and Small models achieving up to 1109 and 737 tokens/sec, respectively, on NVIDIA H100s.

Outperforms speed-optimized frontier models by up to 10×.

Jun 23

This paper is impressive!

It introduces a clever way of keeping memory use constant regardless of task length.

Great use of RL for AI agents to efficiently use memory and reasoning.

Here are my full notes:

Overview

The paper presents an RL framework for training language agents that operate efficiently over long-horizon, multi-turn tasks by learning to consolidate memory and reasoning into a compact internal state.

Constant Memory Size

Unlike traditional agents that append all past interactions, leading to ballooning memory usage and degraded performance, MEM1 maintains a constant memory size by discarding obsolete context after each reasoning step.
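To make the contrast concrete, here is a toy sketch, assuming a simple "find the cheapest offer" task. It is not MEM1's training setup: `run_append_all`, `run_consolidated`, and `keep_cheapest` are hypothetical names, and the hand-written consolidation rule stands in for what the paper's agent learns through RL. It only shows how stored context grows with every turn in one loop and stays constant in the other.

```python
from typing import Callable, List, Optional, Tuple

Offer = Tuple[str, float]  # (shop name, price)


def run_append_all(observations: List[Offer]) -> List[int]:
    """Baseline agent: keep every past observation; return memory size after each turn."""
    context: List[Offer] = []
    sizes: List[int] = []
    for obs in observations:
        context.append(obs)
        sizes.append(len(context))
    return sizes


def run_consolidated(
    observations: List[Offer],
    consolidate: Callable[[Optional[Offer], Offer], Offer],
) -> Tuple[Optional[Offer], List[int]]:
    """Constant-memory agent: fold each observation into one state, then discard it."""
    state: Optional[Offer] = None
    sizes: List[int] = []
    for obs in observations:
        state = consolidate(state, obs)  # older context is dropped after this step
        sizes.append(1)  # exactly one consolidated entry is stored per turn
    return state, sizes


def keep_cheapest(state: Optional[Offer], obs: Offer) -> Offer:
    """Toy consolidation rule (assumption): remember only the cheapest offer so far."""
    if state is None or obs[1] < state[1]:
        return obs
    return state


offers = [("shop_a", 12.0), ("shop_b", 9.5), ("shop_c", 11.0), ("shop_d", 9.9)]
print(run_append_all(offers))
print(run_consolidated(offers, keep_cheapest))
```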
