Shen Zhuoran

@CMS_Flash

Nov 26, 2025

I want to break down how challenging the setup is and how fundamental the breakthrough will be. It requires the ability to:

- recognize a computer interface from a video stream, w/o APIs
- reason with complexity under tight time limits
- execute actions on a computer w/ no need of APIs
- do all the above in <150ms

The 3 combined will not only be a massive game RL milestone, but also unlock the potential to

- massively automate any work primarily done on a computer
- without needing manual work to write APIs for each piece of legacy software
- execute actions at a human or superhuman speed

That will be a moment that fundamentally extends AI's capabilities and reshapes the entire economy.

More details:

# Setup

Previous works like @OpenAI Five and @GoogleDeepMind AlphaStar all used APIs to read game states and execute actions. So they had instant access to the most accurate game state data, sometimes more than humans have access to (e.g. an earlier version of AlphaStar had global vision, while humans only have local vision). Their execution accuracy is also perfect (unless they introduce artificial random offsets and random delays, as later versions of AlphaStar did).

@grok 5 will read a camera stream, parse out all the information, remember things that are off screen or that happened a few minutes earlier, and locate the exact pixel to click, all at a competitive reaction time.

## Reaction speed

Pro players have reaction times down to 150ms, so that's the latency we can tolerate from camera capture to execution output.

The model also needs a very high throughput of actions. I am not as familiar with League of Legends, but in StarCraft 2, elite professional players can perform >1000 actions per minute during intense battles. That translates to >16Hz of action output.
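The arithmetic behind these two numbers can be sketched as follows. This is only an illustration: the function names are made up for this example, and the in-flight estimate assumes the model uses the full 150ms budget for every action.

```python
import math

REACTION_BUDGET_MS = 150.0  # camera-to-action latency budget stated above


def apm_to_hz(apm: float) -> float:
    """Convert actions per minute to actions per second (Hz)."""
    return apm / 60.0


def action_interval_ms(apm: float) -> float:
    """Average time between two consecutive actions, in milliseconds."""
    return 60_000.0 / apm


apm = 1000
rate_hz = apm_to_hz(apm)
gap_ms = action_interval_ms(apm)

# If each action takes the full latency budget end to end, several actions
# must be in progress at once to sustain the target rate (an assumption of
# this sketch, not a claim about any real system).
in_flight = math.ceil(REACTION_BUDGET_MS / gap_ms)

print(f"{apm} APM = {rate_hz:.1f} Hz, one action every {gap_ms:.0f} ms")
print(f"actions in flight at {REACTION_BUDGET_MS:.0f} ms latency: {in_flight}")
```

At 1000 actions per minute the gap between actions is shorter than the reaction budget, so under these assumptions perception, reasoning, and execution would have to overlap rather than run strictly one action at a time.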

## Perception

To do this, we need high-speed, from-pixel computer interface understanding. The model must be able to read the high-resolution raw pixels of a computer interface and understand them in tens of milliseconds.

## Reasoning

The setup introduces challenging reasoning tasks:

1. The model must reason under tight time limits to decide the best reaction to the immediate context, for example when an opponent ambushes the champion from a bush.

2. At the same time, it has to maintain coherence and reason over a long time horizon. For example, in a skirmish, the decision to use certain valuable resources or skills could be determined by the overall strategy of the team, the composition of the team, where the team wants to take the game, and neutral objective timelines.

3. It also has to reason under high uncertainty: the model might decide that clicking a certain pixel is the optimal action at the moment, but there is no guarantee that the action will be completed in time or land on the exact pixel. The model's strategy must be robust to these imperfections in execution introduced by the video-in, action-out interface.

4. It has to reason with imperfect information. This challenge is not new or unique, but it is amplified by the new interface.
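To make the execution uncertainty in point 3 concrete, here is a toy simulation of how often a click lands on a circular target when the cursor position is off by random noise. The Gaussian noise model, the pixel values, and the function name are all assumptions for illustration; nothing here describes how any real system behaves.

```python
import random


def hit_probability(target_radius_px: float,
                    jitter_std_px: float,
                    trials: int = 20_000,
                    seed: int = 0) -> float:
    """Estimate the chance that a noisy click lands inside a circular target.

    Assumes the click is aimed at the target's center and that the error in
    x and y is independent Gaussian noise with the same standard deviation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        dx = rng.gauss(0.0, jitter_std_px)
        dy = rng.gauss(0.0, jitter_std_px)
        if dx * dx + dy * dy <= target_radius_px ** 2:
            hits += 1
    return hits / trials


# Hypothetical numbers: a 10 px target with 5 px of click noise.
print(f"estimated hit rate: {hit_probability(10.0, 5.0):.3f}")
```

Under this noise model the exact answer is 1 - exp(-R^2 / (2 * sigma^2)), so the estimate can be checked against it. A policy that is robust in the sense of point 3 would weigh an action's payoff by a probability like this instead of assuming the click always lands.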

## Execution

The model has to be able to fluently navigate the computer interface with raw input primitives, like mouse clicks and keyboard inputs. Instead of saying "I want to buy this item in League of Legends," it has to click into the store, navigate the interface to find the correct item, and complete the purchase, all using raw computer control primitives.
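The gap between a high-level intent and raw primitives can be sketched like this. The types, the hotkey, and the screen coordinates are hypothetical placeholders; in the setup described here the coordinates would have to come from the model's own reading of the screen, not from an API.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class KeyPress:
    key: str


@dataclass(frozen=True)
class MouseClick:
    x: int
    y: int
    button: str = "left"


Primitive = Union[KeyPress, MouseClick]


def buy_item_plan(shop_key: str,
                  item_xy: Tuple[int, int],
                  buy_button_xy: Tuple[int, int]) -> List[Primitive]:
    """Expand the intent 'buy this item' into an ordered list of raw inputs."""
    return [
        KeyPress(shop_key),          # open the shop window
        MouseClick(*item_xy),        # select the item in the shop grid
        MouseClick(*buy_button_xy),  # confirm the purchase
        KeyPress(shop_key),          # close the shop window
    ]


# Hypothetical hotkey and pixel positions, for illustration only.
for step in buy_item_plan("p", (400, 300), (900, 650)):
    print(step)
```

An API-based agent would issue the single intent; a video-in, action-out agent has to produce every step in the list, and each step is subject to the perception and execution errors discussed above.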

# Implications

If the model can successfully accomplish all of the above, it means:
1. It can read and understand any computer interface without needing a specialized API.
2. It can navigate any computer interface without any specialized API.
3. It can reason and produce a complex plan that is robust to real-world interference, imperfections, and randomness.
4. It can do all of the above at human or superhuman speed.

Such a model will be a game changer for AI capabilities and the global economy. Essentially, for anything a human expert can do primarily on a computer, this model will have a high chance of automating it end-to-end, with higher accuracy than an average human practitioner and in the same amount of time or less.
