Shen Zhuoran
Nov 26
I want to break down how challenging the setup is and how fundamental the breakthrough will be. It requires the ability to:

- recognize a computer interface from a video stream, w/o APIs
- reason with complexity under tight time limits
- execute actions on a computer w/ no need of APIs
- do all the above in <150ms

Together, these will not only be a massive game RL milestone, but also unlock the potential to

- massively automate any work primarily done on a computer
- do so without anyone manually writing APIs for each piece of legacy software
- execute actions at human or superhuman speed

That will be a moment that fundamentally extends AI's capabilities and reshapes the entire economy.

More details:

# Setup

Previous works like @OpenAI Five and @GoogleDeepMind AlphaStar all used APIs to read game states and execute actions. So they had instant access to the most accurate game state data, sometimes more than humans have access to (e.g. AlphaStar's earlier version had global vision, while humans only have local vision). And their execution accuracy was perfect (unless they introduced artificial random offsets and random delays, as later versions of AlphaStar did).

@grok 5 will read a camera stream, parse out all the information, remember things that are off screen or happened a few minutes earlier, and locate the exact pixel to click, all at a competitive reaction time.

## Reaction speed

Pro players have reaction times down to 150ms, so that's the latency we can tolerate from camera capture to execution output.

The model also has to sustain a very high throughput of actions. I am not as familiar with League of Legends, but in StarCraft 2, elite professional players can perform >1000 actions per minute during intense battles. That translates to >16Hz of action output.
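To make the arithmetic concrete, here is a minimal sketch of the conversion from actions per minute to an output rate. The function names are hypothetical, and the "in flight" figure rests on my own assumption that every action uses the full 150ms budget end to end.

```python
def apm_to_hz(actions_per_minute: float) -> float:
    """Convert actions per minute to actions per second (Hz)."""
    return actions_per_minute / 60.0


def ms_per_action(actions_per_minute: float) -> float:
    """Average time available per action, in milliseconds."""
    return 60_000.0 / actions_per_minute


APM = 1000        # lower bound quoted above for intense StarCraft 2 battles
LATENCY_MS = 150  # reaction-time budget quoted above

rate_hz = apm_to_hz(APM)
interval_ms = ms_per_action(APM)
# Assumption: if each action takes the full latency budget end to end,
# this many actions must overlap in the pipeline to sustain the rate.
in_flight = LATENCY_MS / interval_ms

print(f"{rate_hz:.2f} Hz, {interval_ms:.0f} ms per action, {in_flight:.1f} in flight")
```

At 1000 APM this works out to roughly 16.7Hz, or about 60ms per action. Under the stated assumption, the latency budget is longer than the gap between actions, so the model cannot simply finish one action before starting to think about the next.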

## Perception

To do this, we need high-speed, from-pixel computer interface understanding. The model must be able to read high-resolution raw pixels of a computer interface and understand it in tens of milliseconds.
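For a rough sense of scale, the sketch below estimates how much raw pixel data such a stream carries. The resolution, color depth, and frame rate are my own illustrative assumptions (a 1920x1080 RGB stream at 60 frames per second); the thread does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    """Hypothetical description of an uncompressed video stream."""
    width: int
    height: int
    bytes_per_pixel: int
    fps: int

    def bytes_per_frame(self) -> int:
        return self.width * self.height * self.bytes_per_pixel

    def bytes_per_second(self) -> int:
        return self.bytes_per_frame() * self.fps

    def frame_interval_ms(self) -> float:
        return 1000.0 / self.fps


# Assumed values, not taken from the thread.
stream = Stream(width=1920, height=1080, bytes_per_pixel=3, fps=60)

print(f"{stream.bytes_per_frame() / 1e6:.1f} MB per frame")
print(f"{stream.bytes_per_second() / 1e6:.0f} MB per second")
print(f"{stream.frame_interval_ms():.1f} ms between frames")
```

Under these assumptions, each frame is about 6MB and a new one arrives roughly every 17ms, so perception alone has to keep pace with hundreds of megabytes of pixels per second.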

## Reasoning

The setup introduces challenging reasoning tasks:

1. The model must reason under tight time limits to decide the best reaction to the immediate context, for example, an opponent ambushing the champion from a bush.

2. But simultaneously, it has to maintain coherence and reason over a long time horizon. For example, in a skirmish, the decision to use certain valuable resources or skills could be determined by the overall strategy of the team, the composition of the team, where the team wants to take the game, and neutral objective timelines.

3. It also has to reason under high uncertainty. The model might decide that clicking a certain pixel is the optimal action at the moment, but there is no guarantee that the action will be executed in time or on the exact pixel. The model's strategy must be robust to these imperfections in execution introduced by the video-in, action-out interface.

4. It has to reason with imperfect information. This challenge is not new or unique, but it is still amplified by the new interface.
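Point 3 can be illustrated with a toy noise model. The sketch below assumes, purely for illustration, that a click lands at the intended pixel plus independent Gaussian noise on each axis; the thread does not specify any noise model. Under that assumption, the chance of landing within a given radius of the target has a closed form, which the simulation cross-checks.

```python
import math
import random


def hit_probability(radius_px: float, sigma_px: float) -> float:
    """P(click lands within radius_px of the target) when the x and y
    errors are independent Gaussians with standard deviation sigma_px."""
    return 1.0 - math.exp(-(radius_px ** 2) / (2.0 * sigma_px ** 2))


def simulate_hit_rate(radius_px: float, sigma_px: float,
                      trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        dx = rng.gauss(0.0, sigma_px)
        dy = rng.gauss(0.0, sigma_px)
        if math.hypot(dx, dy) <= radius_px:
            hits += 1
    return hits / trials


# Assumed numbers: 10 px of click noise, small vs. large click targets.
for radius in (10, 20, 40):
    exact = hit_probability(radius, sigma_px=10)
    approx = simulate_hit_rate(radius, sigma_px=10)
    print(f"radius {radius:>2} px: analytic {exact:.3f}, simulated {approx:.3f}")
```

The practical reading is that a plan relying on a small, precise click can fail often even when the decision itself is right, so a robust policy has to weigh the value of an action against the chance that it is actually executed as intended.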

## Execution

The model has to fluently navigate the computer interface with raw input primitives, like mouse clicks and keyboard inputs. Instead of saying "I want to buy this item in League of Legends," it has to click into the store, navigate the interface to find the correct item, and complete the purchase, all using raw computer control primitives.
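As a rough illustration of the gap between a high-level intent and raw primitives, the sketch below expands "buy an item" into a sequence of key presses, mouse moves, and clicks. Every type, key binding, and coordinate here is hypothetical, and the store flow is a simplification rather than the actual League of Legends interface. In the real setup, the coordinates would have to come from perception on the video stream rather than being passed in.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class KeyPress:
    key: str


@dataclass(frozen=True)
class MouseMove:
    x: int
    y: int


@dataclass(frozen=True)
class MouseClick:
    button: str  # "left" or "right"


Primitive = Union[KeyPress, MouseMove, MouseClick]
Point = Tuple[int, int]


def buy_item(item_name: str, search_box: Point, first_result: Point,
             buy_button: Point, shop_key: str = "p") -> List[Primitive]:
    """Expand one intent into raw inputs (hypothetical UI flow)."""
    steps: List[Primitive] = [KeyPress(shop_key)]              # open the store
    steps += [MouseMove(*search_box), MouseClick("left")]      # focus search
    steps += [KeyPress(ch) for ch in item_name]                # type the name
    steps += [MouseMove(*first_result), MouseClick("left")]    # select item
    steps += [MouseMove(*buy_button), MouseClick("left")]      # purchase
    steps.append(KeyPress("escape"))                           # close the store
    return steps


# Placeholder coordinates for illustration only.
plan = buy_item("boots", search_box=(400, 120),
                first_result=(420, 200), buy_button=(900, 640))
print(len(plan), "primitives for one purchase")
```

Even this simplified flow turns a single intent into more than a dozen inputs, each of which has to land correctly and within the time budget.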

# Implications

If the model can successfully accomplish all of the above, it means:
1. It can read and understand any computer interface without needing a specialized API.
2. It can navigate any computer interface without any specialized API.
3. It can reason and produce a complex plan that is robust to real-world interference, imperfections, and randomness.
4. It can do all of the above at human or superhuman speed.

Such a model will be a game changer for AI capabilities and the global economy. Essentially, for anything a human expert can do primarily on a computer, this model will have a high chance of automating it end-to-end, with higher accuracy than an average human practitioner and in the same amount of time or less.
