Hao AI Lab at UCSD. Our mission is to democratize large machine learning models, algorithms, and their underlying systems.
Apr 15 • 6 tweets • 4 min read
When Ilya Sutskever once explained why next-word prediction leads to intelligence, he offered a metaphor: if you can piece together the clues and deduce the criminal’s name on the last page, you have a real understanding of the story. 🕵️‍♂️
Inspired by that idea, we turned to Ace Attorney to test AI's reasoning. It’s the perfect stage: the AI plays as a detective to collect clues, expose contradictions, and uncover the truth.
We put the latest top AI models—GPT-4.1, Gemini 2.5 Pro, Llama-4 Maverick, and more—to the test in Ace Attorney to see if they could shout "Objection!" ⚖️, turn the case around, and uncover the truth behind the lies.
Phoenix Wright: Ace Attorney is a popular visual novel known for its complex storytelling and courtroom drama. Like a detective novel, it challenges players to connect clues and evidence to expose contradictions and reveal the true culprit.
In our setup, models are tested on the intense cross-examination stage: each model must spot contradictions and present the correct evidence to challenge witness testimony. Each level grants 5 lives, allowing limited tolerance for mistakes.
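To make the protocol concrete, here is a minimal sketch of that evaluation loop. All names here (agent.decide, level.step, etc.) are hypothetical stand-ins for illustration, not our actual harness:

```python
# Hypothetical sketch of the cross-examination loop -- illustrative names only.
# The model reads the testimony plus the court record and must object with the
# right piece of evidence; each level allows 5 lives.
def run_cross_examination(agent, level, lives=5):
    state = level.reset()  # initial testimony + court record
    while lives > 0 and not state.solved:
        # The agent picks an action: press a statement, or present a piece
        # of evidence against the statement it believes is contradicted.
        action = agent.decide(state.testimony, state.evidence)
        state, correct = level.step(action)
        if not correct:
            lives -= 1  # a wrong objection costs one life
    return state.solved  # True if the contradiction was exposed in time
```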
Apr 8 • 4 tweets • 2 min read
LLaMA-4 Maverick performs well on reasoning benchmarks and ranks 2nd on the Chatbot Arena, yet its true performance remains controversial. What if we put it in a transparent gaming environment? 🎮 Our benchmark tells a different story... 🤔
Will true intelligence shine through play? Let’s find out 👇
We directly compared LLaMA-4 Maverick with its reported peers.
Despite LLaMA-4’s strong performance on static benchmarks, in 2048, DeepSeek V3 still outperforms LLaMA-4, reaching a tile value of 256, whereas LLaMA-4 only reaches 128, the same as a random-move baseline.
In Candy Crush and Sokoban, which demand spatial reasoning, LLaMA-4 struggles and fails to clear the first level of either 🧩.
Check out our leaderboard for more detailed results!
Reasoning models often waste tokens on self-doubt.
Dynasor saves you up to 81% of tokens while still arriving at the correct answer! 🧠✂️
- Probe the model mid-reasoning to get its certainty
- Use that certainty to stop reasoning early
- 100% training-free, plug-and-play (sketch below 👇)
🎮Demo: hao-ai-lab.github.io/demo/dynasor-c…
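Here is a minimal sketch of the probe-and-exit idea, assuming an OpenAI-compatible completions endpoint (e.g., a local vLLM server). The model name, probe string, and thresholds are hypothetical placeholders, not the exact Dynasor implementation:

```python
# Illustrative sketch of certainty-based early exit, NOT the exact Dynasor code.
# Assumes an OpenAI-compatible completions server; names below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "deepseek-r1"                            # hypothetical model name
PROBE = "\n... Oh, I suddenly got the answer: "  # hypothetical probe text

def reason_with_early_exit(question, chunk=256, max_chunks=16, agree=2):
    trace, answers = "", []
    for _ in range(max_chunks):
        # 1) Let the model reason for one more chunk of tokens.
        step = client.completions.create(
            model=MODEL, prompt=question + trace, max_tokens=chunk
        ).choices[0].text
        trace += step
        # 2) Probe: append the probe text and peek at the current answer,
        #    without keeping the probe in the reasoning trace.
        peek = client.completions.create(
            model=MODEL, prompt=question + trace + PROBE,
            max_tokens=16, temperature=0
        ).choices[0].text.strip()
        answers.append(peek)
        # 3) Certainty check: if the last few probed answers agree, stop.
        if len(answers) >= agree and len(set(answers[-agree:])) == 1:
            return peek  # early exit: all remaining reasoning tokens are saved
    return answers[-1]
```

The design point: a model that is already certain keeps giving the same probed answer chunk after chunk, so probe agreement is a cheap certainty signal that needs no training or model changes.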
[2/n] Observation: Reasoning models (🟠) use WAY more tokens than needed vs traditional models (🔵).
Although reasoning models achieve higher acc%, they consume far more tokens than traditional models. Users who would accept a lower acc% still end up wasting tons of money 💰💰.