Hao AI Lab at UCSD. Our mission is to democratize large machine learning models, algorithms, and their underlying systems.
Apr 15 • 6 tweets • 4 min read
When Ilya Sutskever once explained why next-word prediction leads to intelligence, he offered a metaphor: if you can piece together the clues and deduce the criminal’s name on the last page, you have a real understanding of the story. 🕵️‍♂️
Inspired by that idea, we turned to Ace Attorney to test AI's reasoning. It’s the perfect stage: the AI plays as a detective to collect clues, expose contradictions, and uncover the truth.
We put the latest top AI models—GPT-4.1, Gemini 2.5 Pro, Llama-4 Maverick, and more—to the test in Ace Attorney to see if they could shout "Objection!" ⚖️, turn the case around, and uncover the truth behind the lies.
Phoenix Wright: Ace Attorney is a popular visual novel known for its complex storytelling and courtroom drama. Like a detective novel, it challenges players to connect clues and evidence to expose contradictions and reveal the true culprit.
In our setup, models are tested on the intense cross-examination stage: each model must spot contradictions and present the correct evidence to challenge witness testimony. Each level grants 5 lives, allowing limited tolerance for mistakes.
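To make the protocol concrete, here is a minimal sketch of that evaluation loop. All names here (agent.decide, level.step, etc.) are hypothetical stand-ins for illustration, not our actual harness:

```python
# Hypothetical sketch of the cross-examination loop -- illustrative names only.
# The model reads the testimony plus the court record and must object with the
# right piece of evidence; each level allows 5 lives.
def run_cross_examination(agent, level, lives=5):
    state = level.reset()  # initial testimony + court record
    while lives > 0 and not state.solved:
        # The agent picks an action: press a statement, or present a piece
        # of evidence against the statement it believes is contradicted.
        action = agent.decide(state.testimony, state.evidence)
        state, correct = level.step(action)
        if not correct:
            lives -= 1  # a wrong objection costs one life
    return state.solved  # True if the contradiction was exposed in time
```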
Apr 8 • 4 tweets • 2 min read
LLaMA-4 Maverick performs well on reasoning benchmarks and ranks 2nd on the Chatbot Arena, yet its true performance remains controversial. What if we put it in a transparent gaming environment? 🎮 Our benchmark tells a different story... 🤔
Will true intelligence shine through play? Let’s find out 👇
We directly compared LLaMA-4 Maverick with its reported peers.
Despite LLaMA-4’s strong performance on static benchmarks, in 2048, DeepSeek V3 still outperforms LLaMA-4, reaching a tile value of 256, whereas LLaMA-4 only reaches 128, the same as a random-move baseline.
In Candy Crush and Sokoban, which demand spatial reasoning, LLaMA-4 struggles and fails to clear the first level of either 🧩.
Check out our leaderboard for more detailed results!
Reasoning models often waste tokens on self-doubt.
Dynasor saves you up to 81% of tokens while still arriving at the correct answer! 🧠✂️
- Probe the model mid-reasoning to get its certainty
- Use that certainty to stop reasoning early
- 100% training-free, plug-and-play (sketch below 👇)
🎮Demo: hao-ai-lab.github.io/demo/dynasor-c…
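Here is a minimal sketch of the probe-and-exit idea, assuming an OpenAI-compatible completions endpoint (e.g., a local vLLM server). The model name, probe string, and thresholds are hypothetical placeholders, not the exact Dynasor implementation:

```python
# Illustrative sketch of certainty-based early exit, NOT the exact Dynasor code.
# Assumes an OpenAI-compatible completions server; names below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "deepseek-r1"                            # hypothetical model name
PROBE = "\n... Oh, I suddenly got the answer: "  # hypothetical probe text

def reason_with_early_exit(question, chunk=256, max_chunks=16, agree=2):
    trace, answers = "", []
    for _ in range(max_chunks):
        # 1) Let the model reason for one more chunk of tokens.
        step = client.completions.create(
            model=MODEL, prompt=question + trace, max_tokens=chunk
        ).choices[0].text
        trace += step
        # 2) Probe: append the probe text and peek at the current answer,
        #    without keeping the probe in the reasoning trace.
        peek = client.completions.create(
            model=MODEL, prompt=question + trace + PROBE,
            max_tokens=16, temperature=0
        ).choices[0].text.strip()
        answers.append(peek)
        # 3) Certainty check: if the last few probed answers agree, stop.
        if len(answers) >= agree and len(set(answers[-agree:])) == 1:
            return peek  # early exit: all remaining reasoning tokens are saved
    return answers[-1]
```

The design point: a model that is already certain keeps giving the same probed answer chunk after chunk, so probe agreement is a cheap certainty signal that needs no training or model changes.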
[2/n] Observation: Reasoning models (🟠) use WAY more tokens than needed vs traditional models (🔵).
Although reasoning models achieve higher acc%, they consume far more tokens than traditional models. Users who would accept a lower acc% still end up wasting tons of money 💰💰.