Fugu stands shoulder-to-shoulder with leading models like Fable and Mythos across the industry's most rigorous engineering, scientific, and reasoning benchmarks.
Beyond Bigger Models: Why are Orchestration Models the Next Frontier
Progress in AI has been driven largely by giant, monolithic models. But the most powerful systems of the future will be collaborative ecosystems.
Today, this orchestration is no longer just a technical optimization. It has become a geopolitical and operational imperative.
For an organization or a nation, relying on a single company's model for critical infrastructure, finance, or governance is a material vulnerability. This risk is no longer a hypothetical possibility, but a reality.
As we have seen with recent export controls imposed on models like Fable and Mythos, access can disappear overnight.
Collective intelligence is the practical hedge against this concentration of power. Because Fugu orchestrates an underlying pool of swappable agents, it simply routes around vendor restrictions.
By orchestrating the world’s models, we are delivering the resilient blueprint required for true AI sovereignty.
How does it work?
Sakana Fugu is itself an LLM, trained to call various LLMs in an agent pool, including instances of itself recursively. Fugu dynamically orchestrates the world's best models to tackle complex, multi-step tasks.
As shown in this figure, Fugu is a multi-agent system that behaves like a single model. You send a request to one endpoint, and Fugu decides how to handle it internally.
Fugu manages model selection, delegation, verification, and synthesis automatically. It solves tasks directly when that is enough, or coordinates a team of expert models when a problem calls for more. The complexity of a multi-agent system never reaches your code.
At launch, Sakana Fugu comes in two models accessed via a single OpenAI-compatible API:
• Fugu balances strong performance with low latency for everyday work. It fits naturally into tools like Codex for coding, as well as chatbots and interactive services. You can also opt specific agents out of its pool for data compliance.
• Fugu Ultra is our flagship model tuned for maximum answer quality on hard, multi-step problems. It coordinates a deeper pool of expert agents for demanding work like AI research, cybersecurity analysis, and patent investigations.
• • •
Missing some Tweet in this thread? You can try to
force a refresh
Our new inference-time scaling algorithm enables collective intelligence for AI by allowing multiple frontier models (like Gemini 2.5 Pro, o4-mini, DeepSeek-R1-0528) to cooperate.
Inspired by the power of human collective intelligence, where the greatest achievements arise from the collaboration of diverse minds, we believe the same principle applies to AI. Individual frontier models like ChatGPT, Gemini, and DeepSeek are remarkably advanced, each possessing unique strengths and biases stemming from their training, which we view as valuable resources for collective problem-solving.
AB-MCTS (Adaptive Branching Monte Carlo Tree Search) harnesses these individualities, allowing multiple models to cooperate and engage in effective trial-and-error, solving challenging problems for any single AI. Our initial results on the ARC-AGI-2 benchmark are promising, with AB-MCTS combining o4-mini + Gemini-2.5-Pro + R1-0528, current frontier AI models, significantly outperforming individual models by a substantial margin.
This research builds on our 2024 work on evolutionary model merging, shifting focus from “mixing to create” to “mixing to use” existing, powerful AIs. At Sakana AI, we remain committed to pioneering novel AI systems by applying nature-inspired principles such as evolution and collective intelligence. We believe this work represents a step toward a future where AI systems collaboratively tackle complex challenges, much like a team of human experts, unlocking new problem-solving capabilities and moving beyond single-model limitations.
The AB-MCTS combination of o4-mini + Gemini-2.5-Pro + R1-0528, current frontier AI models, achieves strong performance on the ARC-AGI-2 benchmark, outperforming individual models by a large margin.
Many ARC-AGI-2 examples that were unsolvable by any single LLM were solved by combining multiple LLMs. In some cases, an initially incorrect attempt by o4-mini is used by R1-0528 and Gemini-2.5-Pro as a hint to get to the correct solution.