📣 Introducing the Qwen-Robot Suite — Qwen-RobotNav, Qwen-RobotManip, Qwen-RobotWorld, three foundation models, a full stack for embodied intelligence.
🧭 Qwen-RobotNav — the gateway to mobility.
• Unifies 5 navigation tasks in one model: instruction following, point-goal, object-goal, target tracking, autonomous driving
• Controllable observation protocol
• Tool interface for agentic systems
🤖 Qwen-RobotManip — the foundation of interaction.
• Unified state-action space across heterogeneous robots
• Camera-frame delta poses for coherent cross-embodiment training
• Pretrained on a 38,100+ hour open-source corpus
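One way to picture the "camera-frame delta poses" bullet is the sketch below. It is an illustration under assumptions, not the Qwen-RobotManip implementation: the post does not define the convention, so the code assumes each robot reports end-effector poses in its own base frame and that a rotation `R_cam_base` (base axes to camera axes) is known. The helper names are hypothetical.

```python
# Minimal sketch (assumed convention, hypothetical names): express the change
# between two end-effector poses in camera axes, so robots with differently
# oriented base frames produce the same action for the same observed motion.

def matmul(a, b):
    """3x3 matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """3x3 transpose (the inverse, for a rotation matrix)."""
    return [list(row) for row in zip(*a)]

def matvec(a, v):
    """3x3 matrix times 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def camera_frame_delta(R_cam_base, pose_t, pose_t1):
    """Return (delta_rotation, delta_translation) in camera axes.

    pose_t and pose_t1 are (R, p) pairs in the robot's base frame.
    """
    (R0, p0), (R1, p1) = pose_t, pose_t1
    dp_base = [b - a for a, b in zip(p0, p1)]   # translation change, base axes
    dR_base = matmul(R1, transpose(R0))         # rotation change, base axes
    dp_cam = matvec(R_cam_base, dp_base)        # re-express in camera axes
    dR_cam = matmul(matmul(R_cam_base, dR_base), transpose(R_cam_base))
    return dR_cam, dp_cam

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
RZ90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Robot A: base axes aligned with the camera; it moves 0.1 along base x.
_, dp_a = camera_frame_delta(I3, (I3, [0.0, 0.0, 0.0]), (I3, [0.1, 0.0, 0.0]))
# Robot B: base rotated 90 degrees about z; the same motion is -0.1 along base y.
_, dp_b = camera_frame_delta(RZ90, (I3, [0.0, 0.0, 0.0]), (I3, [0.0, -0.1, 0.0]))
```

Under this assumed convention, both robots yield the same camera-frame translation even though their base-frame readings differ, which is the property that would make data from different robots comparable.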
🌍 Qwen-RobotWorld — infinite worlds for physical agents.
• Single world model, 20+ embodiments
• Natural-language action interface
• Predicts physically grounded futures across manipulation, driving, and navigation
Each model is independently useful and can be composed with the others as physical-world tools. Together, they form the low-level toolkit for general-purpose agentic systems that don't just see the world, but act in it.
Qwen-RobotNav is a scalable navigation model built on Qwen3-VL. It exposes a parameterised interface with two complementary dimensions: task modes that select the navigation behaviour, and controllable observation parameters (token budget, temporal decay, per-camera weights) that govern how visual history is encoded.
Trained on 15.6 million samples with training-time randomization over all parameters, Qwen-RobotNav generalizes to any inference-time configuration without architectural modification, unifying five task families under a single set of weights and serving as a natural building block for agentic systems.
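To make the parameterised interface more concrete, here is a rough sketch of what sampling a task mode and observation parameters per training example could look like. Everything in it is an assumption for illustration: the identifiers, the value ranges, and the exponential rule for splitting the token budget across frames do not come from the Qwen-RobotNav release.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical identifiers for the five task families named in the post.
TASK_MODES = (
    "instruction_following", "point_goal", "object_goal",
    "target_tracking", "autonomous_driving",
)

@dataclass
class ObservationConfig:
    token_budget: int                 # total visual tokens for the history
    temporal_decay: float             # in (0, 1]; lower favours recent frames
    camera_weights: Dict[str, float]  # per-camera share, sums to 1

def sample_task(rng: random.Random) -> str:
    """Pick a task mode uniformly (assumed sampling scheme)."""
    return rng.choice(TASK_MODES)

def sample_config(rng: random.Random, cameras: List[str]) -> ObservationConfig:
    """Randomise every observation parameter, as done per training sample."""
    raw = [rng.uniform(0.1, 1.0) for _ in cameras]
    total = sum(raw)
    return ObservationConfig(
        token_budget=rng.choice([256, 512, 1024, 2048]),  # assumed options
        temporal_decay=rng.uniform(0.5, 1.0),             # assumed range
        camera_weights={c: w / total for c, w in zip(cameras, raw)},
    )

def allocate_tokens(cfg: ObservationConfig,
                    num_frames: int) -> Dict[Tuple[int, str], int]:
    """Split the budget over (frame age, camera); age 0 is the newest frame."""
    frame_w = [cfg.temporal_decay ** age for age in range(num_frames)]
    norm = sum(frame_w)
    return {
        (age, cam): int(cfg.token_budget * fw / norm * cw)
        for age, fw in enumerate(frame_w)
        for cam, cw in cfg.camera_weights.items()
    }

rng = random.Random(0)
cfg = sample_config(rng, ["front", "left", "right"])
alloc = allocate_tokens(cfg, num_frames=4)
```

The idea being illustrated is that a model trained on many such randomly drawn configurations would have seen the whole parameter range, so at inference time a caller could choose any budget, decay, or camera weighting without changing the architecture.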
Read more about Qwen-RobotNav on the blog: qwen.ai/blog?id=qwen-r…
Qwen-RobotManip is a generalizable Vision-Language-Action (VLA) foundation model built upon Qwen-VL. It introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting.
Using only open-source robotic manipulation datasets and human demonstration videos, with no proprietary data collection, Qwen-RobotManip builds a ~38,100-hour pretraining corpus and already exhibits emergent generalization capabilities.
🚀 Qwen3.5-397B-A17B is here: The first open-weight model in the Qwen3.5 series.
🖼️ Natively multimodal. Trained for real-world agents.
✨ Powered by hybrid linear attention + sparse MoE and large-scale RL environment scaling.
⚡ 8.6x–19.0x decoding throughput vs Qwen3-Max
🌍 201 languages & dialects
📜 Apache 2.0 licensed
🎁 A New Year gift from Qwen — Qwen-Image-2512 is here.
🚀 Our December upgrade to Qwen-Image, just in time for the New Year.
✨ What’s new:
• More realistic humans — dramatically reduced “AI look,” richer facial details
• Finer natural textures — sharper landscapes, water, fur, and materials
• Stronger text rendering — better layout, higher accuracy in text–image composition
🏆 Tested in 10,000+ blind rounds on AI Arena, Qwen-Image-2512 ranks as the strongest open-source image model, while staying competitive with closed-source systems.
🚀 Introducing Qwen3-Omni — the first natively end-to-end omni-modal AI unifying text, image, audio & video in one model — no modality trade-offs!
🏆 SOTA on 22/36 audio & AV benchmarks
🌍 119 text languages / 19 speech-input languages / 10 speech-output languages
⚡ 211ms latency | 🎧 30-min audio understanding
🎨 Fully customizable via system prompts
🔗 Built-in tool calling
🎤 Open-source Captioner model (low-hallucination!)
🌟 What’s Open-Sourced?
We’ve open-sourced Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner, to empower developers to explore a variety of applications from instruction-following to creative tasks.