Models, datasets and benchmarks to pay attention to:
▪️ Gemini 2.5 Flash and Pro, plus Gemini 2.5 Flash-Lite
▪️ MiniMax-M1
▪️ Kimi-Dev-72B
▪️ SHADE-Arena benchmark
▪️ ESSENTIAL-WEB V1.0 dataset
🧵
1. @Google introduced Gemini 2.5 Flash and Pro as stable and production-ready, and launched Gemini 2.5 Flash-Lite in preview – the fastest and most cost-efficient.
Flash-Lite outperforms 2.0 Flash-Lite in coding, math, science, reasoning, and multimodal benchmarks. It features lower latency, supports 1 million-token context, multimodal input, and connects to tools like Google Search and code execution
storage.googleapis.com/deepmind-media…
2. MiniMax-M1 by @MiniMax__AI
3. Kimi-Dev-72B by Moonshot AI
It's a 72.7B-parameter open-source coding LLM fine-tuned from Qwen2.5-72B. Sets a new SOTA on SWE-bench Verified with 60.4% accuracy. Optimizes with large-scale RL to fix real GitHub Docker issues, rewarded only when full test suites pass.
Available on Hugging Face and GitHub
4. @AnthropicAI, @scale_AI, and @redwood_ai developed SHADE-Arena, a suite of 17 complex evaluations testing if LLMs can secretly complete sabotage tasks alongside benign ones.
Models needed to complete tasks and avoid AI detection. None had over 30% success; evasion topped at ~60%. Claude Sonnet 3.7 better concealed thoughts. Gemini 2.5 Pro surpassed humans but had many false positives.
anthropic.com/research/shade…
5. ESSENTIAL-WEB V1.0 dataset by @essential_ai
It's a 24-trillion-token Common Crawl corpus annotated with a 12-category taxonomy across 23.6B documents.
Labels made with Qwen2.5-32B-Instruct were distilled into a 0.5B model, making annotation 50x faster with less than 3% quality loss.
Filters helped domain datasets beat or match SOTA: math (-8%), code (+14.3%), STEM (+24.5%), medical (+8.6%). All data and tools are open-source
arxiv.org/abs/2506.14111
Stay ahead with other fascinating AI/ML news in our free weekly digest: turingpost.com/p/fod106
Share this Scrolly Tale with your friends.
A Scrolly Tale is a new way to read Twitter threads with a more visually immersive experience.
Discover more beautiful Scrolly Tales like this.