TuringPost
Jun 26 · 7 tweets · 4 min read
Models, datasets and benchmarks to pay attention to:

▪️ Gemini 2.5 Flash and Pro, plus Gemini 2.5 Flash-Lite
▪️ MiniMax-M1
▪️ Kimi-Dev-72B

▪️ SHADE-Arena benchmark
▪️ ESSENTIAL-WEB V1.0 dataset

🧵
1. @Google introduced Gemini 2.5 Flash and Pro as stable and production-ready, and launched Gemini 2.5 Flash-Lite in preview, its fastest and most cost-efficient 2.5 model.

Flash-Lite outperforms 2.0 Flash-Lite on coding, math, science, reasoning, and multimodal benchmarks. It has lower latency, supports a 1 million-token context window and multimodal input, and connects to tools like Google Search and code execution.

storage.googleapis.com/deepmind-media…
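A minimal sketch of calling Flash-Lite with Search grounding via the google-genai Python SDK (the preview model id below is an assumption; check Google's model list for the current alias):

```python
# Hedged sketch: calling Gemini 2.5 Flash-Lite with Google Search grounding
# via the google-genai Python SDK. The preview model id below is an assumption;
# the current alias may differ.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # or set the GEMINI_API_KEY env var

response = client.models.generate_content(
    model="gemini-2.5-flash-lite-preview-06-17",  # assumed preview model id
    contents="Summarize this week's open-source LLM releases.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],  # Search grounding
        max_output_tokens=512,
    ),
)
print(response.text)
```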
3. Kimi-Dev-72B by Moonshot AI

It's a 72.7B-parameter open-source coding LLM fine-tuned from Qwen2.5-72B. It sets a new SOTA on SWE-bench Verified with 60.4% accuracy. It is optimized with large-scale RL to patch real GitHub issues in Docker environments, rewarded only when the full test suite passes.

Available on Hugging Face and GitHub
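A minimal sketch of loading it with transformers, assuming the Hugging Face repo id moonshotai/Kimi-Dev-72B (the prompt is just a placeholder):

```python
# Hedged sketch: loading Kimi-Dev-72B with transformers. The repo id
# "moonshotai/Kimi-Dev-72B" is an assumption; a 72.7B-parameter model needs
# several GPUs (or quantization) to fit in memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-Dev-72B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard the weights across available GPUs
)

messages = [{"role": "user", "content": "Fix the failing test in utils/date_parse.py"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```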
4. @AnthropicAI, @scale_AI, and @redwood_ai developed SHADE-Arena, a suite of 17 complex evaluations testing if LLMs can secretly complete sabotage tasks alongside benign ones.

Models had to complete the sabotage task without being caught by an AI monitor. No model succeeded more than 30% of the time, and evasion rates topped out around 60%. Claude Sonnet 3.7 was better at concealing its reasoning, and Gemini 2.5 Pro outperformed human monitors at detection but produced many false positives.

anthropic.com/research/shade…
5. ESSENTIAL-WEB V1.0 dataset by @essential_ai

It's a 24-trillion-token Common Crawl corpus annotated with a 12-category taxonomy across 23.6B documents.

Labels made with Qwen2.5-32B-Instruct were distilled into a 0.5B model, making annotation 50x faster with less than 3% quality loss.

Filters built from these labels produced domain datasets that beat or closely approach SOTA: math (-8%), code (+14.3%), STEM (+24.5%), medical (+8.6%). All data and tools are open source.

arxiv.org/abs/2506.14111
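A minimal sketch of streaming the corpus and filtering by taxonomy label. The dataset id EssentialAI/essential-web-v1.0 and the field names are assumptions, so check the dataset card for the real schema:

```python
# Hedged sketch: streaming ESSENTIAL-WEB V1.0 and keeping documents from a
# target domain. The dataset id and field names ("text", a per-document
# taxonomy annotation) are assumptions; check the dataset card for the real schema.
from datasets import load_dataset

stream = load_dataset("EssentialAI/essential-web-v1.0", split="train", streaming=True)

def is_stem(example):
    # Hypothetical schema: a taxonomy dict with a top-level category per document.
    return example.get("taxonomy", {}).get("category") in {"Science", "Engineering", "Math"}

stem_docs = (ex["text"] for ex in stream if is_stem(ex))
for _, doc in zip(range(3), stem_docs):
    print(doc[:200])
```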
Stay ahead with other fascinating AI/ML news in our free weekly digest: turingpost.com/p/fod106


More from @TheTuringPost

Jun 19
Models and datasets to pay attention to:

▪️ Institutional Books 1.0 - a 242B token dataset
▪️ o3-pro from @OpenAI
▪️ FGN from @GoogleDeepMind
▪️ Magistral by @MistralAI
▪️ Resa: Transparent Reasoning Models via SAEs
▪️ Multiverse (Carnegie+NVIDIA)
▪️ Ming-Omni
▪️ Seedance 1.0 by ByteDance
▪️ Sentinel

🧵
1. Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability

Sourced from 1,075,899 scanned books across 250+ languages via the Google Books project, the dataset includes both raw and post-processed text and detailed metadata.

arxiv.org/abs/2506.08300
2. o3-pro from @OpenAI

A high-reliability LLM for math, science, and coding. It beats o1-pro and o3 in expert evaluations of clarity, instruction-following, and accuracy. It includes tool access (web search, code execution, vision) but responds more slowly.

It replaces o1-pro for Pro/Team users (OpenAI also cut the price of o3 by 80%).

help.openai.com/en/articles/96…
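A minimal sketch of calling it through OpenAI's Responses API (the "o3-pro" model name and access depend on your account tier):

```python
# Hedged sketch: calling o3-pro through OpenAI's Responses API. The "o3-pro"
# model name and availability depend on your account tier.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="o3-pro",
    input="Prove that the sum of the first n odd numbers is n^2.",
)
print(response.output_text)
```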
Jun 18
The latest AI/ML news of the week:

▪️ @HuggingFace helps to find the best model based on size
▪️ NVIDIA’s Jensen Huang and @ylecun disagree with Anthropic’s Dario Amodei predictions
▪️ @AIatMeta’s Superintelligence Gambit
▪️ @Google adds a voice to Search
▪️ Mattel and @OpenAI: brains to Barbie
▪️ Projects in ChatGPT

Details 🧵
1. Hugging Face insists, “Bigger isn’t better”
2. @Nvidia’s Jensen Huang: “I disagree with almost everything he says”
At VivaTech in Paris, he took aim at Anthropic’s Dario Amodei, scoffing at his dire predictions about AI replacing half of entry-level jobs.

Huang argues for open, responsible development – not “dark room” AI monopolies. @ylecun agrees 👇
Jun 10
The freshest research papers:

▪️ Self-Challenging Language Model Agents
▪️ Reflect, Retry, Reward
▪️ ProRL
▪️ Beyond the 80/20 Rule
▪️ REASONING GYM
▪️ AlphaOne
▪️ Unleashing the Reasoning Potential...Critique Fine-Tuning
▪️ ARIA
▪️ Incentivizing Reasoning...Instruction Following
▪️ OThink-R1

▪️ Reasoning Like an Economist
▪️ A Controllable Examination for Long-Context LLMs
▪️ SuperWriter

▪️ Protocol Models
▪️ AReaL
▪️ StreamBP
▪️ Taming LLMs by Scaling Learning Rates

▪️ Diagonal Batching
▪️ Inference-Time Hyper-Scaling with KV Cache Compression
▪️ Unified Scaling Laws for Compressed Representations

▪️ GUI-Actor
▪️ Surfer-H Meets Holo1

▪️ Qwen3 Embedding
▪️ Aligning Latent Spaces with Flow Priors
▪️ Large Language Models are Locally Linear Mappings

▪️ Establishing Trustworthy LLM Evaluation
▪️ Evaluation is All You Need
▪️ Datasheets Aren't Enough

🧵
1. Self-Challenging Language Model Agents by @AIatMeta, @UCBerkeley

Trains agents to create and solve their own tool-use tasks using code-based problem generation and RL

arxiv.org/abs/2506.01716
2. Reflect, Retry, Reward

Enhances model performance by rewarding useful self-reflection after incorrect answers, using only binary feedback

arxiv.org/abs/2505.24726
Jun 7
Log-linear attention is a new type of attention proposed by researchers at @MIT. It is:

- as fast and efficient as linear attention
- as expressive as softmax attention

It uses a small but growing number of memory slots that increases logarithmically with the sequence length.

Here's how it works:
1. Input:

At each time step t, you have:

- Query vector (Q): what the model is asking
- Key vector (K): what the model remembers
- Value vector (V): what the model retrieves

They are computed from the input using learned linear projections.
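
A minimal PyTorch sketch of those projections (illustrative only, not the paper's code):

```python
# Illustrative sketch (not the paper's code): the learned linear projections
# that turn each token embedding into a query, key, and value vector.
import torch
import torch.nn as nn

d_model, d_head = 512, 64

q_proj = nn.Linear(d_model, d_head, bias=False)
k_proj = nn.Linear(d_model, d_head, bias=False)
v_proj = nn.Linear(d_model, d_head, bias=False)

x = torch.randn(1, 16, d_model)            # (batch, seq_len, d_model) token embeddings
Q, K, V = q_proj(x), k_proj(x), v_proj(x)  # each is (1, 16, d_head)
```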
2. Partition past tokens into buckets:

Using Fenwick tree-style hierarchical memory partitioning, the system divides the past tokens into logarithmically many disjoint buckets:

• Each bucket size is a power of two.
• The most recent token forms its own smaller bucket
• Older tokens are grouped into larger buckets

And here's why 👇
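
A toy sketch of that Fenwick-style decomposition (illustrative, not the authors' implementation): the bucket sizes for step t are exactly the set bits of t, so there are at most floor(log2 t) + 1 buckets and the most recent positions always fall in the smallest one.

```python
# Toy sketch of the Fenwick-style decomposition (illustrative, not the authors'
# code): positions 1..t are split into disjoint buckets whose sizes are the set
# bits of t, so there are at most floor(log2(t)) + 1 buckets and the most recent
# positions always land in the smallest one.
def fenwick_buckets(t: int):
    """Return (start, end) position ranges covering 1..t, oldest bucket first."""
    buckets = []
    while t > 0:
        size = t & (-t)                  # lowest set bit = size of the newest remaining bucket
        buckets.append((t - size + 1, t))
        t -= size
    return list(reversed(buckets))       # oldest (largest) bucket first

print(fenwick_buckets(13))  # [(1, 8), (9, 12), (13, 13)] -> bucket sizes 8, 4, 1
```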
Jun 6
.@JeffDean's interview at @Sequoia’s AI Ascent is a must-watch. He gives a real look at where AI is headed and what’s actually happening in the field, sharing insights on:

• Specialized hardware
• Evolution of models
• Future of computing infrastructure
• AI's role in science and more

Here are the key takeaways:
1. Where is AI going these days?

Models are improving fast and solving more problems each year. Hardware, training algorithms, and RL techniques have brought us here — and multimodal is a big focus for what’s next.
2. What about agents?

Jeff Dean sees huge potential in both virtual and robotic agents. With more training and experience, we’ll soon see them doing ~20 useful real-world tasks — unlocking a cycle of usefulness, cost reduction, and further improvements.
May 29
Latent reasoning lets the model do more of its "thinking" internally.

This internal information is continuous, unlike the model's discrete output text.

To efficiently mix this info, researchers from @UofIllinois proposed HRPO (Hybrid Reasoning Policy Optimization) – an RL-based hybrid latent reasoning framework.

Here's how it works:
1. HRPO uses reinforcement learning (RL) to train LLMs to reason internally without needing CoT training data.

It integrates hidden states into token sampling using a learnable gating mechanism.
2. A gating mechanism "decides" how much to use internal hidden states vs. regular token info.

At first, the model sticks mostly to word-level input. Over time, it learns to include more of the hidden state features.
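
A toy sketch of such a gate (an illustration, not HRPO's actual code): a per-feature learnable gate blends the previous hidden state with the sampled token's embedding, initialized so the token pathway dominates at first.

```python
# Toy sketch of the gating idea (an illustration, not HRPO's actual code):
# a per-feature learnable gate blends the previous hidden state with the
# sampled token's embedding. A strongly negative initial gate logit means the
# model starts from (almost) pure token embeddings and can learn, via RL,
# how much latent information to mix in.
import torch
import torch.nn as nn

class LatentTokenGate(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # sigmoid(-4.0) ~= 0.018, so the token pathway dominates at initialization
        self.gate_logit = nn.Parameter(torch.full((d_model,), -4.0))

    def forward(self, token_emb: torch.Tensor, hidden_state: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logit)
        return g * hidden_state + (1.0 - g) * token_emb

d_model = 768
gate = LatentTokenGate(d_model)
token_emb = torch.randn(1, d_model)     # embedding of the token just sampled
hidden_state = torch.randn(1, d_model)  # hidden state from the previous decoding step
next_input = gate(token_emb, hidden_state)  # fed back into the model as the next input
```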
