TuringPost
Newsletter exploring AI & ML - AI 101 - ML techniques - AI Business insights - Global dynamics - ML History Led by @kseniase_ Save hours of research 👇🏼
2 subscribers
Mar 25 20 tweets 12 min read
The freshest AI/ML research of the week:

Our top 2
▪️ Xattention
▪️ Inside-Out: Hidden Factual Knowledge in LLMs

▪️ Rwkv-7 "Goose"
▪️ ϕ-Decoding
▪️ Frac-connections
▪️ DAPO
▪️ Reinforcement learning for reasoning in small LLMs
▪️ MetaLadder
▪️ Measuring AI ability to complete long tasks
▪️ Why do multi-agent LLM systems fail?
▪️ Agents play thousands of 3D video games
▪️ GKG-LLM
▪️ Privacy, Synthetic Data, and Security
▪️ Scale-wise distillation of diffusion models
▪️ Multimodal chain-of-thought reasoning
▪️ Survey on evaluation of LLM-based agents
▪️ Stop overthinking: A survey on efficient reasoning
▪️ Aligning multimodal LLM with human preference

🧵
1. Xattention by @MIT, @Tsinghua_Uni, @sjtu1896 and @nvidia

Speeds up inference with block-sparse attention and antidiagonal scoring

huggingface.co/papers/2503.16…
Code: github.com/mit-han-lab/x-…
Mar 24 9 tweets 6 min read
7 open-source AI models of the week:

• @Microsoft’s KBLaM
• Fin-R1
• @nvidia’s Cosmos-Reason1
• @nvidia’s Cosmos-Transfer1
• M3 by @nvidia
• Tencent’s T1
• Roblox’ Cube

🧵
1. @Microsoft’s KBLaM integrates structured knowledge into LLMs with rectangular attention for low-latency, hallucination-resistant answers.

microsoft.com/en-us/research…

Code and database: github.com/microsoft/KBLa…
Mar 24 5 tweets 2 min read
There’s no single “right” answer for AI models in creative writing (such as writing a story), and open-ended thinking is a key part of creative intelligence.

Still, models often lack output diversity, so @midjourney dropped an interesting study on this 👇

▪️ Their idea is to add diversity directly into the training process:

They measured how much each response deviates from the other responses to the same prompt and used that deviation when training with DPO and ORPO, which gives the diversified DDPO and DORPO methods.

Here's how DDPO and DORPO work:

1. Diversified DPO (DDPO):

In the regular DPO method, the model learns by comparing a better response to a worse one.

In the diversified version, the researchers give more weight to rare or unique winning responses, i.e. those with higher deviation.

This helps the model pay more attention to uncommon but high-quality examples during training.
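The weighting idea can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the deviation of each winning response is already computed and simply multiplies the standard DPO loss by it. The function names, the tuple layout, and `beta=0.1` are assumptions made for the example.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winning, losing) pair of sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def ddpo_loss(pairs, beta=0.1):
    """Mean DPO loss where each pair is scaled by the winner's deviation.

    Each pair is (logp_w, logp_l, ref_logp_w, ref_logp_l, deviation).
    Scaling by the raw deviation is an assumption of this sketch.
    """
    losses = [dev * dpo_loss(lw, ll, rw, rl, beta) for (lw, ll, rw, rl, dev) in pairs]
    return sum(losses) / len(losses)

# Same log-probs, different deviation: the rarer winner contributes more loss.
common = (-5.0, -7.0, -6.0, -6.0, 0.2)
rare = (-5.0, -7.0, -6.0, -6.0, 0.9)
print(ddpo_loss([common]), ddpo_loss([rare]))
```

With identical log-probabilities, the pair with the higher deviation produces a proportionally larger loss, so its gradient counts for more during training.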
Mar 18 10 tweets 3 min read
The DiLoCo (Distributed Low-Communication) method by @GoogleAI and @GoogleDeepMind changes how models are trained:

Instead of constant syncing, multiple copies of the model are trained in parallel and sync only occasionally.

Scaling laws show how DiLoCo behaves as model size grows 🧵

At its core, DiLoCo follows a 2-level optimization process:

• Inner optimization: Each of the M model replicas trains independently, making local updates.

• Outer optimization: Every H steps, replicas sync their updates to adjust a global model, which is then shared with all replicas, repeating the cycle.
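The two levels can be sketched on a toy problem. This is a simplified illustration under stated assumptions: it fits a single weight `w` to `y = 3x`, uses plain SGD for both the inner and outer steps (DiLoCo itself uses AdamW for the inner step and Nesterov momentum for the outer step), and the shard data, learning rates, and function names are made up for the example.

```python
import random

def grad(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x

def diloco(shards, rounds=20, H=10, inner_lr=0.1, outer_lr=1.0, seed=0):
    """Toy DiLoCo loop: one replica per data shard, sync every H inner steps."""
    rng = random.Random(seed)
    global_w = 0.0
    for _ in range(rounds):
        deltas = []
        for shard in shards:
            w = global_w                      # replica starts from the global model
            for _ in range(H):                # inner optimization: local updates only
                x, y = rng.choice(shard)
                w -= inner_lr * grad(w, x, y)
            deltas.append(global_w - w)       # how far this replica moved
        outer_grad = sum(deltas) / len(deltas)
        global_w -= outer_lr * outer_grad     # outer optimization: one sync per round
    return global_w

shards = [
    [(1.0, 3.0), (2.0, 6.0)],     # replica 1 sees these samples
    [(-1.0, -3.0), (0.5, 1.5)],   # replica 2 sees different ones
]
print(diloco(shards))
```

The replicas communicate once per round instead of once per step, so with `H=10` the number of synchronizations drops by a factor of ten compared with syncing after every update.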

Here are scaling laws for DiLoCo:
Mar 11 8 tweets 4 min read
The latest AI/ML news of the week:

▪️ @perplexity_ai expands beyond the web
▪️ Manus: a Chinese high-performing AI agent
▪️ @Apple delayed Siri AI enhancements and introduced the new M3 Ultra chip
▪️ @CorticalLabs' CL1 computer fuses human brain cells with silicon
▪️ @MistralAI OCR
▪️ Andrew Barto and @RichardSSutton take home the 2024 Turing Award!

Find the details below 🧵
1. @perplexity_ai expands beyond the web

It partners with hardware firms to integrate its AI into everyday devices. This year, Deutsche Telekom’s AI Phone launches with Perplexity’s assistant, hinting at future moves. Phones for now, then TVs? Where next?

telekom.com/en/media/media…
Mar 10 9 tweets 6 min read
6 notable AI models of the week:

▪️ Differentiable Logic Cellular Automata @GoogleAI
▪️ Phi-4-Mini @Microsoft
▪️ Babel, Open Multilingual LLMs @AlibabaGroup
▪️ Aya Vision @CohereForAI
▪️ LLMVoX
▪️ LanDiff by Moonshot AI

🧵
1. Differentiable Logic Cellular Automata @GoogleAI

Integrates Neural Cellular Automata with Differentiable Logic Gate Networks to enable self-healing, pattern generation, and robust computational architectures.

google-research.github.io/self-organisin…
Mar 8 7 tweets 3 min read
Speculative Mixture-of-Experts (s-MoE) makes running MoE-based LLMs faster by reducing the communication overhead between GPUs.

s-MoE uses 2 techniques:

• Speculative Token Reshuffling (s-TS):
Predicts which experts tokens will use, rearranging tokens early to minimize token movement later.

• Speculative Expert Pre-grouping (s-EG):
Groups experts handling similar tokens together in advance to reduce communication.

s-MoE almost doubles performance over DeepSpeed-MoE and SGLang frameworks.

Here are the details:

The problem with MoE models

MoE inference efficiency is limited by Expert Parallelism (EP), as tokens are sent to specific experts located on different GPUs.

So tokens frequently move between GPUs, creating heavy communication overhead and slowing performance.

s-MoE can solve this👇
Mar 8 7 tweets 3 min read
Contrastive Sparse Representation (CSR) by @XDUofChina is an effective alternative to Matryoshka Representation Learning (MRL) for creating embeddings.

MRL can change embedding lengths but requires retraining the entire model and loses accuracy with short embeddings.

CSR solves this problem by using sparse coding: it keeps embeddings long but activates only a few dimensions (neurons), making them "sparse."

This makes CSR a simple, fast and accurate method.

Here's how it works:

Working process:

CSR works differently from MRL because it starts with already-trained embeddings, converts them into sparse representations, and then activates only the most important features (TopK).
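The TopK step can be sketched as follows. This is only an illustration of "project, then keep the k largest activations": in CSR the projection is learned, whereas the weight matrix here is hand-picked, and the function name `sparse_encode` is hypothetical.

```python
def sparse_encode(dense, weights, k):
    """Project a dense embedding to a wider layer, apply ReLU, keep the k largest values."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, dense))) for row in weights]
    # Indices of the k strongest activations; everything else is zeroed out.
    top = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)[:k]
    keep = {i for i in top if hidden[i] > 0.0}
    return [h if i in keep else 0.0 for i, h in enumerate(hidden)]

dense = [1.0, -2.0, 3.0]                 # an already-trained embedding (toy values)
weights = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
]                                        # stand-in for a learned projection
sparse = sparse_encode(dense, weights, k=2)
print(sparse)
```

The output vector is longer than the input (4 dimensions instead of 3) but has at most `k` non-zero entries, which is what keeps storage and comparison cheap.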

To ensure embeddings stay accurate and compact, it combines two losses👇
Mar 4 12 tweets 6 min read
The latest AI/ML news of the week:

▪️ @DeepSeek_ai: 6 extraordinary deliveries during #OpenSourceWeek
▪️ @AnthropicAI
- Claude 3.7 Sonnet
- Transparency Hub
- A fresh $3.5B Series E
▪️ @Google
- Gemini Code Assist free for all
- The AI co-scientist
▪️ @awscloud Center for Quantum Computing: Quantum error correction (QEC) scheme

Find the details below 🧵
1. @deepseek_ai delivered 6 major open-source AI optimizations

Explore in this thread 👇
Mar 4 8 tweets 3 min read
6 major AI optimizations by @DeepSeek_ai during #OpenSourceWeek:

- FlashMLA
- DeepEP
- DeepGEMM
- Optimized parallelism
- Fire-Flyer File System (3FS)
- DeepSeek-V3/R1 Inference System

🧵

1. FlashMLA:

Optimized Multi-head Latent Attention (MLA) for Hopper GPUs, achieving 3000 GB/s memory bandwidth and 580 TFLOPS compute.

(already over 11k stars on GitHub!)
github.com/deepseek-ai/Fl…
Mar 1 6 tweets 3 min read
SWE-RL from @AIatMeta is the first reinforcement learning (RL) method to improve AI for real-world software engineering tasks.

SWE-RL trains models by:

- Studying software evolution data from GitHub pull requests (PRs)
- Using simple rules to reward the model
- Teaching reasoning before coding

It solved 41.0% of issues in SWE-bench Verified, the best performance so far for medium-sized LLMs, and it also improved general reasoning skills.

Here's how it works:

1. GitHub pull request (PR) data:

Researchers gathered a seed dataset of 11M high-quality GitHub pull requests (PRs) and linked them to real issues to form training examples. Each example is structured to include the issue description, code context, and the correct fix.
Mar 1 6 tweets 2 min read
How do agents plan?

Here are 4 main planning techniques:

▪️ Classical AI planning (deliberative planning)
▪️ Reinforcement Learning (RL)
▪️ DeepSeek’s RL approach
▪️ Hierarchical planning

🧵

▪️ Classical AI planning (deliberative planning)

Agents find action sequences to reach a goal, using predefined models (like STRIPS, PDDL) and search algorithms, like depth-first search or A*. In LLM-based systems, classical planning adds structure and reliability.
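A tiny STRIPS-style planner shows the idea. This sketch is an illustration only: it uses breadth-first search rather than depth-first search or A*, and the robot domain, fact names, and the `Action` type are invented for the example.

```python
from collections import deque
from typing import NamedTuple

class Action(NamedTuple):
    name: str
    preconditions: frozenset   # facts that must hold before the action
    add: frozenset             # facts the action makes true
    delete: frozenset          # facts the action makes false

def plan(initial, goal, actions):
    """Breadth-first search over states; returns a shortest list of action names or None."""
    start, goal = frozenset(initial), frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:
            return path
        for action in actions:
            if action.preconditions <= state:
                nxt = (state - action.delete) | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [action.name]))
    return None

actions = [
    Action("move_a_b", frozenset({"at_a"}), frozenset({"at_b"}), frozenset({"at_a"})),
    Action("move_b_a", frozenset({"at_b"}), frozenset({"at_a"}), frozenset({"at_b"})),
    Action("pick_up", frozenset({"at_b", "box_at_b"}), frozenset({"holding"}), frozenset({"box_at_b"})),
]
steps = plan({"at_a", "box_at_b"}, {"at_a", "holding"}, actions)
print(steps)
```

Because every action has explicit preconditions and effects, the returned plan can be checked step by step, which is the kind of structure and reliability classical planning brings to an LLM-based agent.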
Feb 27 11 tweets 4 min read
Current LLM-serving systems treat each LLM call separately, causing delays in multi-step programs.

Autellix by @UCBerkeley, @GoogleDeepMind, and @sjtu1896 helps to fix it.

This new system "looks" at entire AI programs and schedules LLM requests based on the overall workflow.

• It makes AI programs run 4-15x faster, reducing waiting time and execution time for them.
• Allows the LLM engine to handle more requests at once.
• Maintains the same response speed.

Autellix achieves this through:
- smart scheduling
- efficient memory management
- better load balancing across multiple AI servers.

Here are the details:

1. Autellix improves scheduling in two ways to reduce delays:

• Program-aware prioritization: It tracks program history and prioritizes requests based on total execution time. Shorter programs get processed sooner.
• Preemptive scheduling: If a long request is slowing things down, it can be temporarily paused to allow shorter requests to go through first.
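Program-aware prioritization can be sketched with a priority queue. This is a rough illustration, not Autellix's implementation: it assumes each request is tagged with a program ID, ranks requests by the service time their program has already received, and leaves preemption out. The class and method names are hypothetical.

```python
import heapq
from collections import defaultdict

class ProgramAwareScheduler:
    """Serve requests from programs that have used the least engine time so far."""

    def __init__(self):
        self.attained = defaultdict(float)   # program_id -> seconds of service so far
        self._queue = []
        self._counter = 0                    # tie-breaker keeps FIFO order

    def record(self, program_id, seconds):
        """Add finished work to the program's history."""
        self.attained[program_id] += seconds

    def submit(self, program_id, request):
        """Queue a request with its program's history as the priority."""
        priority = self.attained[program_id]
        heapq.heappush(self._queue, (priority, self._counter, program_id, request))
        self._counter += 1

    def next_request(self):
        """Pop the request whose program has the least attained service."""
        if not self._queue:
            return None
        _, _, program_id, request = heapq.heappop(self._queue)
        return program_id, request

sched = ProgramAwareScheduler()
sched.record("long_agent", 30.0)             # this program has already run for a while
sched.submit("long_agent", "step 4")
sched.submit("short_chat", "step 1")
first = sched.next_request()
second = sched.next_request()
print(first, second)
```

Even though the long program's request arrived first, the short program's request is served first because its program has used less engine time.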
Feb 26 6 tweets 3 min read
MoBA, Mixture of Block Attention, from @Kimi_Moonshot improves handling long-context tasks with no fixed attention patterns.

Applying ideas from Mixture of Experts (MoE) to attention, MoBA lets the model dynamically decide where to focus.

This allows MoBA to be 6.5x faster than full attention for 1M tokens.

Here's how it works:

The working process, step by step:

- Instead of looking at everything at once, MoBA divides the text into smaller sections (blocks).
- It scores the blocks, groups and organizes them, and prioritizes the most relevant ones for each task.
- Only the top-scoring blocks are used for attention.
- MoBA ensures that attention is only on past and present words, keeping the process natural and logical.
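The steps above can be sketched for a single query. This is a simplified illustration under assumptions: blocks are scored by the dot product between the query and the mean of each block's keys, the query's own block is always kept and counts toward `top_k`, and only keys up to the query's position are passed in so that future tokens are never visible. The function names are made up for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_blocks(query, keys, block_size, top_k):
    """Pick which key blocks the query (at the last position) attends to.

    `keys` holds keys for positions 0..t only, so causality holds by construction.
    """
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    current = len(blocks) - 1                      # block containing the query itself
    scores = {b: dot(query, mean_vector(blocks[b])) for b in range(current)}
    best_past = sorted(scores, key=scores.get, reverse=True)[:max(top_k - 1, 0)]
    return sorted(best_past + [current])

query = [1.0, 0.0]
keys = [
    [0.0, 1.0], [0.0, 1.0],    # block 0: irrelevant to the query
    [5.0, 0.0], [3.0, 0.0],    # block 1: highly relevant
    [1.0, 0.0], [1.0, 2.0],    # block 2: somewhat relevant
    [9.0, 9.0],                # block 3: the query's own (partial) block
]
chosen = select_blocks(query, keys, block_size=2, top_k=2)
print(chosen)
```

Attention is then computed only over the keys in the chosen blocks, so the cost depends on `top_k * block_size` rather than on the full context length.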
Feb 25 26 tweets 15 min read
The freshest AI/ML research of the week:

Our top 9
▪️ SigLIP 2
▪️ Intuitive Physics Understanding Emerges from Self-Supervised Pretraining on Natural Videos
▪️ Native Sparse Attention
▪️ OctoTools
▪️ ReLearn
▪️ On the Trustworthiness of Generative Foundation Models
▪️ S* Test Time Scaling for Code Generation
▪️ Autellix (Serving Engine for LLM Agents)
▪️ Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

▪️ SurveyX
▪️ From RAG to Memory: Non-Parametric Continual Learning for LLMs
▪️ How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
▪️ Train Small, Infer Large
▪️ Eager Updates for Overlapped Communication and Computation in DiLoCo
▪️ S^2R: Teaching LLMs to Self-verify and Self-correct via RL
▪️ Logic-RL
▪️ Discovering Highly Efficient Low-Weight Quantum Error-Correcting Codes with RL
▪️ Armap
▪️ Thinking Preference Optimization
▪️ Rethinking Diverse Human Preference Learning through Principal Component Analysis
▪️ Craw4LLM
▪️ LLMs and Mathematical Reasoning Failures
▪️ Small Models Struggle to Learn from Strong Reasoners
▪️ Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options

🧵
1. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, @GoogleDeepMind

Advances vision-language learning with multilingual training and improved zero-shot capabilities

huggingface.co/papers/2502.14…
Checkpoints: github.com/google-researc… x.com/12714828789589…
Feb 18 5 tweets 2 min read
3 models to pay attention to:

▪️ LM2: Large Memory Models

- Uses a Transformer architecture with a memory module to improve long-context reasoning.
- Outperforms RMT by 37.1% and excels in multi-hop inference.

▪️ NatureLM:

- Is trained across scientific domains.
- Enhances tasks like SMILES-to-IUPAC translation and CRISPR RNA design for cross-domain applications.

▪️ Goedel-Prover:

- Advances formal proof generation
- Achieves 57.6% Pass@32 on miniF2F using expert iteration and statement formalizers.

Find the links below👇
1. LM2: Large Memory Models by Convergence Labs Ltd.

huggingface.co/papers/2502.06…
Feb 18 24 tweets 14 min read
The freshest AI/ML research of the week:

Our top 7
▪️ Matryoshka Quantization
▪️ LLM Pretraining with Continuous Concepts
▪️ LLMs can easily learn to reason from demonstrations
▪️ Forget what you know about LLMs evaluations – LLMs are like a chameleon
▪️ Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
▪️ Hephaestus
▪️ SynthDetoxM Dataset

▪️ The Curse of Depth in LLMs
▪️ InfiniteHiP
▪️ Distillation Scaling Laws
▪️ TransMLA: Multi-Head Latent Attention
▪️ Logical reasoning in LLMs: A survey
▪️ ReasonFlux
▪️ How Stanford’s s1 surpasses DeepSeek-R1
▪️ The Stochastic Parrot on LLM’s Shoulder
▪️ Training LMs for Social Deduction with Multi-Agent RL
▪️ Towards Internet-scale training for agents
▪️ WorldGUI
▪️ CoSER: Coordinating LLM-Based Persona Simulation
▪️ Scaling Pre-training to One Hundred Billion Data for VLMs
▪️ Adapting Language-Specific LLMs to Reasoning Models

🧵
1. Matryoshka Quantization from @GoogleDeepMind

Introduces MatQuant, a multi-scale quantization method that mixes int2, int4, and int8 layers for efficient model deployment

huggingface.co/papers/2502.06… x.com/12714828789589…
Feb 16 8 tweets 4 min read
Free, useful guides on model distillation:

1. Model Distillation guide from @OpenAI
2. Knowledge Distillation tutorial by @PyTorch
3. Jetson Introduction to Knowledge Distillation by @nvidia
4. Tutorial on Knowledge Distillation with @kerasteam
5. @huggingface's guides:
- Knowledge Distillation
- Knowledge Distillation for Computer Vision

Save the link and check out the links below 👇

1. Model Distillation guide from @OpenAI

Explains the process step by step, including:
- storing outputs from a large model
- evaluating both large and small models
- creating training data for a small model
- assessing the fine-tuned small model

platform.openai.com/docs/guides/di…
Feb 15 10 tweets 3 min read
Distillation involves using a large teacher model to train a smaller student one.

But can we predict a distilled model’s performance based on teacher quality, student size, data volume, etc.?

@Apple and @UniofOxford explored this and developed distillation scaling laws.

Here are the key takeaways👇

1. A good teacher doesn’t always mean a better student:

If a teacher is too strong, the student might struggle to learn from it, leading to worse performance.
This is called the capacity gap — when the student isn’t powerful enough to properly mimic the teacher.
Feb 10 23 tweets 13 min read
The freshest AI/ML research of the week:

Our top 4
▪️ AlphaGeometry2
▪️ ZebraLogic
▪️ Limo: Less is More for Reasoning
▪️ Great Models Think Alike and this Undermines AI Oversight

▪️ Activation-Informed Merging of LLMs
▪️ Content-Format Integrated Prompt Optimization (CFPO)
▪️ BOLT: Bootstrapping Long Chain-of-Thought
▪️ Token Assorted: Mixing Latent & Text Tokens
▪️ ScoreFlow
▪️ The Jumping Reasoning Curve?
▪️ Demystifying Long Chain-of-Thought Reasoning in LLMs
▪️ MAGA
▪️ ParetoQ: Scaling Laws in Extremely Low-Bit LLM Quantization
▪️ Analyze Feature Flow to Enhance Interpretation and Steering in LMs
▪️ PILAF
▪️ DuoGuard
▪️ Limitations of LLMs in Clinical Problem-Solving
▪️ AI and Legal Analysis
▪️ HackerRank-ASTRA
▪️ The Open-Source Advantage in LLMs
▪️ UltraIF: Advancing Instruction-Following

🧵
1. AlphaGeometry2 (Olympiad Geometry Solver) from @GoogleDeepMind

Enhances AlphaGeometry to solve IMO-level geometry problems with a broader formal language

huggingface.co/papers/2502.03… x.com/12714828789589…
Feb 10 6 tweets 3 min read
Sliding Tile Attention (STA) speeds up video generation by up to 3.53x.

It focuses only on small, relevant regions at a time and moves across the video in a sliding pattern.

STA processes larger chunks (tiles) at once, making it faster and more hardware-efficient.

Here's how it works:

First, what's wrong with current methods?

3D attention, which is generally used in Diffusion Transformers (DiTs), processes all video frames at once and treats every pixel separately. This takes a huge amount of computing power, about 70% of the total effort.

The problem with traditional Sliding Window Attention (SWA) is that it creates "mixed blocks," which are inefficient for GPUs.

That's why the researchers proposed the Sliding Tile Attention (STA) method.
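The "mixed blocks" problem can be shown with a 1D toy example. This sketch is a simplification and an assumption-laden illustration: real STA works on 3D video tokens, while here a token-wise sliding window and a tile-wise sliding window are compared on an 8-token sequence, and each 2x2 block of the attention mask is classified as fully dense, fully empty, or mixed. All function names are made up for the example.

```python
def sliding_window_mask(n, radius):
    """Token i attends to token j when they are at most `radius` positions apart."""
    return [[1 if abs(i - j) <= radius else 0 for j in range(n)] for i in range(n)]

def sliding_tile_mask(n, tile, tile_radius):
    """Token i attends to token j when their tiles are at most `tile_radius` tiles apart."""
    return [[1 if abs(i // tile - j // tile) <= tile_radius else 0 for j in range(n)]
            for i in range(n)]

def block_kinds(mask, block):
    """Count dense, empty, and mixed (block x block) regions; n must be divisible by block."""
    n = len(mask)
    counts = {"dense": 0, "empty": 0, "mixed": 0}
    for r in range(0, n, block):
        for c in range(0, n, block):
            cells = [mask[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            if all(cells):
                counts["dense"] += 1
            elif not any(cells):
                counts["empty"] += 1
            else:
                counts["mixed"] += 1
    return counts

window = block_kinds(sliding_window_mask(8, radius=2), block=2)
tiles = block_kinds(sliding_tile_mask(8, tile=2, tile_radius=1), block=2)
print(window)
print(tiles)
```

The token-wise window leaves partially filled blocks along its edges, which a GPU kernel must still load and mask. The tile-wise window produces only dense or empty blocks, so empty ones can be skipped entirely and dense ones need no masking.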