Newsletter exploring AI & ML: AI 101, ML techniques, AI business insights, global dynamics, ML history. Led by @kseniase_. Save hours of research 👇🏼
Mar 25 • 20 tweets • 12 min read
The freshest AI/ML research of the week:
Our top 2
▪️ XAttention
▪️ Inside-Out: Hidden Factual Knowledge in LLMs
▪️ RWKV-7 "Goose"
▪️ ϕ-Decoding
▪️ Frac-connections
▪️ DAPO
▪️ Reinforcement learning for reasoning in small LLMs
▪️ MetaLadder
▪️ Measuring AI ability to complete long tasks
▪️ Why do multi-agent LLM systems fail?
▪️ Agents play thousands of 3D video games
▪️ GKG-LLM
▪️ Privacy, Synthetic Data, and Security
▪️ Scale-wise distillation of diffusion models
▪️ Multimodal chain-of-thought reasoning
▪️ Survey on evaluation of LLM-based agents
▪️ Stop overthinking: A survey on efficient reasoning
▪️ Aligning multimodal LLM with human preference
🧵 1. XAttention by @MIT, @Tsinghua_Uni, @sjtu1896 and @nvidia
Speeds up inference with block-sparse attention and antidiagonal scoring
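Here's a rough sketch of the antidiagonal idea: each (query-block, key-block) pair is scored by summing the attention logits that fall on strided antidiagonals inside the block, and only the highest-scoring key blocks are kept. For clarity the sketch computes the full logit matrix and uses a simple top-k selection; the paper evaluates only the strided entries and selects blocks against a cumulative threshold, so the names and parameters here are assumptions.

```python
import torch

def antidiagonal_block_scores(q, k, block=64, stride=8):
    """Score each (query-block, key-block) pair by summing the attention
    logits that fall on strided antidiagonals inside the block."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5              # full logits, for clarity only
    nb = scores.shape[0] // block
    blocks = scores[: nb * block, : nb * block]
    blocks = blocks.reshape(nb, block, nb, block).permute(0, 2, 1, 3)  # [q_block, k_block, block, block]
    i = torch.arange(block)
    anti = ((i[:, None] + i[None, :]) % stride == 0)                   # strided antidiagonal pattern
    return (blocks * anti).sum(dim=(-1, -2))                           # [nb, nb] block importance

def select_blocks(block_scores, keep_ratio=0.25):
    """Keep the highest-scoring key blocks for each query block (a top-k
    stand-in for the paper's cumulative-threshold selection)."""
    k_keep = max(1, int(block_scores.shape[-1] * keep_ratio))
    return block_scores.topk(k_keep, dim=-1).indices

q, k = torch.randn(256, 64), torch.randn(256, 64)
print(select_blocks(antidiagonal_block_scores(q, k)))                  # which key blocks each query block keeps
```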
There’s no single “right” answer for AI models in creative writing (like writing a story), and their open-ended thinking is a key part of creative intelligence.
Still, models often lack output diversity, so @midjourney dropped an interesting study on this 👇
▪️ Their idea is to add diversity directly into the training process:
They measured response deviation for the same prompt and used it to weight training with DPO and ORPO, producing the diversified DDPO and DORPO methods.
Here's how DDPO and DORPO work:
1. Diversified DPO (DDPO):
In the regular DPO method, the model learns by comparing a better response to a worse one.
In the diversified version, researchers add more weight to rare or unique winning responses, i.e. those with higher deviation.
This helps the model pay more attention to uncommon but high-quality examples during training.
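A minimal sketch of what deviation weighting could look like on top of a standard DPO loss. The `deviation` score and the `1 + deviation` weighting below are illustrative assumptions, not the study's exact formulation.

```python
import torch
import torch.nn.functional as F

def ddpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              deviation, beta=0.1):
    """Standard DPO objective, re-weighted so rare (high-deviation) winners count more."""
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    per_pair = -F.logsigmoid(logits)        # plain DPO loss for each preference pair
    weights = 1.0 + deviation               # assumed weighting: higher deviation -> larger weight
    return (weights * per_pair).mean()
```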
Mar 18 • 10 tweets • 3 min read
DiLoCo (Distributed Low-Communication) method by @GoogleAI and @GoogleDeepMind changes how training of models happens:
Instead of constant syncing, multiple copies of the model are trained in parallel and sync only occasionally.
Scaling laws show how DiLoCo behaves as model size grows 🧵
At its core, DiLoCo follows a 2-level optimization process:
• Inner optimization: Each model replica (M) trains independently, making local updates.
• Outer optimization: Every H steps, replicas sync their updates to adjust a global model, which is then shared with all replicas, repeating the cycle.
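A minimal single-process sketch of that two-level loop, with deep copies standing in for distributed replicas. The real recipe uses AdamW inside and Nesterov momentum SGD outside; this sketch uses plain SGD for the outer step, and all hyperparameters are placeholders.

```python
import copy
import torch
import torch.nn.functional as F

def diloco_round(global_model, make_batch, workers=4, H=100, inner_lr=1e-4, outer_lr=0.7):
    deltas = []
    for _ in range(workers):
        replica = copy.deepcopy(global_model)                        # each replica starts from the global weights
        inner_opt = torch.optim.AdamW(replica.parameters(), lr=inner_lr)
        for _ in range(H):                                           # inner optimization: H independent local steps
            x, y = make_batch()
            loss = F.mse_loss(replica(x), y)
            inner_opt.zero_grad(); loss.backward(); inner_opt.step()
        deltas.append([gp.data - rp.data                             # "outer gradient": how far the replica moved
                       for gp, rp in zip(global_model.parameters(), replica.parameters())])
    for i, gp in enumerate(global_model.parameters()):               # outer optimization: apply the averaged delta
        avg_delta = torch.stack([d[i] for d in deltas]).mean(dim=0)
        gp.data -= outer_lr * avg_delta                              # plain SGD outer step for brevity
    return global_model
```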
Here are scaling laws for DiLoCo:
Mar 11 • 8 tweets • 4 min read
The latest AI/ML news of the week:
▪️ @perplexity_ai expands beyond the web
▪️ Manus: a Chinese high-performing AI agent
▪️ @Apple delayed Siri AI enhancements and unveiled the new M3 Ultra chip
▪️ @CorticalLabs' CL1 computer fuses human brain cells with silicon
▪️ @MistralAI OCR
▪️ Andrew Barto and @RichardSSutton take home the 2024 Turing Award!
Find the details below 🧵
1. @perplexity_ai expands beyond the web
It partners with hardware firms to integrate its AI into everyday devices. This year, Deutsche Telekom’s AI Phone launches with Perplexity’s assistant, hinting at future moves. Phones for now, then TVs? Where next?
Speculative Mixture-of-Experts (s-MoE) makes running MoE-based LLMs faster by reducing the communication overhead between GPUs.
S-MoE uses 2 techniques:
• Speculative Token Reshuffling (s-TS):
Predicts which experts tokens will use, rearranging tokens early to minimize token movement later.
• Speculative Expert Pre-grouping (s-EG):
Groups experts handling similar tokens together in advance to reduce communication.
s-MoE almost doubles performance over DeepSpeed-MoE and SGLang frameworks.
Here are the details:
Problem of MoE models
MoE inference efficiency is limited by Expert Parallelism (EP), as tokens are sent to specific experts located on different GPUs.
So tokens frequently move between GPUs, creating heavy communication overhead and slowing performance.
s-MoE can solve this👇
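An illustrative sketch of the token-reshuffling idea: guess each token's expert with a cheap proxy router and reorder tokens so that those headed for the same expert (and therefore the same GPU) are contiguous before the all-to-all dispatch. The proxy router and function names are assumptions, not the paper's implementation.

```python
import torch

def speculative_reshuffle(tokens, proxy_router, num_experts):
    """tokens: [n, d]; proxy_router: a cheap layer approximating the real gate."""
    with torch.no_grad():
        predicted_expert = proxy_router(tokens).argmax(dim=-1)          # speculative expert guess per token
    order = torch.argsort(predicted_expert)                             # group tokens by predicted expert
    grouped = tokens[order]
    counts = torch.bincount(predicted_expert, minlength=num_experts)    # per-expert send sizes for the all-to-all
    return grouped, order, counts

tokens = torch.randn(1024, 512)
proxy = torch.nn.Linear(512, 8)                                         # stand-in proxy router
grouped, order, counts = speculative_reshuffle(tokens, proxy, num_experts=8)
# Tokens predicted for the same expert are now contiguous; only mispredicted
# stragglers need to move again once the true gate runs.
```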
Mar 8 • 7 tweets • 3 min read
Contrastive Sparse Representation (CSR) by @XDUofChina is an effective alternative to Matryoshka Representation Learning (MRL) for creating embeddings.
MRL can vary embedding length, but it requires retraining the entire model and loses accuracy with short embeddings.
CSR solves this problem by using sparse coding: It keeps embeddings longer but activates only a few parts (neurons), making them "sparse."
This makes CSR a simple, fast and accurate method.
Here's how it works:
Working process:
CSR works differently from MRL because it starts with already-trained embeddings, converts them into sparse representations, and then activates only the most important features (TopK).
To ensure embeddings stay accurate and compact, it combines two losses👇
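A minimal sketch of that recipe under stated assumptions: a small autoencoder with a TopK activation sits on top of frozen embeddings and is trained with a reconstruction loss plus a contrastive (InfoNCE-style) loss. Dimensions and the exact loss mix are illustrative, not CSR's official implementation.

```python
import torch
import torch.nn.functional as F

class TopKSparseEncoder(torch.nn.Module):
    def __init__(self, dim=768, hidden=8192, k=32):
        super().__init__()
        self.k = k
        self.enc = torch.nn.Linear(dim, hidden)
        self.dec = torch.nn.Linear(hidden, dim)

    def forward(self, x):
        z = F.relu(self.enc(x))
        topk = torch.topk(z, self.k, dim=-1)                            # activate only the k strongest neurons
        sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return sparse, self.dec(sparse)

def csr_losses(model, emb, emb_pos, temperature=0.07):
    """emb / emb_pos: frozen dense embeddings of matched (e.g. paired) inputs."""
    z, recon = model(emb)
    z_pos, _ = model(emb_pos)
    recon_loss = F.mse_loss(recon, emb)                                  # sparse code must still describe the embedding
    logits = F.normalize(z, dim=-1) @ F.normalize(z_pos, dim=-1).T / temperature
    contrast_loss = F.cross_entropy(logits, torch.arange(len(emb)))      # matched pairs should score highest
    return recon_loss + contrast_loss
```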
Mar 4 • 12 tweets • 6 min read
The latest AI/ML news of the week:
▪️ @DeepSeek_ai: 6 extraordinary deliveries during #OpenSourceWeek
▪️ @AnthropicAI
- Claude 3.7 Sonnet
- Transparency Hub
- A fresh $3.5B Series E
▪️ @Google
- Gemini Code Assist free for all
- The AI co-scientist
▪️ @awscloud Center for Quantum Computing: Quantum error correction (QEC) scheme
Find the details below 🧵
1. @deepseek_ai delivered 6 major open-source AI optimizations
SWE-RL from @AIatMeta - the first reinforcement learning (RL) method to improve AI for real-world software engineering tasks.
SWE-RL trains models by:
- Studying software evolution data from GitHub pull requests (PRs)
- Using simple rules to reward the model
- Teaching reasoning before coding
It solved 41.0% of issues in SWE-bench Verified, the best performance so far for medium-sized LLMs, and it also improved general reasoning skills.
Here's how it works:
1. GitHub pull request (PR) data:
Researchers gathered 11M high-quality GitHub pull requests (the seed dataset) and linked them to real issues as training examples, structured to include issue descriptions, code context, and the correct fixes.
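A hedged sketch of what such a rule-based reward can look like: compare the generated patch against the ground-truth patch from the merged PR with a simple sequence-similarity rule and penalize malformed output. The exact rules in SWE-RL may differ.

```python
import difflib

def patch_reward(predicted_patch, oracle_patch):
    if not predicted_patch or not predicted_patch.strip():
        return -1.0                                    # malformed or empty patch gets a fixed penalty
    return difflib.SequenceMatcher(
        None, predicted_patch, oracle_patch).ratio()   # similarity in [0, 1] used as the reward

print(patch_reward("- old_line\n+ new_line", "- old_line\n+ fixed_line"))
```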
🧵
▪️ Classical AI planning (deliberative planning)
Agents find action sequences to reach a goal, using predefined models (like STRIPS, PDDL) and search algorithms, like depth-first search or A*. In LLM-based systems, classical planning adds structure and reliability.
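A toy sketch of that deliberative style: states are sets of facts, actions carry precondition/add/delete lists in the STRIPS spirit, and a best-first search (A* with a zero heuristic here, i.e. uniform-cost search) returns an action sequence. The domain below is made up purely for illustration.

```python
import heapq
from itertools import count

def plan(start, goal, actions, heuristic=lambda s: 0):
    """actions: list of (name, preconditions, add_effects, delete_effects) over sets of facts."""
    start, goal = frozenset(start), frozenset(goal)
    tie = count()
    frontier = [(heuristic(start), 0, next(tie), start, [])]
    seen = set()
    while frontier:
        _, cost, _, state, path = heapq.heappop(frontier)
        if goal <= state:                              # every goal fact holds
            return path
        if state in seen:
            continue
        seen.add(state)
        for name, pre, add, delete in actions:
            if pre <= state:                           # action is applicable in this state
                nxt = frozenset((state - delete) | add)
                heapq.heappush(frontier, (cost + 1 + heuristic(nxt), cost + 1, next(tie), nxt, path + [name]))
    return None

actions = [("pick_up", {"hand_empty", "on_table"}, {"holding"}, {"hand_empty", "on_table"}),
           ("stack",   {"holding"},                {"stacked", "hand_empty"}, {"holding"})]
print(plan({"hand_empty", "on_table"}, {"stacked"}, actions))   # -> ['pick_up', 'stack']
```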
Feb 27 • 11 tweets • 4 min read
Current LLM-serving systems treat each LLM call separately, causing delays in multi-step programs.
Autellix by @UCBerkeley, @GoogleDeepMind, and @sjtu1896 helps to fix it.
This new system "looks" at entire AI programs and schedules LLM requests based on the overall workflow.
• It makes AI programs run 4-15x faster, cutting both waiting and execution time.
• Allows the LLM engine to handle more requests at once.
• Maintains the same response speed.
Autellix achieves this through:
- smart scheduling
- efficient memory management
- better load balancing across multiple AI servers.
Here are the details:
1. Autellix improves scheduling in two ways to reduce delays:
• Program-aware prioritization: It tracks program history and prioritizes requests based on total execution time. Shorter programs get processed sooner.
• Preemptive scheduling: If a long request is slowing things down, it can be temporarily paused to allow shorter requests to go through first.
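An illustrative sketch of program-aware, preemptive scheduling: each LLM call carries its program's id, calls from programs with less accumulated service time are served first, and a preempted call is simply requeued. This is a simplification of Autellix's actual schedulers, and all names here are assumptions.

```python
import heapq
from collections import defaultdict
from itertools import count

class ProgramAwareScheduler:
    def __init__(self):
        self.service = defaultdict(float)    # cumulative execution time each program has received
        self.queue = []                      # (priority, tie-breaker, program_id, request)
        self.tie = count()

    def submit(self, program_id, request):
        prio = self.service[program_id]      # programs with less attained service are served first
        heapq.heappush(self.queue, (prio, next(self.tie), program_id, request))

    def next_request(self):
        _, _, program_id, request = heapq.heappop(self.queue)
        return program_id, request

    def record(self, program_id, elapsed, leftover=None):
        self.service[program_id] += elapsed  # program-aware bookkeeping across its many LLM calls
        if leftover is not None:             # preemption: an unfinished request simply goes back in the queue
            self.submit(program_id, leftover)
```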
Feb 26 • 6 tweets • 3 min read
MoBA, Mixture of Block Attention, from @Kimi_Moonshot improves handling of long-context tasks without fixed attention patterns.
Applying ideas from Mixture of Experts (MoE) to attention, MoBA lets the model dynamically decide where to focus.
This allows MoBA to be 6.5x faster than full attention for 1M tokens.
Here's how it works:
Here's the working process, step by step:
- Instead of looking at everything at once, MoBA divides the text into smaller sections (blocks).
- It scores the blocks, groups and organizes them, prioritizing the most relevant ones for each query.
- Only the top-scoring blocks are used for attention.
- MoBA ensures that attention is only on past and present words, keeping the process natural and logical.
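A minimal sketch of that flow under stated assumptions: keys are split into blocks, each query scores blocks by their mean key, attention runs only over the top-k past blocks plus the query's own block, and a token-level mask keeps everything causal. The per-token loop is for clarity; real kernels are batched.

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block=128, topk=3):
    T, d = q.shape
    nb = T // block
    kb = k[: nb * block].reshape(nb, block, d).mean(dim=1)             # one summary key per block
    gate = q @ kb.T                                                    # [T, nb] block relevance per query
    q_block = torch.arange(T) // block
    gate = gate.masked_fill(torch.arange(nb)[None, :] > q_block[:, None], float("-inf"))  # no future blocks
    chosen = gate.topk(min(topk, nb), dim=-1).indices                  # top-k past blocks per query

    out = torch.zeros_like(q)
    for t in range(T):                                                 # per-token loop for clarity only
        blocks = torch.unique(torch.cat([chosen[t], q_block[t:t + 1]]))
        idx = torch.cat([torch.arange(b * block, min((b + 1) * block, T)) for b in blocks.tolist()])
        idx = idx[idx <= t]                                            # token-level causal mask in the current block
        attn = F.softmax(q[t] @ k[idx].T / d ** 0.5, dim=-1)
        out[t] = attn @ v[idx]
    return out

x = torch.randn(512, 64)
out = moba_attention(x, x, x)                                          # attends to at most topk+1 blocks per token
```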
Feb 25 • 26 tweets • 15 min read
The freshest AI/ML research of the week:
Our top 9
▪️ SigLIP 2
▪️ Intuitive Physics Understanding Emerges from Self-Supervised Pretraining on Natural Videos
▪️ Native Sparse Attention
▪️ OctoTools
▪️ ReLearn
▪️ On the Trustworthiness of Generative Foundation Models
▪️ S* Test Time Scaling for Code Generation
▪️ Autellix (Serving Engine for LLM Agents)
▪️ Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
▪️ SurveyX
▪️ From RAG to Memory: Non-Parametric Continual Learning for LLMs
▪️ How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
▪️ Train Small, Infer Large
▪️ Eager Updates for Overlapped Communication and Computation in DiLoCo
▪️ S^2R: Teaching LLMs to Self-verify and Self-correct via RL
▪️ Logic-RL
▪️ Discovering Highly Efficient Low-Weight Quantum Error-Correcting Codes with RL
▪️ Armap
▪️ Thinking Preference Optimization
▪️ Rethinking Diverse Human Preference Learning through Principal Component Analysis
▪️ Craw4LLM
▪️ LLMs and Mathematical Reasoning Failures
▪️ Small Models Struggle to Learn from Strong Reasoners
▪️ Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options
▪️ LM2:
- Uses a Transformer architecture with a memory module to improve long-context reasoning.
- Outperforms RMT by 37.1% and excels in multi-hop inference.
▪️ NatureLM:
- Is trained across scientific domains.
- Enhances tasks like SMILES-to-IUPAC translation and CRISPR RNA design for cross-domain applications.
▪️ Goedel-Prover:
- Advances formal proof generation
- Achieves 57.6% Pass@32 on miniF2F using expert iteration and statement formalizers.
Find the links below👇
1. LM2: Large Memory Models by Convergence Labs Ltd.
Our top 7
▪️ Matryoshka Quantization
▪️ LLM Pretraining with Continuous Concepts
▪️ LLMs can easily learn to reason from demonstrations
▪️ Forget what you know about LLMs evaluations – LLMs are like a chameleon
▪️ Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
▪️ Hephaestus
▪️ SynthDetoxM Dataset
▪️ The Curse of Depth in LLMs
▪️ InfiniteHiP
▪️ Distillation Scaling Laws
▪️ TransMLA: Multi-Head Latent Attention
▪️ Logical reasoning in LLMs: A survey
▪️ ReasonFlux
▪️ How Stanford’s s1 surpasses DeepSeek-R1
▪️ The Stochastic Parrot on LLM’s Shoulder
▪️ Training LMs for Social Deduction with Multi-Agent RL
▪️ Towards Internet-scale training for agents
▪️ WorldGUI
▪️ CoSER: Coordinating LLM-Based Persona Simulation
▪️ Scaling Pre-training to One Hundred Billion Data for VLMs
▪️ Adapting Language-Specific LLMs to Reasoning Models
🧵 1. Matryoshka Quantization from @GoogleDeepMind
Introduces MatQuant, a multi-scale quantization method that mixes int2, int4, and int8 layers for efficient model deployment
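A small sketch of the nested-integer idea, under the assumption that lower precisions are read off as the most significant bits of the int8 code; the training recipe that keeps every precision level accurate is the paper's actual contribution and is not shown here.

```python
import torch

def quantize_int8(w):
    scale = w.abs().max() / 127.0
    return torch.clamp((w / scale).round(), -128, 127).to(torch.int8), scale

def slice_bits(q_int8, bits):
    """Keep only the top `bits` bits of the int8 code: a coarser, nested quantization level."""
    step = 2 ** (8 - bits)
    return torch.div(q_int8.to(torch.int32), step, rounding_mode="floor") * step

w = torch.randn(4, 4)
q8, scale = quantize_int8(w)
w4 = slice_bits(q8, 4).float() * scale    # int4-resolution reconstruction nested inside the int8 one
w2 = slice_bits(q8, 2).float() * scale    # int2-resolution reconstruction from the same weights
```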
1. Model Distillation guide from @OpenAI
2. Knowledge Distillation tutorial by @PyTorch
3. Jetson Introduction to Knowledge Distillation by @nvidia
4. Tutorial on Knowledge Distillation with @kerasteam
5. @huggingface's guides:
- Knowledge Distillation
- Knowledge Distillation for Computer Vision
Save the link and check out the links below 👇
1. Model Distillation guide from @OpenAI
Explains this process step by step, including:
- storing outputs from a large model
- evaluating both large and small models
- creating training data for a small model
- assessing the fine-tuned small model
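A hedged sketch of that workflow with hypothetical placeholders (`teacher_generate`, `student_finetune`, and `evaluate` are not OpenAI API calls): store the large model's answers as prompt/completion pairs, fine-tune the small model on them, and score both models on a held-out set.

```python
import json

def build_distillation_set(prompts, teacher_generate, path="distill.jsonl"):
    with open(path, "w") as f:
        for p in prompts:
            completion = teacher_generate(p)                 # 1. store outputs from the large model
            f.write(json.dumps({"prompt": p, "completion": completion}) + "\n")
    return path

def distill(prompts, eval_set, teacher_generate, student_finetune, evaluate):
    data = build_distillation_set(prompts, teacher_generate)
    teacher_score = evaluate(teacher_generate, eval_set)     # 2. evaluate the large model as the reference
    student = student_finetune(data)                         # 3. fine-tune the small model on the stored pairs
    student_score = evaluate(student, eval_set)              # 4. assess the fine-tuned small model
    return student, {"teacher": teacher_score, "student": student_score}
```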
Distillation involves using a large teacher model to train a smaller student one.
But can we predict a distilled model’s performance based on teacher quality, student size, data volume, etc.?
@Apple and @UniofOxford explored this and developed distillation scaling laws.
Here are the key takeaways👇
1. A good teacher doesn’t always mean a better student:
If a teacher is too strong, the student might struggle to learn from it, leading to worse performance.
This is called the capacity gap — when the student isn’t powerful enough to properly mimic the teacher.
Feb 10 • 23 tweets • 13 min read
The freshest AI/ML research of the week:
Our top 4
▪️ AlphaGeometry2
▪️ ZebraLogic
▪️ Limo: Less is More for Reasoning
▪️ Great Models Think Alike and this Undermines AI Oversight
▪️ Activation-Informed Merging of LLMs
▪️ Content-Format Integrated Prompt Optimization (CFPO)
▪️ BOLT: Bootstrapping Long Chain-of-Thought
▪️ Token Assorted: Mixing Latent & Text Tokens
▪️ ScoreFlow
▪️ The Jumping Reasoning Curve?
▪️ Demystifying Long Chain-of-Thought Reasoning in LLMs
▪️ MAGA
▪️ ParetoQ: Scaling Laws in Extremely Low-Bit LLM Quantization
▪️ Analyze Feature Flow to Enhance Interpretation and Steering in LMs
▪️ PILAF
▪️ DuoGuard
▪️ Limitations of LLMs in Clinical Problem-Solving
▪️ AI and Legal Analysis
▪️ HackerRank-ASTRA
▪️ The Open-Source Advantage in LLMs
▪️ UltraIF: Advancing Instruction-Following
🧵 1. AlphaGeometry2 (Olympiad Geometry Solver) from @GoogleDeepMind
Enhances AlphaGeometry to solve IMO-level geometry problems with a broader formal language
Sliding Tile Attention (STA) speeds up video generation by up to 3.53x.
It focuses only on small, relevant regions at a time and moves across the video in a sliding pattern.
STA processes larger chunks (tiles) at once, making it faster and more hardware-efficient.
Here's how it works:
First, what's wrong with current methods?
3D attention, which is generally used in Diffusion Transformers (DiTs), processes all video frames at once and treats every pixel separately, which takes up a huge amount of computing power, about 70% of the total.
The problem with traditional Sliding Window Attention (SWA) is that it creates "mixed blocks," which are inefficient for GPUs.
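A sketch of how a tile-level mask avoids those mixed blocks: video tokens are grouped into (t, h, w) tiles and each query tile attends only to key tiles inside a local 3D window, so every kept block is fully dense. The tile-grid and window sizes below are assumptions, not the paper's settings.

```python
import torch

def tile_window_mask(grid=(8, 8, 8), window=(3, 3, 3)):
    """grid: number of tiles along (t, h, w); window: odd local window per axis, in tiles."""
    coords = torch.stack(torch.meshgrid(
        *[torch.arange(g) for g in grid], indexing="ij"), dim=-1).reshape(-1, 3)   # tile coordinates
    diff = (coords[:, None, :] - coords[None, :, :]).abs()                         # pairwise tile offsets
    radius = torch.tensor([w // 2 for w in window])
    return (diff <= radius).all(dim=-1)             # [n_tiles, n_tiles]; True = this tile pair attends

mask = tile_window_mask()
print(mask.shape, mask.float().mean())              # fraction of tile pairs kept, i.e. the attention sparsity
```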