Our top 9
▪️ SigLIP 2
▪️ Intuitive Physics Understanding Emerges from Self-Supervised Pretraining on Natural Videos
▪️ Native Sparse Attention
▪️ OctoTools
▪️ ReLearn
▪️ On the Trustworthiness of Generative Foundation Models
▪️ S* Test Time Scaling for Code Generation
▪️ Autellix (Serving Engine for LLM Agents)
▪️ Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
▪️ SurveyX
▪️ From RAG to Memory: Non-Parametric Continual Learning for LLMs
▪️ How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
▪️ Train Small, Infer Large
▪️ Eager Updates for Overlapped Communication and Computation in DiLoCo
▪️ S^2R: Teaching LLMs to Self-verify and Self-correct via RL
▪️ Logic-RL
▪️ Discovering Highly Efficient Low-Weight Quantum Error-Correcting Codes with RL
▪️ Armap
▪️ Thinking Preference Optimization
▪️ Rethinking Diverse Human Preference Learning through Principal Component Analysis
▪️ Craw4LLM
▪️ LLMs and Mathematical Reasoning Failures
▪️ Small Models Struggle to Learn from Strong Reasoners
▪️ Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options
▪️ LM2:
- Uses a Transformer architecture with a memory module to improve long-context reasoning (a generic sketch of such a block follows this list).
- Outperforms RMT by 37.1% and excels in multi-hop inference.
▪️ NatureLM:
- Is trained across scientific domains.
- Enhances tasks like SMILES-to-IUPAC translation and CRISPR RNA design for cross-domain applications.
▪️ Goedel-Prover:
- Advances formal proof generation.
- Achieves 57.6% Pass@32 on miniF2F using expert iteration and statement formalizers.
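To make the "memory module" idea above concrete, here is a generic memory-augmented Transformer block: tokens read from a learned memory bank via cross-attention, and a gate controls how much of the read result flows back in. This is illustrative only; LM2's actual architecture, gating, and memory-update rules differ, and all names and dimensions below are placeholders.

```python
import torch
import torch.nn as nn

class MemoryAugmentedBlock(nn.Module):
    """Illustrative block: self-attention plus a gated read from a learned memory bank."""
    def __init__(self, d_model=512, n_heads=8, n_mem_slots=64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.read_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.memory = nn.Parameter(torch.randn(1, n_mem_slots, d_model) * 0.02)
        self.gate = nn.Linear(d_model, d_model)  # controls how much memory is injected

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        b = x.size(0)
        h, _ = self.self_attn(x, x, x)         # ordinary self-attention over tokens
        mem = self.memory.expand(b, -1, -1)    # shared memory bank, broadcast to the batch
        read, _ = self.read_attn(h, mem, mem)  # tokens query the memory slots
        g = torch.sigmoid(self.gate(h))        # gated injection of memory content
        return h + g * read

# quick shape check
block = MemoryAugmentedBlock()
print(block(torch.randn(2, 128, 512)).shape)  # torch.Size([2, 128, 512])
```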
Find the links below👇
1. LM2: Large Memory Models by Convergence Labs Ltd.
Our top 7
▪️ Matryoshka Quantization
▪️ LLM Pretraining with Continuous Concepts
▪️ LLMs can easily learn to reason from demonstrations
▪️ Forget what you know about LLMs evaluations – LLMs are like a chameleon
▪️ Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
▪️ Hephaestus
▪️ SynthDetoxM Dataset
▪️ The Curse of Depth in LLMs
▪️ InfiniteHiP
▪️ Distillation Scaling Laws
▪️ TransMLA: Multi-Head Latent Attention
▪️ Logical reasoning in LLMs: A survey
▪️ ReasonFlux
▪️ How Stanford’s s1 surpasses DeepSeek-R1
▪️ The Stochastic Parrot on LLM’s Shoulder
▪️ Training LMs for Social Deduction with Multi-Agent RL
▪️ Towards Internet-scale training for agents
▪️ WorldGUI
▪️ CoSER: Coordinating LLM-Based Persona Simulation
▪️ Scaling Pre-training to One Hundred Billion Data for VLMs
▪️ Adapting Language-Specific LLMs to Reasoning Models
🧵
1. Matryoshka Quantization from @GoogleDeepMind
Introduces MatQuant, a multi-scale quantization method that nests int2 and int4 representations inside a model's int8 weights, so layers at different precisions can be mixed and matched for efficient deployment.
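To make the "nested" idea concrete, here is a rough sketch of the trick as we understand it: the most-significant bits of an int8 code can double as int4 and int2 codes. This is not the paper's implementation (MatQuant additionally co-trains the model so the sliced precisions stay accurate, and re-tunes scales per precision); the functions and scales below are illustrative.

```python
import numpy as np

def quantize_int8(w, n_bits=8):
    """Uniform symmetric quantization of a float tensor to signed integer codes."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    codes = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return codes, scale

def slice_msb(codes_int8, target_bits):
    """Keep only the `target_bits` most-significant bits of each int8 code
    (arithmetic right shift; a real scheme also handles rounding and scale retuning)."""
    return codes_int8 >> (8 - target_bits)

def dequantize(codes, scale, n_bits):
    # Sliced codes represent coarser steps, so the effective step size grows by 2^(8 - n_bits).
    return codes.astype(np.float32) * scale * (2 ** (8 - n_bits))

w = np.random.randn(4, 4).astype(np.float32)
c8, s = quantize_int8(w)
w8 = dequantize(c8, s, 8)                 # finest reconstruction (int8)
w4 = dequantize(slice_msb(c8, 4), s, 4)   # coarser int4 view of the same weights
w2 = dequantize(slice_msb(c8, 2), s, 2)   # coarsest int2 view
print(np.abs(w - w8).mean(), np.abs(w - w4).mean(), np.abs(w - w2).mean())
```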
1. Model Distillation guide from @OpenAI
2. Knowledge Distillation tutorial by @PyTorch
3. Jetson Introduction to Knowledge Distillation by @nvidia
4. Tutorial on Knowledge Distillation with @kerasteam
5. @huggingface's guides:
- Knowledge Distillation
- Knowledge Distillation for Computer Vision
Save this list and check out the links below 👇
1. Model Distillation guide from @OpenAI
Explains this process step by step (sketched below), including:
- storing outputs from a large model
- evaluating both the large and small models
- creating training data for the small model
- assessing the fine-tuned small model
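A minimal sketch of that workflow, assuming the openai Python SDK; model names, prompts, and file names are placeholders, and the official guide's stored-completions route differs in detail.

```python
import json
from openai import OpenAI

client = OpenAI()
prompts = ["Summarize: ...", "Classify the sentiment: ..."]  # placeholder task inputs

# 1) Store outputs from the large "teacher" model
teacher_rows = []
for p in prompts:
    resp = client.chat.completions.create(
        model="gpt-4o",  # teacher model (placeholder name)
        messages=[{"role": "user", "content": p}],
    )
    teacher_rows.append({"messages": [
        {"role": "user", "content": p},
        {"role": "assistant", "content": resp.choices[0].message.content},
    ]})

# 2) Write the teacher's outputs as fine-tuning data for the small model
with open("distill.jsonl", "w") as f:
    for row in teacher_rows:
        f.write(json.dumps(row) + "\n")

# 3) Fine-tune the small "student" model on the teacher's outputs
file_id = client.files.create(file=open("distill.jsonl", "rb"), purpose="fine-tune").id
job = client.fine_tuning.jobs.create(training_file=file_id, model="gpt-4o-mini")  # student (placeholder)
print(job.id)

# 4) Afterwards, evaluate the teacher, the base student, and the fine-tuned student
#    on the same held-out prompts to confirm the distilled model closes the gap.
```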
2. Knowledge Distillation tutorial by @PyTorch covers:
• Extracting hidden representations for further calculations
• Modifying PyTorch training loops to include additional losses
• Enhancing lightweight models using complex models as teachers
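To make the "additional losses" step concrete, here is a standard knowledge-distillation loss (Hinton-style soft targets) inside a PyTorch training step. The models, data, optimizer, and hyperparameters are assumed to exist and are not taken from the tutorial itself.

```python
import torch
import torch.nn.functional as F

def kd_step(student, teacher, x, labels, optimizer, T=2.0, alpha=0.5):
    """One training step combining hard-label cross-entropy with a soft-target KL term."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)            # teacher provides soft targets
    student_logits = student(x)

    # Hard-label loss: the ordinary supervised objective
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label loss: KL between temperature-softened distributions,
    # scaled by T^2 as in the standard formulation
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    loss = alpha * ce + (1 - alpha) * kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```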
Distillation involves using a large teacher model to train a smaller student one.
But can we predict a distilled model’s performance based on teacher quality, student size, data volume, etc.?
@Apple and @UniofOxford explored this and developed distillation scaling laws.
Here are the key takeaways👇
1. A good teacher doesn’t always mean a better student:
If a teacher is too strong, the student might struggle to learn from it, leading to worse performance.
This is called the capacity gap — when the student isn’t powerful enough to properly mimic the teacher.
2. Distillation scaling law predicts how well a student model will perform based on three key factors:
- Student model's size
- The number of training tokens
- The teacher’s size and quality
The law follows a power-law relationship: performance improves in a predictable way as these factors scale, but only up to a point, after which adding more resources brings diminishing returns.
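For intuition only, here is what a power-law-with-a-floor looks like in code. The functional form and coefficients below are made-up placeholders, not the paper's fitted law; they only illustrate the "predictable gains that flatten out" behavior.

```python
def illustrative_student_loss(n_student, d_tokens, teacher_floor,
                              a=400.0, alpha=0.3, b=900.0, beta=0.3):
    """Toy shape: each term shrinks as a power of student size / distillation tokens,
    so gains are predictable but flatten, approaching a teacher-dependent floor.
    Coefficients are arbitrary placeholders, not fitted values."""
    return teacher_floor + a / n_student**alpha + b / d_tokens**beta

for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params ->", round(illustrative_student_loss(n, d_tokens=1e11, teacher_floor=2.0), 3))
```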
Sliding Tile Attention (STA) speeds up video generation by up to 3.53x.
It focuses only on small, relevant regions at a time and moves across the video in a sliding pattern.
STA processes larger chunks (tiles) at once, making it faster and more hardware-efficient.
Here's how it works:
First, what's wrong with current methods?
3D attention, which is generally used in Diffusion Transformers (DiTs), processes all video frames at once, treating every pixel separately. This eats up a huge amount of computing power, about 70% of the total compute.
The problem with traditional Sliding Window Attention (SWA) is that it creates "mixed blocks," which are inefficient for GPUs.
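A hedged 1-D toy example of the tile idea (the real STA operates on 3-D video latents with fused GPU kernels): a per-token sliding window yields partially-masked "mixed" blocks, while sliding at tile granularity keeps every block fully attended or fully skipped, which is what block-sparse kernels want. All sizes below are arbitrary.

```python
import torch

def token_sliding_mask(n, window):
    """Classic per-token sliding window: creates mixed blocks at window edges."""
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= window

def tile_sliding_mask(n, tile, window_tiles):
    """Slide at tile granularity: every (tile x tile) block is all-on or all-off."""
    tiles = torch.arange(n) // tile
    return (tiles[None, :] - tiles[:, None]).abs() <= window_tiles

def mixed_blocks(mask, tile):
    """Count blocks that are only partially attended (wasted work for block kernels)."""
    nb = len(mask) // tile
    blocks = mask.reshape(nb, tile, nb, tile).permute(0, 2, 1, 3).reshape(nb, nb, -1)
    frac = blocks.float().mean(-1)
    return ((frac > 0) & (frac < 1)).sum().item()

n, tile = 16, 4
swa = token_sliding_mask(n, window=4)
sta = tile_sliding_mask(n, tile=tile, window_tiles=1)
print("mixed blocks, token-level SWA:", mixed_blocks(swa, tile))  # > 0: inefficient
print("mixed blocks, tile-level STA: ", mixed_blocks(sta, tile))  # 0: dense or skipped
```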