Overview:
BitNet b1.58 2B4T is a native 1-bit LLM with 2 billion parameters trained on 4 trillion tokens, matching the performance of comparable full-precision LLMs on tasks like language understanding and reasoning.
This 1-bit architecture delivers substantial gains in computational efficiency: a smaller memory footprint, lower energy consumption, and faster decoding latency.
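Under the hood, the 1.58-bit recipe constrains every weight to {-1, 0, +1}. Below is a minimal sketch of the absmean ternary quantizer that BitNet-style models apply to their weight matrices; the function name is ours, and the released implementation adds training details (e.g., a straight-through estimator for gradients) omitted here.

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with a per-tensor scale (sketch)."""
    scale = w.abs().mean().clamp(min=eps)   # absmean scale over the full tensor
    w_q = (w / scale).round().clamp(-1, 1)  # ternary weights
    return w_q, scale                       # dequantize as w_q * scale
```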
- Inference-Time Scaling for Generalist Reward Modeling
- Multi-Token Attention
- Why do LLMs attend to the first token?
- Command A
- LLMs Pass the Turing Test
- Advances and Challenges in Foundation Agents
- PaperBench
- Effectively Controlling Reasoning Models through Thinking Intervention
- TransMamba
- Open-Reasoner-Zero
- Scaling Tool-Integrated RL
- Scaling Language-Free Visual Representation Learning
- Output Constraints as Attack Surface
- Large (Vision) Language Models are Unsupervised In-Context Learners
- Memorizing is Not Enough
- ShortV
- MegaScale-Infer
- What the F*ck Is Artificial General Intelligence?
- Prompting Forgetting
- Enlightenment Period Improving DNN Performance
overview for each + authors' explanations
Inference-Time Scaling for Generalist Reward Modeling
Overview:
This work from DeepSeek investigates inference-time scalability for generalist reward modeling (RM) in LLMs, utilizing pointwise generative reward modeling (GRM) for flexibility.
It introduces Self-Principled Critique Tuning (SPCT), an online RL method, to train DeepSeek-GRM models that adaptively generate principles and critiques for improved reward accuracy.
To enhance inference-time scaling, the study employs parallel sampling guided by a meta RM, showing markedly better quality and scalability on various RM benchmarks than existing methods, with gains that can exceed those of training-time scaling.
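Concretely, inference-time scaling here means sampling several principle-plus-critique generations in parallel and letting a meta RM decide which sampled rewards to aggregate. A hedged sketch of that loop, where `grm.generate`, `meta_rm.score`, and the keep-then-average rule are our assumed interfaces, not the paper's exact procedure:

```python
def scaled_reward(query, response, grm, meta_rm, k=8, keep=4):
    """Meta-RM-guided parallel sampling for a generative reward model (sketch)."""
    # sample k principle+critique generations (run in parallel in practice)
    samples = [grm.generate(query, response) for _ in range(k)]
    # the meta RM scores the quality of each generated critique
    ranked = sorted(samples,
                    key=lambda s: meta_rm.score(query, response, s.text),
                    reverse=True)
    # aggregate the pointwise rewards of the top-scoring samples
    kept = ranked[:keep]
    return sum(s.reward for s in kept) / len(kept)
```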
- Transformers without Normalization
- Block Diffusion
- Compute Optimal Scaling of Skills
- DAPO: An OS LLM RL System at Scale
- Teaching LLMs How to Learn with Contextual Fine-Tuning
- GR00T N1
- Why the Brain Cannot Be a Digital Computer
- RWKV-7 "Goose" with Expressive Dynamic State Evolution
- Why Do Multi-Agent LLM Systems Fail?
- Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
- Light-R1
- Where do Large Vision-Language Models Look at when Answering Questions?
- Improving Planning of Agents for Long-Horizon Tasks
- UniCombine
- How much do LLMs learn from negative examples?
- Tokenize Image as a Set
- Search-R1
- Measuring AI Ability to Complete Long Tasks
- Does Your VLM Get Lost in the Long Video Sampling Dilemma?
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Personalize Anything for Free with Diffusion Transformer
- The KoLMogorov Test: Compression by Code Generation
- Optimizing ML Training with Metagradient Descent
overview for each + authors' explanations
Transformers without Normalization
Overview:
Transformers can match or surpass the performance of their normalized counterparts using a simple technique called Dynamic Tanh (DyT), which replaces normalization layers with an element-wise operation inspired by the tanh-like input-output mappings observed in layer norm; the result is validated across a range of computer vision and LLM tasks.
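The replacement layer is tiny. A minimal PyTorch sketch of DyT as the paper describes it, with a learnable scalar α and the usual per-channel affine parameters:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: a drop-in replacement for LayerNorm/RMSNorm."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))    # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))    # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```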
- LLM Pretraining with Continuous Concepts
- Distillation Scaling Laws
- Can 1B LLM Surpass 405B LLM?
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
- Emergent Response Planning in LLM
- Improving Existing Optimization Algorithms with LLMs
- Training Language Models for Social Deduction with Multi-Agent RL
- Multi-Head Latent Attention Is All You Need
- Generative Modeling with Bayesian Sample Inference
- Scaling Pre-training to One Hundred Billion Data for Vision Language Models
- NatureLM: Deciphering the Language of Nature for Scientific Discovery
- Competitive Programming with Large Reasoning Models
- Matryoshka Quantization
overview for each + authors' explanations
LLM Pretraining with Continuous Concepts
Overview:
CoCoMix, a pretraining framework, combines standard next-token prediction with continuous concepts extracted by a pretrained sparse autoencoder, mixing them into the model's hidden state by interleaving them with token hidden representations.
This approach improves the sample efficiency of LLMs and consistently surpasses next-token prediction, knowledge distillation, and pause-token insertion across language modeling and reasoning tasks.
The integration of concept learning also enhances model interpretability and steerability.
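A hedged sketch of the mixing step under our own interface assumptions: a linear head predicts SAE concepts from the hidden state, the prediction is projected into a single continuous concept vector, and that vector is interleaved with the token hidden states (the paper's exact layer placement and losses differ in detail).

```python
import torch
import torch.nn as nn

class ConceptMixer(nn.Module):
    """Sketch of CoCoMix-style concept mixing; interfaces are assumptions."""
    def __init__(self, hidden_dim: int, num_concepts: int):
        super().__init__()
        self.concept_head = nn.Linear(hidden_dim, num_concepts)  # predict SAE concepts
        self.project = nn.Linear(num_concepts, hidden_dim)       # concepts -> hidden size

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, hidden)
        concept_logits = self.concept_head(h)   # trained against SAE concept labels
        c = self.project(concept_logits)        # continuous concept vectors
        # interleave token states and concept vectors: h_1, c_1, h_2, c_2, ...
        mixed = torch.stack([h, c], dim=2).flatten(1, 2)  # (batch, 2*seq, hidden)
        return mixed, concept_logits
```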
- Demystifying Long Chain-of-Thought Reasoning in LLMs
- OmniHuman-1
- LIMO
- s1: Simple test-time scaling
- Process Reinforcement through Implicit Rewards
- Iterate to Accelerate
- Efficient Reasoning with Hidden Thinking
- Fully Autonomous AI Agents Should Not be Developed
- DeepRAG
- Scalable-Softmax Is Superior for Attention
- The Differences Between Direct Alignment Algorithms are a Blur
- Preference Leakage
- SafeRAG
- Analyze Feature Flow to Enhance Interpretation and Steering in LMs
- Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
- ConceptAttention
- Weak-to-Strong Diffusion with Reflection
- Great Models Think Alike and this Undermines AI Oversight
- SmolLM2
- Inverse Bridge Matching Distillation
- Rethinking Mixture-of-Agents
overview for each + authors' explanations
Demystifying Long Chain-of-Thought Reasoning in LLMs
Overview:
Long Chain-of-Thought reasoning in LLMs, which enables strategies like backtracking and error correction, is significantly enhanced by scaling inference compute, and the process can be optimized through reinforcement learning.
Although supervised fine-tuning (SFT) simplifies training and improves efficiency, scaling verifiable reward signals, including those mined from noisy, web-extracted solutions, proves critical for RL, especially on out-of-distribution tasks such as STEM reasoning.
Basic capabilities such as error correction are present in base models, but incentivizing these skills for complex tasks via RL demands substantial compute, and measuring how such capabilities develop requires careful evaluation design.
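For a concrete sense of what "verifiable reward" means here, a toy rule-based checker of the kind such RL pipelines scale; the \boxed{} extraction and exact-match rule are our simplifications:

```python
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Toy verifiable reward: 1.0 if the boxed final answer matches, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```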
- Do generative video models learn physical principles from watching videos?
- Transformer^2: Self-adaptive LLMs
- MiniMax-01
- The Lessons of Developing Process Reward Models in Mathematical Reasoning
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
- Critical Tokens Matter
- Distilling Multi-modal Large Language Models for Autonomous Driving
- OmniThink
- Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
- MangaNinja
- Diffusion Adversarial Post-Training for One-Step Video Generation
overview for each + authors' explanations
Do generative video models learn physical principles from watching videos?
Overview:
This paper investigates whether generative video models learn physical principles by introducing Physics-IQ, a benchmark requiring an understanding of physics like fluid dynamics and magnetism.
Although some models, such as Sora and VideoPoet, can solve specific test cases, their overall physical understanding is limited, suggesting that visual realism does not equate to comprehension of physical laws.
- OpenAI o1 System Card
- PaliGemma 2
- HunyuanVideo
- Densing Law of LLMs
- DeMo: Decoupled Momentum Optimization
- o1-Coder
- Reverse Thinking Makes LLMs Stronger Reasoners
- Efficient Track Anything
- NVILA: Efficient Frontier VLMs
- Agent Skill Acquisition for LLMs via CycleQD
- A Noise is Worth Diffusion Guidance
- VisionZip: Longer is Better but Not Necessary in VLMs
- Infinity: Scaling Bitwise AutoRegressive Modeling for High-Res Image Synthesis
- Evaluating Language Models as Synthetic Data Generators
- Critical Tokens Matter
- SNOOPI
- TokenFlow
- MALT: Improving Reasoning with Multi-Agent LLM Training
- X-Prompt
- Video Depth without Video Models
- GRAPE: Generalizing Robot Policy via Preference Alignment
- Beyond Examples
- Scaling Transformers for Low-Bitrate High-Quality Speech Coding
- Retrieval-Augmented Reasoning Enhancement for LLMs
- Best-of-N Jailbreaking
- Composition of Experts
- Mind the Gap: Examining the Self-Improvement Capabilities of LLMs
- Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation
- Large-Scale T2I Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
- JetFormer
- Proactive Agent
- Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
- Distillation-Based NAS for Inference-Optimized LLMs
- Navigation World Models
overview for each + authors' explanations
OpenAI o1 System Card
OpenAI is currently hosting an event called 12 Days of OpenAI. openai.com/12-days/
On day 1, they released the full version of OpenAI o1, along with a $200/mo tier with uncapped usage.
About the technical report:
The o1 model series employs large-scale reinforcement learning with chain-of-thought reasoning, achieving state-of-the-art safety performance by mitigating risks like generating illicit advice, choosing biased responses, and resisting jailbreaks.
By reasoning about safety policies in context, these models enhance robustness but also introduce risks tied to advanced intelligence, emphasizing the need for rigorous alignment, stress testing, and risk management.
The report's analysis centers on safety evaluations, external red teaming, and Preparedness Framework assessments.
- Sparse Crosscoders
- Rethinking Softmax
- Mechanistic Unlearning
- Decomposing The Dark Matter of Sparse Autoencoders
- ZIP-FIT
- Automatically Interpreting Millions of Features in Large Language Models
- Breaking the Memory Barrier
- Can Knowledge Editing Really Correct Hallucinations?
- Framer: Interactive Frame Interpolation
- Beyond position
- A Hitchhiker's Guide to Scaling Law Estimation
- Scaling up Masked Diffusion Models on Text
- Why Does the Effective Context Length of LLMs Fall Short?
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models
- Improve Vision Language Model Chain-of-thought Reasoning
- PyramidDrop
- FrugalNeRF
- SAM2Long
- SeerAttention
- FiTv2
overview for each + authors' explanations
Sparse Crosscoders for Cross-Layer Features and Model Diffing
Overview:
This research introduces "sparse crosscoders," a tool that tracks shared features across layers in neural networks, simplifying feature analysis and model comparisons.
Crosscoders support long-term feature tracking, streamline circuit analysis by removing redundant features, and detect unique model differences, aiding in fine-tuning and architecture studies.
Early results show they outperform per-layer methods in capturing cross-layer structures, though with higher computational cost.
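The core objective is easy to state: one shared sparse code is read from several layers' activations at the same token and asked to reconstruct every layer, with per-layer encoder and decoder weights. A hedged sketch (shapes and the sparsity weighting are our simplifications of the write-up):

```python
import torch
import torch.nn as nn

class SparseCrosscoder(nn.Module):
    """Sketch of a sparse crosscoder over num_layers layers (simplified)."""
    def __init__(self, num_layers: int, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Parameter(torch.randn(num_layers, d_model, n_features) * 0.01)
        self.dec = nn.Parameter(torch.randn(num_layers, n_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))

    def forward(self, acts: torch.Tensor):
        # acts: (batch, num_layers, d_model) -- one token's activations per layer
        f = torch.relu(torch.einsum("bld,ldf->bf", acts, self.enc) + self.b_enc)
        recon = torch.einsum("bf,lfd->bld", f, self.dec)  # per-layer reconstruction
        return recon, f

def crosscoder_loss(acts, recon, f, dec, l1_coef=1e-3):
    mse = (acts - recon).pow(2).sum(dim=(1, 2)).mean()
    dec_norms = dec.norm(dim=-1).sum(dim=0)   # decoder norms summed across layers
    l1 = (f * dec_norms).sum(dim=-1).mean()   # norm-weighted sparsity penalty
    return mse + l1_coef * l1
```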
🚨 This week's top AI/ML research papers:
LM
- Q-Sparse
- SpreadsheetLLM (MSFT)
- Questionable practices in machine learning
- Accuracy is Not All You Need (MSFT)
- Qwen2 Technical Report
- Does Refusal Training in LLMs Generalize to the Past Tense?
- Prover-Verifier Games improve legibility of LLM outputs (OpenAI)
- Scaling Laws with Vocabulary
- Transformer Layers as Painters (Sakana AI)
- GoldFinch (EleutherAI)
- AgentPoison
- NeedleBench
- Human-like Episodic Memory for Infinite Context LLMs
- Weak-to-Strong Reasoning
- Implicit meta-learning may lead language models to trust more reliable sources
AI Gen
- Shape of Motion
- Splatfacto-W
- Scaling Diffusion Transformers to 16 Billion Parameters
- Qwen2-Audio Technical Report
- JASCO (Meta)
overview for each & authors' explanations
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
Overview:
Q-Sparse is an effective method for training sparsely-activated LLMs: it achieves full sparsity of activations by applying top-K sparsification and using the straight-through estimator during training.
Q-Sparse matches baseline LLMs at greater efficiency, establishes an inference-optimal scaling law for sparsely-activated LLMs, and works across a variety of training scenarios.
Additionally, Q-Sparse works for both full-precision and 1-bit LLMs, with notable synergy when combined with BitNet b1.58, paving the way for more efficient and cost-effective future LLMs.
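The central trick is compact: keep only the top-K activations by magnitude in the forward pass, but let gradients flow densely. A minimal sketch of that op (the paper applies it to the inputs of each linear layer; this standalone function is our framing):

```python
import torch

def topk_ste(x: torch.Tensor, k: int) -> torch.Tensor:
    """Top-K activation sparsification with a straight-through estimator.

    Forward: zero all but the k largest-magnitude entries along the last dim.
    Backward: pass gradients through as if no masking happened.
    """
    idx = x.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    sparse = x * mask
    return x + (sparse - x).detach()  # forward == sparse, gradient == identity
```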
📈 Top AI/ML research papers (week June 23-30) with overview for each & authors' explanations:
- Gemma 2
- The FineWeb Datasets
- Adam-mini
- One Thousand and One Pairs
- LLMs' Classification Performance is Overclaimed
- GraphReader
- Cambrian-1
- LongRAG
- MUMU
- EAGLE-2
- WARP
- Optimised Grouped-Query Attention Mechanism for Transformers
- Octo-planner
- Step-DPO
- OMG-LLaVA
- Following Length Constraints in Instructions
- BigCodeBench
- DreamBench++
- YouDream
- Fantastic Copyrighted Beasts and How (Not) to Generate Them
Gemma 2: Improving Open Language Models at a Practical Size
Overview:
Gemma 2 introduces lightweight, state-of-the-art open models ranging from 2 to 27 billion parameters, with the 9B and 27B sizes available now. Key updates include interleaved local-global attention and grouped-query attention.
Using knowledge distillation for training, the models perform exceptionally well for their size, even competing with much larger models.
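Since the smaller Gemma 2 models learn from a larger teacher rather than from one-hot next-token targets, the training objective is essentially a distribution-matching loss. A minimal sketch of such a distillation objective; the temperature and reduction are our assumptions, not Gemma 2's exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL from teacher to student over next-token distributions (sketch).

    Both logits tensors: (batch, seq, vocab).
    """
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1)  # KL per token position
    return kl.mean() * temperature ** 2
```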