Apr 21
🚨This week's top AI/ML research papers:

- BitNet b1.58 2B4T Technical Report
- Reasoning Models Can Be Effective Without Thinking
- ReTool
- Sleep-time Compute
- Nemotron-H
- Kimina-Prover Preview
- CLIMB
- Dynamic Cheatsheet
- How new data permeates LLM knowledge and how to dilute it
- InternVL3
- MIEB
- REPA-E
- Seedream 3.0 Technical Report
- Looking beyond the next token
- DataDecide
- Autoregressive Distillation of Diffusion Transformers
- Perception Encoder
- M1: Mamba Reasoning Models
- d1: Reasoning in Diffusion LLMs
- Antidistillation Sampling

overview for each + authors' explanations
read this in thread mode for the best experience

BitNet b1.58 2B4T Technical Report

Author's Explanation:
x.com/realHongyu_Wan…

Overview:
BitNet b1.58 2B4T is a native 1-bit LLM with 2 billion parameters trained on 4 trillion tokens, matching the performance of comparable full-precision LLMs on tasks like language understanding and reasoning.

This 1-bit architecture delivers substantial gains in computational efficiency: a smaller memory footprint, lower energy consumption, and lower decoding latency.
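
A minimal sketch of the absmean ternary quantization behind the "1.58-bit" weights, assuming a straight-through forward pass (the helper name is illustrative, not from the paper's code):

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Scale weights by their mean absolute value, then round each entry
    to the nearest value in {-1, 0, +1} (log2(3) ~= 1.58 bits per weight)."""
    scale = w.abs().mean().clamp(min=eps)
    return (w / scale).round().clamp(-1, 1) * scale  # dequantized ternary view

w = torch.randn(4, 4)
print(absmean_ternary(w).unique())  # at most three distinct values
```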

Paper:
arxiv.org/abs/2504.12285
Apr 6
🚨This week's top AI/ML research papers:

- Inference-Time Scaling for Generalist Reward Modeling
- Multi-Token Attention
- Why do LLMs attend to the first token?
- Command A
- LLMs Pass the Turing Test
- Advances and Challenges in Foundation Agents
- PaperBench
- Effectively Controlling Reasoning Models through Thinking Intervention
- TransMamba
- Open-Reasoner-Zero
- Scaling Tool-Integrated RL
- Scaling Language-Free Visual Representation Learning
- Output Constraints as Attack Surface
- Large (Vision) Language Models are Unsupervised In-Context Learners
- Memorizing is Not Enough
- ShortV
- MegaScale-Infer
- What the F*ck Is Artificial General Intelligence?
- Prompting Forgetting
- Enlightenment Period Improving DNN Performance

overview for each + authors' explanations
read this in thread mode for the best experience

Inference-Time Scaling for Generalist Reward Modeling

Author's Explanation:
x.com/tuzhaopeng/sta…

Overview:
This work from DeepSeek investigates inference-time scalability for generalist reward modeling (RM) in LLMs, utilizing pointwise generative reward modeling (GRM) for flexibility.

It introduces Self-Principled Critique Tuning (SPCT), an online RL method, to train DeepSeek-GRM models that adaptively generate principles and critiques for improved reward accuracy.

To enhance inference-time scaling, the study employs parallel sampling guided by a meta RM, demonstrating significantly better quality and scalability on various RM benchmarks than existing methods, with inference-time scaling potentially exceeding the gains from training-time scaling.
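
A hedged sketch of that meta-RM-guided parallel sampling step (the `grm` and `meta_rm` interfaces are assumptions for illustration): sample several principle-and-critique generations, keep those the meta RM rates as reliable, and aggregate the surviving scores.

```python
def scaled_reward(grm, meta_rm, query, response, k: int = 8, top_j: int = 4) -> float:
    # Each GRM sample generates principles, a critique, and a scalar score.
    samples = [grm.generate(query, response) for _ in range(k)]
    # The meta RM rates how trustworthy each generated judgment is.
    ranked = sorted(samples, key=lambda s: meta_rm.score(query, response, s),
                    reverse=True)
    kept = ranked[:top_j]
    return sum(s.score for s in kept) / len(kept)
```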

Paper:
arxiv.org/abs/2504.02495
Mar 22
🚨 The last 2 weeks' top AI/ML research papers:

- Transformers without Normalization
- Block Diffusion
- Compute Optimal Scaling of Skills
- DAPO: An OS LLM RL System at Scale
- Teaching LLMs How to Learn with Contextual Fine-Tuning
- GR00T N1
- Why the Brain Cannot Be a Digital Computer
- RWKV-7 "Goose" with Expressive Dynamic State Evolution
- Why Do Multi-Agent LLM Systems Fail?
- Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
- Light-R1
- Where do Large Vision-Language Models Look at when Answering Questions?
- Improving Planning of Agents for Long-Horizon Tasks
- UniCombine
- How much do LLMs learn from negative examples?
- Tokenize Image as a Set
- Search-R1
- Measuring AI Ability to Complete Long Tasks
- Does Your VLM Get Lost in the Long Video Sampling Dilemma?
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Personalize Anything for Free with Diffusion Transformer
- The KoLMogorov Test: Compression by Code Generation
- Optimizing ML Training with Metagradient Descent

overview for each + authors' explanations
read this in thread mode for the best experience

Transformers without Normalization

Author's Explanation:
x.com/liuzhuang1234/…

Overview:
Transformers can match or surpass the performance of their normalized counterparts using Dynamic Tanh (DyT), a simple element-wise operation that replaces normalization layers. DyT is inspired by the tanh-like input-output mappings that layer norm is observed to produce, and is validated across tasks in computer vision and LLMs.
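
The operation itself is small enough to show in full; a sketch of DyT as described in the paper, with a learnable scalar alpha and the usual affine parameters:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: gamma * tanh(alpha * x) + beta, a drop-in
    element-wise replacement for a normalization layer."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```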

Paper:
arxiv.org/abs/2503.10622
Feb 17
🚨This week's top AI/ML research papers:

- LLM Pretraining with Continuous Concepts
- Distillation Scaling Laws
- Can 1B LLM Surpass 405B LLM?
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
- Emergent Response Planning in LLM
- Improving Existing Optimization Algorithms with LLMs
- Training Language Models for Social Deduction with Multi-Agent RL
- Multi-Head Latent Attention Is All You Need
- Generative Modeling with Bayesian Sample Inference
- Scaling Pre-training to One Hundred Billion Data for Vision Language Models
- NatureLM: Deciphering the Language of Nature for Scientific Discovery
- Competitive Programming with Large Reasoning Models
- Matryoshka Quantization

overview for each + authors' explanations
read this in thread mode for the best experience

LLM Pretraining with Continuous Concepts

Author's Explanation:
x.com/jaseweston/sta…

Overview:
CoCoMix, a pretraining framework, combines standard next-token prediction with continuous concepts derived from a pretrained sparse autoencoder, mixing them into the model’s hidden state by interleaving them with token hidden representations.

This approach improves sample efficiency of LLMs and consistently surpasses next token prediction, knowledge distillation, and pause token insertion across language modeling and reasoning tasks.

The integration of concept learning also enhances model interpretability and steerability.
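
A hedged sketch of the mixing step (dimensions, the softmax compression, and all names are illustrative simplifications, not the paper's code): a linear head predicts concept activations, which are projected back to a continuous vector and interleaved with the token hidden states.

```python
import torch
import torch.nn as nn

class ConceptMixer(nn.Module):
    def __init__(self, d_model: int, n_concepts: int):
        super().__init__()
        self.concept_head = nn.Linear(d_model, n_concepts)  # trained against SAE targets
        self.concept_proj = nn.Linear(n_concepts, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) hidden states from an intermediate layer
        concepts = self.concept_head(h)                      # predicted concept scores
        cont = self.concept_proj(concepts.softmax(dim=-1))   # continuous concept vector
        # Interleave token and concept representations: t1, c1, t2, c2, ...
        return torch.stack([h, cont], dim=2).flatten(1, 2)   # (batch, 2*seq, d_model)
```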

Paper:
arxiv.org/abs/2502.08524
Feb 9
🚨This week's top AI/ML research papers:

- Demystifying Long Chain-of-Thought Reasoning in LLMs
- OmniHuman-1
- LIMO
- s1: Simple test-time scaling
- Process Reinforcement through Implicit Rewards
- Iterate to Accelerate
- Efficient Reasoning with Hidden Thinking
- Fully Autonomous AI Agents Should Not be Developed
- DeepRAG
- Scalable-Softmax Is Superior for Attention
- The Differences Between Direct Alignment Algorithms are a Blur
- Preference Leakage
- SafeRAG
- Analyze Feature Flow to Enhance Interpretation and Steering in LMs
- Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
- ConceptAttention
- Weak-to-Strong Diffusion with Reflection
- Great Models Think Alike and this Undermines AI Oversight
- SmolLM2
- Inverse Bridge Matching Distillation
- Rethinking Mixture-of-Agents

overview for each + authors' explanations
read this in thread mode for the best experience

Demystifying Long Chain-of-Thought Reasoning in LLMs

Author's Explanation:
x.com/xiangyue96/sta…

Overview:
Long Chain-of-Thought reasoning in LLMs, which enables strategies like backtracking and error correction, is significantly enhanced by scaling inference compute, and the process can be optimized through reinforcement learning.

Although supervised fine-tuning (SFT) simplifies training and improves efficiency, it is the scaling of verifiable reward signals, particularly through noisy, web-extracted solutions, that proves critical for RL, especially on out-of-distribution tasks such as STEM reasoning.

Basic capabilities such as error correction are present from the start, but incentivizing these skills on complex tasks via RL requires substantial compute, and measuring how such capabilities develop calls for a careful evaluation approach.
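
A minimal sketch of the kind of rule-based verifiable reward such RL pipelines scale up, assuming answers are emitted in a final \boxed{...} (the extraction pattern is illustrative):

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the completion's final boxed answer matches the reference."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference_answer.strip() else 0.0
```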

Paper:
arxiv.org/abs/2502.03373
Jan 20
🚨This week's top AI/ML research papers:

> Do generative video models learn physical principles from watching videos?
> Transformer^2: Self-adaptive LLMs
> MiniMax-01
> The Lessons of Developing Process Reward Models in Mathematical Reasoning
> Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
> Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
> Critical Tokens Matter
> Distilling Multi-modal Large Language Models for Autonomous Driving
> OmniThink
> Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
> MangaNinja
> Diffusion Adversarial Post-Training for One-Step Video Generation

overview for each + authors' explanations
read this in thread mode for the best experience

Do generative video models learn physical principles from watching videos?

Overview:
This paper investigates whether generative video models learn physical principles by introducing Physics-IQ, a benchmark requiring an understanding of physics like fluid dynamics and magnetism.

Although some models such as Sora and VideoPoet can solve specific test cases, their overall physical understanding is limited, suggesting that visual realism does not imply comprehension of physical laws.

Paper:
arxiv.org/abs/2501.09038
Dec 7, 2024
🚨This week’s top AI/ML research papers:

- OpenAI o1 System Card
- PaliGemma 2
- HunyuanVideo
- Densing Law of LLMs
- DeMo: Decoupled Momentum Optimization
- o1-Coder
- Reverse Thinking Makes LLMs Stronger Reasoners
- Efficient Track Anything
- NVILA: Efficient Frontier VLMs
- Agent Skill Acquisition for LLMs via CycleQD
- A Noise is Worth Diffusion Guidance
- VisionZip: Longer is Better but Not Necessary in VLMs
- Infinity: Scaling Bitwise AutoRegressive Modeling for High-Res Image Synthesis
- Evaluating Language Models as Synthetic Data Generators
- Critical Tokens Matter
- SNOOPI
- TokenFlow
- MALT: Improving Reasoning with Multi-Agent LLM Training
- X-Prompt
- Video Depth without Video Models
- GRAPE: Generalizing Robot Policy via Preference Alignment
- Beyond Examples
- Scaling Transformers for Low-Bitrate High-Quality Speech Coding
- Retrieval-Augmented Reasoning Enhancement for LLMs
- Best-of-N Jailbreaking
- Composition of Experts
- Mind the Gap: Examining the Self-Improvement Capabilities of LLMs
- Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation
- Large-Scale T2I Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
- JetFormer
- Proactive Agent
- Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
- Distillation-Based NAS for Inference-Optimized LLMs
- Navigation World Models

overview for each + authors' explanations
read this in thread mode for the best experience

OpenAI o1 System Card

Author’s Thread:
x.com/OpenAI/status/…

OpenAI is currently hosting an event called 12 Days of OpenAI. openai.com/12-days/

On day 1, they released the full version of OpenAI o1, along with a $200/mo tier with uncapped usage.

About the technical report:
The o1 model series employs large-scale reinforcement learning with chain-of-thought reasoning, achieving state-of-the-art safety performance by mitigating risks such as generating illicit advice or biased responses, and by resisting jailbreaks.

By reasoning about safety policies in context, these models enhance robustness but also introduce risks tied to advanced intelligence, emphasizing the need for rigorous alignment, stress testing, and risk management.

The report's analysis focuses on safety evaluations, external red teaming, and Preparedness Framework assessments.

Paper:
cdn.openai.com/o1-system-card…
Oct 26, 2024
🚨This week’s top AI/ML research papers:

- Sparse Crosscoders
- Rethinking Softmax
- Mechanistic Unlearning
- Decomposing The Dark Matter of Sparse Autoencoders
- ZIP-FIT
- Automatically Interpreting Millions of Features in Large Language Models
- Breaking the Memory Barrier
- Can Knowledge Editing Really Correct Hallucinations?
- Framer: Interactive Frame Interpolation
- Beyond position
- A Hitchhiker's Guide to Scaling Law Estimation
- Scaling up Masked Diffusion Models on Text
- Why Does the Effective Context Length of LLMs Fall Short?
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models
- Improve Vision Language Model Chain-of-thought Reasoning
- PyramidDrop
- FrugalNeRF
- SAM2Long
- SeerAttention
- FiTv2

overview for each + authors' explanations
read this in thread mode for the best experience

Sparse Crosscoders for Cross-Layer Features and Model Diffing

Author's Explanation:
x.com/AnthropicAI/st…

Overview:
This research introduces "sparse crosscoders," a tool that tracks shared features across layers in neural networks, simplifying feature analysis and model comparisons.

Crosscoders support long-term feature tracking, streamline circuit analysis by removing redundant features, and detect unique model differences, aiding in fine-tuning and architecture studies.

Early results show they outperform per-layer methods in capturing cross-layer structures, though with higher computational cost.
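
A hedged sketch of the core crosscoder computation (shapes and initialization are illustrative): one shared feature dictionary encodes from, and decodes to, every layer's activations at once, so a single feature can account for activity that spans layers.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, n_layers: int, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Parameter(torch.randn(n_layers, d_model, n_features) * 0.01)
        self.dec = nn.Parameter(torch.randn(n_features, n_layers, d_model) * 0.01)
        self.bias = nn.Parameter(torch.zeros(n_features))

    def forward(self, acts: torch.Tensor):
        # acts: (batch, n_layers, d_model) — one token's activations per layer
        feats = torch.relu(torch.einsum("bld,ldf->bf", acts, self.enc) + self.bias)
        recon = torch.einsum("bf,fld->bld", feats, self.dec)
        return feats, recon  # a sparsity penalty on feats goes in the training loss
```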

Blog:
transformer-circuits.pub/2024/crosscode…
Jul 20, 2024
🚨This week’s top AI/ML research papers:

LM
- Q-Sparse
- SpreadsheetLLM (MSFT)
- Questionable practices in machine learning
- Accuracy is Not All You Need (MSFT)
- Qwen2 Technical Report
- Does Refusal Training in LLMs Generalize to the Past Tense?
- Prover-Verifier Games improve legibility of LLM outputs (OpenAI)
- Scaling Laws with Vocabulary
- Transformer Layers as Painters (Sakana AI)
- GoldFinch (EleutherAI)
- AgentPoison
- NeedleBench
- Human-like Episodic Memory for Infinite Context LLMs
- Weak-to-Strong Reasoning
- Implicit meta-learning may lead language models to trust more reliable sources

AI Gen
- Shape of Motion
- Splatfacto-W
- Scaling Diffusion Transformers to 16 Billion Parameters
- Qwen2-Audio Technical Report
- JASCO (Meta)

Others
- LookupViT (DeepMind)
- xLSTMTime
- REGLE (Google Research)

overview for each & authors' explanations
read this in thread mode for the best experience

Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

Author's Explanation:


Overview:
Q-Sparse is an effective method for training sparsely-activated LLMs: it achieves full activation sparsity by applying top-K sparsification and using the straight-through estimator during training.

Q-Sparse achieves results comparable to baseline LLMs with greater efficiency, presents an inference-optimal scaling law for sparsely-activated LLMs, and is effective across various training scenarios.

Additionally, Q-Sparse works for both full-precision and 1-bit LLMs, with notable synergy when combined with BitNet b1.58, paving the way for more efficient and cost-effective future LLMs.
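
A hedged sketch of the two ingredients named above, top-K activation sparsification with a straight-through estimator (the detach trick carries dense gradients through the non-differentiable mask):

```python
import torch

def topk_sparsify(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest-magnitude entries along the last dimension."""
    threshold = x.abs().topk(k, dim=-1).values[..., -1:]
    y = x * (x.abs() >= threshold).float()
    # Straight-through estimator: the forward pass uses sparse y,
    # the backward pass behaves as if the mask were the identity.
    return x + (y - x).detach()
```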

Paper:
arxiv.org/abs/2407.10969
Jun 30, 2024
📈 Top AI/ML research papers (week June 23 - 30) with overview for each & authors' explanations:
- Gemma 2
- The FineWeb Datasets
- Adam-mini
- One Thousand and One Pairs
- LLMs' Classification Performance is Overclaimed
- GraphReader
- Cambrian-1
- LongRAG
- MUMU
- EAGLE-2
- WARP
- Optimised Grouped-Query Attention Mechanism for Transformers
- Octo-planner
- Step-DPO
- OMG-LLaVA
- Following Length Constraints in Instructions
- BigCodeBench
- DreamBench++
- YouDream
- Fantastic Copyrighted Beasts and How (Not) to Generate Them

read this in thread mode for the best experience

Gemma 2: Improving Open Language Models at a Practical Size

explained by author (?)


Overview:
Gemma 2 introduces lightweight, state-of-the-art models ranging from 2 to 27 billion parameters, with the 9B and 27B sizes available right now. Key updates include interleaving local-global attentions and grouped-query attention.

Using knowledge distillation for training, the models perform exceptionally well for their size, even competing with much larger models.
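
A minimal sketch of logit-level knowledge distillation of this kind, assuming per-token teacher logits are available (the temperature handling follows the standard recipe, not Gemma-specific code):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL divergence from the teacher's next-token distribution to the student's."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)
```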

Blog:
blog.google/technology/dev…

Paper:
storage.googleapis.com/deepmind-media…

Huggingface:
huggingface.co/collections/go…