Overview:
BitNet b1.58 2B4T is a native 1-bit LLM with 2 billion parameters trained on 4 trillion tokens, matching the performance of comparable full-precision LLMs on tasks like language understanding and reasoning.
This 1-bit architecture delivers substantial gains in computational efficiency: a smaller memory footprint, lower energy consumption, and faster decoding latency.
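Under the hood, the 1.58-bit recipe constrains every weight to {-1, 0, +1}. Below is a minimal sketch of the absmean ternary quantizer that BitNet-style models apply to their weight matrices; the function name is ours, and the released implementation adds training details (e.g., a straight-through estimator for gradients) omitted here.

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with a per-tensor scale (sketch)."""
    scale = w.abs().mean().clamp(min=eps)   # absmean scale over the full tensor
    w_q = (w / scale).round().clamp(-1, 1)  # ternary weights
    return w_q, scale                       # dequantize as w_q * scale
```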
- Inference-Time Scaling for Generalist Reward Modeling
- Multi-Token Attention
- Why do LLMs attend to the first token?
- Command A
- LLMs Pass the Turing Test
- Advances and Challenges in Foundation Agents
- PaperBench
- Effectively Controlling Reasoning Models through Thinking Intervention
- TransMamba
- Open-Reasoner-Zero
- Scaling Tool-Integrated RL
- Scaling Language-Free Visual Representation Learning
- Output Constraints as Attack Surface
- Large (Vision) Language Models are Unsupervised In-Context Learners
- Memorizing is Not Enough
- ShortV
- MegaScale-Infer
- What the F*ck Is Artificial General Intelligence?
- Prompting Forgetting
- Enlightenment Period Improving DNN Performance
overview for each + authors' explanations
Inference-Time Scaling for Generalist Reward Modeling
Overview:
This work from DeepSeek investigates inference-time scalability for generalist reward modeling (RM) in LLMs, utilizing pointwise generative reward modeling (GRM) for flexibility.
It introduces Self-Principled Critique Tuning (SPCT), an online RL method, to train DeepSeek-GRM models that adaptively generate principles and critiques for improved reward accuracy.
To enhance inference-time scaling, the study employs parallel sampling guided by a meta RM, showing markedly better quality and scalability on various RM benchmarks than existing methods, with gains that can exceed those of training-time scaling.
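Concretely, inference-time scaling here means sampling several principle-plus-critique generations in parallel and letting a meta RM decide which sampled rewards to aggregate. A hedged sketch of that loop, where `grm.generate`, `meta_rm.score`, and the keep-then-average rule are our assumed interfaces, not the paper's exact procedure:

```python
def scaled_reward(query, response, grm, meta_rm, k=8, keep=4):
    """Meta-RM-guided parallel sampling for a generative reward model (sketch)."""
    # sample k principle+critique generations (run in parallel in practice)
    samples = [grm.generate(query, response) for _ in range(k)]
    # the meta RM scores the quality of each generated critique
    ranked = sorted(samples,
                    key=lambda s: meta_rm.score(query, response, s.text),
                    reverse=True)
    # aggregate the pointwise rewards of the top-scoring samples
    kept = ranked[:keep]
    return sum(s.reward for s in kept) / len(kept)
```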
- Transformers without Normalization
- Block Diffusion
- Compute Optimal Scaling of Skills
- DAPO: An OS LLM RL System at Scale
- Teaching LLMs How to Learn with Contextual Fine-Tuning
- GR00T N1
- Why the Brain Cannot Be a Digital Computer
- RWKV-7 "Goose" with Expressive Dynamic State Evolution
- Why Do Multi-Agent LLM Systems Fail?
- Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
- Light-R1
- Where do Large Vision-Language Models Look at when Answering Questions?
- Improving Planning of Agents for Long-Horizon Tasks
- UniCombine
- How much do LLMs learn from negative examples?
- Tokenize Image as a Set
- Search-R1
- Measuring AI Ability to Complete Long Tasks
- Does Your VLM Get Lost in the Long Video Sampling Dilemma?
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Personalize Anything for Free with Diffusion Transformer
- The KoLMogorov Test: Compression by Code Generation
- Optimizing ML Training with Metagradient Descent
overview for each + authors' explanations
Transformers without Normalization
Overview:
Transformers can match or surpass the performance of their normalized counterparts using a simple technique called Dynamic Tanh (DyT), which replaces normalization layers with an element-wise operation inspired by the tanh-like input-output mappings observed in layer norm; the result is validated across a range of computer vision and LLM tasks.
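The replacement layer is tiny. A minimal PyTorch sketch of DyT as the paper describes it, with a learnable scalar α and the usual per-channel affine parameters:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: a drop-in replacement for LayerNorm/RMSNorm."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))    # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))    # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```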
- LLM Pretraining with Continuous Concepts
- Distillation Scaling Laws
- Can 1B LLM Surpass 405B LLM?
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
- Emergent Response Planning in LLM
- Improving Existing Optimization Algorithms with LLMs
- Training Language Models for Social Deduction with Multi-Agent RL
- Multi-Head Latent Attention Is All You Need
- Generative Modeling with Bayesian Sample Inference
- Scaling Pre-training to One Hundred Billion Data for Vision Language Models
- NatureLM: Deciphering the Language of Nature for Scientific Discovery
- Competitive Programming with Large Reasoning Models
- Matryoshka Quantization
overview for each + authors' explanations
LLM Pretraining with Continuous Concepts
Overview:
CoCoMix, a pretraining framework, combines standard next-token prediction with continuous concepts extracted by a pretrained sparse autoencoder, mixing them into the model's hidden state by interleaving them with token hidden representations.
This approach improves the sample efficiency of LLMs and consistently surpasses next-token prediction, knowledge distillation, and pause-token insertion across language modeling and reasoning tasks.
The integration of concept learning also enhances model interpretability and steerability.
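A hedged sketch of the mixing step under our own interface assumptions: a linear head predicts SAE concepts from the hidden state, the prediction is projected into a single continuous concept vector, and that vector is interleaved with the token hidden states (the paper's exact layer placement and losses differ in detail).

```python
import torch
import torch.nn as nn

class ConceptMixer(nn.Module):
    """Sketch of CoCoMix-style concept mixing; interfaces are assumptions."""
    def __init__(self, hidden_dim: int, num_concepts: int):
        super().__init__()
        self.concept_head = nn.Linear(hidden_dim, num_concepts)  # predict SAE concepts
        self.project = nn.Linear(num_concepts, hidden_dim)       # concepts -> hidden size

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, hidden)
        concept_logits = self.concept_head(h)   # trained against SAE concept labels
        c = self.project(concept_logits)        # continuous concept vectors
        # interleave token states and concept vectors: h_1, c_1, h_2, c_2, ...
        mixed = torch.stack([h, c], dim=2).flatten(1, 2)  # (batch, 2*seq, hidden)
        return mixed, concept_logits
```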
- Demystifying Long Chain-of-Thought Reasoning in LLMs
- OmniHuman-1
- LIMO
- s1: Simple test-time scaling
- Process Reinforcement through Implicit Rewards
- Iterate to Accelerate
- Efficient Reasoning with Hidden Thinking
- Fully Autonomous AI Agents Should Not be Developed
- DeepRAG
- Scalable-Softmax Is Superior for Attention
- The Differences Between Direct Alignment Algorithms are a Blur
- Preference Leakage
- SafeRAG
- Analyze Feature Flow to Enhance Interpretation and Steering in LMs
- Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
- ConceptAttention
- Weak-to-Strong Diffusion with Reflection
- Great Models Think Alike and this Undermines AI Oversight
- SmolLM2
- Inverse Bridge Matching Distillation
- Rethinking Mixture-of-Agents
overview for each + authors' explanations
Demystifying Long Chain-of-Thought Reasoning in LLMs
Overview:
Long Chain-of-Thought reasoning in LLMs, which enables strategies like backtracking and error correction, is significantly enhanced by scaling inference compute, and the process can be optimized through reinforcement learning.
Although supervised fine-tuning (SFT) simplifies training and improves efficiency, scaling verifiable reward signals, including those mined from noisy, web-extracted solutions, proves critical for RL, especially on out-of-distribution tasks such as STEM reasoning.
Basic capabilities such as error correction are present in base models, but incentivizing these skills for complex tasks via RL demands substantial compute, and measuring how such capabilities develop requires careful evaluation design.
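For a concrete sense of what "verifiable reward" means here, a toy rule-based checker of the kind such RL pipelines scale; the \boxed{} extraction and exact-match rule are our simplifications:

```python
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Toy verifiable reward: 1.0 if the boxed final answer matches, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```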
- Do generative video models learn physical principles from watching videos?
- Transformer^2: Self-adaptive LLMs
- MiniMax-01
- The Lessons of Developing Process Reward Models in Mathematical Reasoning
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
- Critical Tokens Matter
- Distilling Multi-modal Large Language Models for Autonomous Driving
- OmniThink
- Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
- MangaNinja
- Diffusion Adversarial Post-Training for One-Step Video Generation
overview for each + authors' explanations
Do generative video models learn physical principles from watching videos?
Overview:
This paper investigates whether generative video models learn physical principles by introducing Physics-IQ, a benchmark requiring an understanding of physics like fluid dynamics and magnetism.
Although some models, such as Sora and VideoPoet, can solve specific test cases, their overall physical understanding is limited, suggesting that visual realism does not equate to comprehension of physical laws.
- OpenAI o1 System Card
- PaliGemma 2
- HunyuanVideo
- Densing Law of LLMs
- DeMo: Decoupled Momentum Optimization
- o1-Coder
- Reverse Thinking Makes LLMs Stronger Reasoners
- Efficient Track Anything
- NVILA: Efficient Frontier VLMs
- Agent Skill Acquisition for LLMs via CycleQD
- A Noise is Worth Diffusion Guidance
- VisionZip: Longer is Better but Not Necessary in VLMs
- Infinity: Scaling Bitwise AutoRegressive Modeling for High-Res Image Synthesis
- Evaluating Language Models as Synthetic Data Generators
- Critical Tokens Matter
- SNOOPI
- TokenFlow
- MALT: Improving Reasoning with Multi-Agent LLM Training
- X-Prompt
- Video Depth without Video Models
- GRAPE: Generalizing Robot Policy via Preference Alignment
- Beyond Examples
- Scaling Transformers for Low-Bitrate High-Quality Speech Coding
- Retrieval-Augmented Reasoning Enhancement for LLMs
- Best-of-N Jailbreaking
- Composition of Experts
- Mind the Gap: Examining the Self-Improvement Capabilities of LLMs
- Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation
- Large-Scale T2I Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
- JetFormer
- Proactive Agent
- Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
- Distillation-Based NAS for Inference-Optimized LLMs
- Navigation World Models
overview for each + authors' explanations
OpenAI o1 System Card
OpenAI is currently hosting an event called 12 Days of OpenAI. openai.com/12-days/
On day 1, they released the full version of OpenAI o1, along with a $200/mo tier with uncapped usage.
About the technical report:
The o1 model series employs large-scale reinforcement learning with chain-of-thought reasoning, achieving state-of-the-art safety performance by mitigating risks like generating illicit advice, choosing biased responses, and resisting jailbreaks.
By reasoning about safety policies in context, these models enhance robustness but also introduce risks tied to advanced intelligence, emphasizing the need for rigorous alignment, stress testing, and risk management.
The report's analysis centers on safety evaluations, external red teaming, and Preparedness Framework assessments.
- Sparse Crosscoders
- Rethinking Softmax
- Mechanistic Unlearning
- Decomposing The Dark Matter of Sparse Autoencoders
- ZIP-FIT
- Automatically Interpreting Millions of Features in Large Language Models
- Breaking the Memory Barrier
- Can Knowledge Editing Really Correct Hallucinations?
- Framer: Interactive Frame Interpolation
- Beyond position
- A Hitchhiker's Guide to Scaling Law Estimation
- Scaling up Masked Diffusion Models on Text
- Why Does the Effective Context Length of LLMs Fall Short?
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models
- Improve Vision Language Model Chain-of-thought Reasoning
- PyramidDrop
- FrugalNeRF
- SAM2Long
- SeerAttention
- FiTv2
overview for each + authors' explanations
Sparse Crosscoders for Cross-Layer Features and Model Diffing
Overview:
This research introduces "sparse crosscoders," a tool that tracks shared features across layers in neural networks, simplifying feature analysis and model comparisons.
Crosscoders support long-term feature tracking, streamline circuit analysis by removing redundant features, and detect unique model differences, aiding in fine-tuning and architecture studies.
Early results show they outperform per-layer methods in capturing cross-layer structures, though with higher computational cost.
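The core objective is easy to state: one shared sparse code is read from several layers' activations at the same token and asked to reconstruct every layer, with per-layer encoder and decoder weights. A hedged sketch (shapes and the sparsity weighting are our simplifications of the write-up):

```python
import torch
import torch.nn as nn

class SparseCrosscoder(nn.Module):
    """Sketch of a sparse crosscoder over num_layers layers (simplified)."""
    def __init__(self, num_layers: int, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Parameter(torch.randn(num_layers, d_model, n_features) * 0.01)
        self.dec = nn.Parameter(torch.randn(num_layers, n_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))

    def forward(self, acts: torch.Tensor):
        # acts: (batch, num_layers, d_model) -- one token's activations per layer
        f = torch.relu(torch.einsum("bld,ldf->bf", acts, self.enc) + self.b_enc)
        recon = torch.einsum("bf,lfd->bld", f, self.dec)  # per-layer reconstruction
        return recon, f

def crosscoder_loss(acts, recon, f, dec, l1_coef=1e-3):
    mse = (acts - recon).pow(2).sum(dim=(1, 2)).mean()
    dec_norms = dec.norm(dim=-1).sum(dim=0)   # decoder norms summed across layers
    l1 = (f * dec_norms).sum(dim=-1).mean()   # norm-weighted sparsity penalty
    return mse + l1_coef * l1
```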
🚨 This week's top AI/ML research papers:
LM
- Q-Sparse
- SpreadsheetLLM (MSFT)
- Questionable practices in machine learning
- Accuracy is Not All You Need (MSFT)
- Qwen2 Technical Report
- Does Refusal Training in LLMs Generalize to the Past Tense?
- Prover-Verifier Games improve legibility of LLM outputs (OpenAI)
- Scaling Laws with Vocabulary
- Transformer Layers as Painters (Sakana AI)
- GoldFinch (EleutherAI)
- AgentPoison
- NeedleBench
- Human-like Episodic Memory for Infinite Context LLMs
- Weak-to-Strong Reasoning
- Implicit meta-learning may lead language models to trust more reliable sources
AI Gen
- Shape of Motion
- Splatfacto-W
- Scaling Diffusion Transformers to 16 Billion Parameters
- Qwen2-Audio Technical Report
- JASCO (Meta)
overview for each & authors' explanations
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
Overview:
Q-Sparse is an effective method for training sparsely-activated LLMs: it achieves full sparsity of activations by applying top-K sparsification and using the straight-through estimator during training.
Q-Sparse matches baseline LLMs at greater efficiency, establishes an inference-optimal scaling law for sparsely-activated LLMs, and works across a variety of training scenarios.
Additionally, Q-Sparse works for both full-precision and 1-bit LLMs, with notable synergy when combined with BitNet b1.58, paving the way for more efficient and cost-effective future LLMs.
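The central trick is compact: keep only the top-K activations by magnitude in the forward pass, but let gradients flow densely. A minimal sketch of that op (the paper applies it to the inputs of each linear layer; this standalone function is our framing):

```python
import torch

def topk_ste(x: torch.Tensor, k: int) -> torch.Tensor:
    """Top-K activation sparsification with a straight-through estimator.

    Forward: zero all but the k largest-magnitude entries along the last dim.
    Backward: pass gradients through as if no masking happened.
    """
    idx = x.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    sparse = x * mask
    return x + (sparse - x).detach()  # forward == sparse, gradient == identity
```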
📈 Top AI/ML research papers (week June 23-30) with overview for each & authors' explanations:
- Gemma 2
- The FineWeb Datasets
- Adam-mini
- One Thousand and One Pairs
- LLMs' Classification Performance is Overclaimed
- GraphReader
- Cambrian-1
- LongRAG
- MUMU
- EAGLE-2
- WARP
- Optimised Grouped-Query Attention Mechanism for Transformers
- Octo-planner
- Step-DPO
- OMG-LLaVA
- Following Length Constraints in Instructions
- BigCodeBench
- DreamBench++
- YouDream
- Fantastic Copyrighted Beasts and How (Not) to Generate Them
Gemma 2: Improving Open Language Models at a Practical Size
Overview:
Gemma 2 introduces lightweight, state-of-the-art open models ranging from 2 to 27 billion parameters, with the 9B and 27B sizes available now. Key updates include interleaved local-global attention and grouped-query attention.
Using knowledge distillation for training, the models perform exceptionally well for their size, even competing with much larger models.
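Since the smaller Gemma 2 models learn from a larger teacher rather than from one-hot next-token targets, the training objective is essentially a distribution-matching loss. A minimal sketch of such a distillation objective; the temperature and reduction are our assumptions, not Gemma 2's exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL from teacher to student over next-token distributions (sketch).

    Both logits tensors: (batch, seq, vocab).
    """
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1)  # KL per token position
    return kl.mean() * temperature ** 2
```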