Overview:
BitNet b1.58 2B4T is a native 1-bit LLM with 2 billion parameters trained on 4 trillion tokens, matching the performance of comparable full-precision LLMs on tasks like language understanding and reasoning.
This 1-bit architecture delivers substantial gains in computational efficiency, including a smaller memory footprint, lower energy consumption, and lower decoding latency.
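As a rough illustration of what 1.58-bit weights mean in practice, the sketch below shows an absmean-style ternarization in the spirit of the BitNet family: each weight tensor is scaled by its mean absolute value and rounded to {-1, 0, +1}. It is a sketch only; the paper's quantization-aware training recipe and activation handling are not shown.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternarization sketch: scale by mean |w|, round to {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

# Usage: y ~ (x @ w_q.T) * scale  (bias terms and activation quantization omitted).
```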
Reasoning Models Can Be Effective Without Thinking
Overview:
This research questions the necessity of explicit "Thinking" steps for LLM reasoning, demonstrating that bypassing this process via simple "NoThinking" prompting is effective.
Controlling for token budget, NoThinking substantially outperforms explicit Thinking on diverse reasoning tasks including mathematical problem solving and coding, showing notable gains in low-budget settings (e.g., 51.3 vs 28.9 on ACM 23).
The paper introduces a parallel scaling approach where multiple independent NoThinking outputs are generated and aggregated using verifiers or best-of-N strategies.
This parallel method outperforms Thinking baselines at similar latency and matches Thinking results that require significantly more latency (up to 9x longer).
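A minimal sketch of that parallel-scaling recipe: sample k independent NoThinking completions and keep the answer a verifier prefers, falling back to majority vote. The prefill string and the `generate`, `extract_answer`, and `score` callables are illustrative placeholders, not the paper's exact interface.

```python
from collections import Counter

# Assumed prefill that skips the explicit thinking block (illustrative wording).
NOTHINKING_PREFIX = "Okay, I think I have finished thinking.\n"

def parallel_nothinking(problem, generate, extract_answer, k=8, score=None):
    outputs = [generate(problem, prefill=NOTHINKING_PREFIX) for _ in range(k)]
    answers = [extract_answer(o) for o in outputs]
    if score is not None:                          # verifier-based best-of-N
        return max(answers, key=score)
    return Counter(answers).most_common(1)[0][0]   # otherwise majority vote
```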
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Overview:
ReTool enhances LLMs for structured problem-solving by integrating real-time code execution with natural language reasoning through reinforcement learning.
The framework features dynamic interleaving of code and text, employing an automated RL paradigm where the model learns optimal tool invocation strategies from outcome feedback without human priors.
On the challenging MATH Olympiad benchmark AIME, ReTool-32B achieves 67% accuracy, substantially outperforming text-based RL baselines, and reaches 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%.
The approach also leads to emergent behaviors like code self-correction during complex reasoning tasks.
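The rollout loop below sketches how interleaved text and code execution can work: the model writes a code block, a sandbox runs it, and the output is appended to the context before generation resumes. The tag format, helper names, and sandboxing here are assumptions for illustration, not ReTool's exact implementation.

```python
import re
import subprocess
import sys

def tool_rollout(model, prompt, max_turns=8):
    ctx = prompt
    for _ in range(max_turns):
        out = model.generate(ctx, stop=["</code>"])   # stop when a code block closes
        ctx += out
        m = re.search(r"<code>(.*)$", out, re.S)
        if not m:                                     # no tool call -> final answer
            return ctx
        try:
            r = subprocess.run([sys.executable, "-c", m.group(1)],
                               capture_output=True, text=True, timeout=10)
            feedback = r.stdout or r.stderr
        except subprocess.TimeoutExpired:
            feedback = "execution timed out"
        ctx += f"</code>\n<output>{feedback}</output>\n"   # feed the result back in
    return ctx
```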
Sleep-time Compute: Beyond Inference Scaling at Test-time
Overview:
Sleep-time compute allows LLMs to perform computations offline by anticipating user queries, aiming to reduce the high latency and cost associated with scaling test-time inference.
On modified reasoning tasks like Stateful GSM-Symbolic and Stateful AIME, this method reduces necessary test-time compute by approximately 5x for equivalent accuracy.
Scaling sleep-time compute further boosts accuracy by up to 18% on these tasks, and amortizing this computation across related queries decreases average cost per query by 2.5x.
The effectiveness of sleep-time compute correlates with the predictability of user queries.
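A toy sketch of the split between offline and online compute, assuming a generic `llm` callable and illustrative prompts: during idle time the raw context is expanded into pre-computed notes with a large budget, and test-time queries are answered against those notes with a small one.

```python
def sleep_time(llm, context: str) -> str:
    # Offline: spend a generous budget anticipating what the context implies.
    return llm("Re-read the context and write down facts, intermediate results, "
               "and answers to likely follow-up questions:\n" + context,
               max_tokens=2048)

def test_time(llm, notes: str, query: str) -> str:
    # Online: answer quickly against the pre-computed notes.
    return llm(f"Notes:\n{notes}\n\nQuestion: {query}\nAnswer concisely:",
               max_tokens=256)
```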
Overview:
Nemotron-H presents a family of efficient 8B and 56B hybrid Mamba-Transformer models that replace most self-attention layers with Mamba layers, delivering up to 3x faster inference than comparable state-of-the-art Transformers at similar accuracy. The family also includes a 47B variant compressed with MiniPuzzle for a further 20% speedup, along with an FP8 training recipe that matches BF16 quality.
Overview:
This paper introduces Kimina-Prover Preview, an LLM trained with a large-scale reinforcement learning pipeline from Qwen2.5-72B, pioneering a reasoning-driven exploration paradigm for formal theorem proving in Lean 4.
Employing a novel formal reasoning pattern, the model mimics human problem-solving strategies to achieve state-of-the-art performance on the miniF2F benchmark, reaching 80.7% pass@8192.
Kimina-Prover shows high sample efficiency, delivering strong results with minimal sampling and effective scaling with computational budget.
Furthermore, the work demonstrates clear performance scaling with model size, a trend previously unobserved for neural theorem provers, and its learned reasoning style differs from traditional search algorithms.
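For readers unfamiliar with the target format, here is an illustrative Lean 4 statement of the kind such provers are asked to close (not drawn from miniF2F itself):

```lean
import Mathlib

-- Illustrative example only: a simple formal goal with a one-tactic proof.
theorem add_self_eq_two_mul (a : ℕ) : a + a = 2 * a := by
  ring
```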
Overview:
This paper introduces CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework to discover and refine optimal pre-training data mixtures from unlabeled corpora.
The method embeds and clusters large datasets in a semantic space, then iteratively searches for optimal mixtures using a smaller proxy model and a performance predictor.
Continuous training on 400B tokens with a CLIMB-optimized mixture allows a 1B parameter model to outperform Llama-3.2-1B by 2.0%, while optimizing for a specific domain boosts performance by 5% over random sampling.
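A toy version of the iterative mixture search, with `train_proxy_and_eval` standing in for the proxy-model training and the learned performance predictor; the Dirichlet-based re-sampling is an illustrative choice, not the paper's exact bootstrapping procedure.

```python
import numpy as np

def search_mixture(n_clusters, train_proxy_and_eval, iters=3, pool=16, keep=4):
    rng = np.random.default_rng(0)
    candidates = rng.dirichlet(np.ones(n_clusters), size=pool)   # cluster weights
    best = None
    for _ in range(iters):
        scores = np.array([train_proxy_and_eval(w) for w in candidates])
        top = candidates[np.argsort(scores)[-keep:]]             # best mixtures so far
        best = top[-1]
        # Bootstrap the next round: sample new mixtures concentrated near the winners.
        candidates = np.vstack([rng.dirichlet(w * 50 + 1, size=pool // keep)
                                for w in top])
    return best
```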
Overview:
Dynamic Cheatsheet (DC) introduces a framework endowing black-box LLMs with persistent, self-curated memory for test-time learning, allowing reuse of strategies and code snippets.
This method substantially improves performance without ground-truth labels, doubling accuracy on AIME math exams and increasing Game of 24 success from 10% to 99% by recalling effective solutions.
Additional gains were demonstrated on knowledge-intensive tasks like GPQA-Diamond (+9%) and MMLU-Pro (+8%), all achieved without modifying the underlying LLM parameters.
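The loop below sketches the test-time memory idea under simple assumptions (a generic `llm` callable, free-form prompts): answer each problem with the current cheatsheet in context, then ask the model to curate the cheatsheet before the next one.

```python
def solve_with_cheatsheet(llm, problems, cheatsheet=""):
    answers = []
    for p in problems:
        ans = llm(f"Cheatsheet:\n{cheatsheet}\n\nProblem: {p}\nSolve it.")
        answers.append(ans)
        # Self-curated memory: keep reusable strategies/snippets, drop dead ends.
        cheatsheet = llm("Update the cheatsheet given the latest attempt. "
                         "Keep only reusable strategies and code snippets.\n"
                         f"Old cheatsheet:\n{cheatsheet}\nLatest attempt:\n{ans}")
    return answers, cheatsheet
```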
Overview:
This research investigates how new information integrates into LLMs, identifying a "priming" effect where learning specific facts leads to their inappropriate application in unrelated contexts.
Using the introduced "Outlandish" dataset, the study demonstrates that the extent of this priming can be predicted by analyzing key token probabilities before learning, a finding consistent across various model architectures and sizes.
Two novel methods, "stepping-stone" text augmentation and "ignore-k" update pruning, are proposed to modulate this knowledge permeation.
These techniques substantially reduce undesirable priming effects by 50-95% while preserving the LLM's ability to learn new information accurately.
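As a rough sketch of what update pruning in the spirit of "ignore-k" could look like, the snippet below zeroes the largest-magnitude gradient entries of each parameter tensor before the optimizer step; the per-tensor selection and the 8% fraction are illustrative assumptions rather than the paper's exact rule.

```python
import torch

def ignore_topk_updates(model, k_frac: float = 0.08):
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.abs().flatten()
        k = max(1, int(k_frac * g.numel()))
        threshold = torch.topk(g, k).values.min()   # k-th largest |gradient|
        p.grad[p.grad.abs() >= threshold] = 0.0     # drop the biggest updates
```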
Overview:
InternVL3 introduces a native multimodal pre-training paradigm, enabling the joint acquisition of multimodal and linguistic capabilities from diverse data sources within a single stage, circumventing typical MLLM alignment challenges.
This approach incorporates Variable Visual Position Encoding (V2PE) for extended contexts, advanced post-training techniques like SFT and MPO, and test-time scaling strategies.
The InternVL3-78B model achieves a state-of-the-art 72.2 score on the MMMU benchmark among open-source MLLMs, proving highly competitive with leading proprietary models while retaining strong language proficiency.
Overview:
This work introduces the Massive Image Embedding Benchmark (MIEB) for comprehensive evaluation of image and image-text models across 130 tasks in 38 languages, grouped into 8 high-level categories.
Benchmarking 50 models reveals no single dominant method, showing strong visual text representation but limited capabilities with interleaved encodings and confounders.
Encoder performance on MIEB correlates highly with their effectiveness in multimodal LLMs.
Overview:
REPA-E enables effective end-to-end training of variational auto-encoders (VAEs) alongside latent diffusion transformers, addressing the limitations of standard diffusion loss for joint optimization.
By employing a representation-alignment (REPA) loss, REPA-E facilitates simultaneous tuning of both the VAE and the diffusion model.
This approach accelerates diffusion model training by over 17x and 45x relative to REPA and vanilla training recipes, respectively, and improves the VAE's latent structure, leading to state-of-the-art generation performance on ImageNet.
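A rough sketch of how such a joint objective can be wired up, with `vae`, `diffusion`, and the frozen encoder features as placeholders: the diffusion loss is detached from the VAE so it cannot degrade the latents, while the alignment loss (here a cosine term) is allowed to tune both. Weights and method names are assumptions, not the exact REPA-E recipe.

```python
import torch
import torch.nn.functional as F

def joint_loss(vae, diffusion, frozen_encoder_feats, images, lam=0.5):
    z = vae.encode(images)
    diff_loss = diffusion.denoising_loss(z.detach())          # stop-grad into the VAE
    h = diffusion.intermediate_features(z)                     # features to align
    align = 1.0 - F.cosine_similarity(h, frozen_encoder_feats, dim=-1).mean()
    return diff_loss + lam * align                             # both terms train end to end
```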
Overview:
This paper introduces Seedream 3.0, a significantly improved Chinese-English bilingual image foundation model, with pipeline advancements spanning data construction (including defect-aware training data and dual-axis sampling) through to deployment.
Key pre-training techniques include mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling, complemented by post-training using diversified SFT captions and a VLM-based reward model.
A novel acceleration paradigm, employing consistent noise expectation and importance-aware timestep sampling, achieves a 4 to 8 times speedup without quality degradation.
Seedream 3.0 shows enhanced capabilities in complex prompt adherence, fine-grained text rendering, especially for Chinese characters, improved visual quality, and native high-resolution (up to 2K) generation.
Overview:
This paper introduces Trelawney, a technique addressing the limitations of standard causal language model training by rearranging training data sequences to better imitate human goal-oriented generation without requiring architectural changes.
This approach improves performance on several key benchmarks, including planning, algorithmic reasoning, and story generation.
Furthermore, Trelawney naturally enables the generation of long-term goals at no additional cost, which can be leveraged to further enhance planning and reasoning capabilities.
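A minimal sketch of the kind of data rearrangement this implies, assuming a future span is copied forward and wrapped in delimiter tokens so the model learns to state a goal before generating the tokens that reach it; the delimiters and placement rule are illustrative, not the paper's exact scheme.

```python
def insert_goal(tokens, goal_start, goal_len, insert_at,
                open_tok="<goal>", close_tok="</goal>"):
    """Copy a later span forward as an explicit goal wrapped in delimiter tokens."""
    goal = tokens[goal_start:goal_start + goal_len]
    return tokens[:insert_at] + [open_tok] + goal + [close_tok] + tokens[insert_at:]

# e.g. insert_goal("the hero finds the key and opens the door".split(),
#                  goal_start=6, goal_len=3, insert_at=3)
```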
Overview:
This paper investigates predicting optimal pretraining data for LLMs using small-scale experiments, conducting controlled pretraining across diverse corpora, sizes up to 1B parameters, and 100B tokens.
Ranking models at a small scale (150M) predicts relative performance at a larger scale (1B) with high accuracy (~80%), outperforming tested scaling law methods for data selection efficiency.
Furthermore, continuous likelihood metrics from small models accurately predict target-scale performance (>80%) on benchmarks like MMLU and HumanEval using minimal compute (0.01%).
Autoregressive Distillation of Diffusion Transformers
Overview:
This paper introduces AutoRegressive Distillation (ARD) to mitigate exposure bias in diffusion transformer distillation by leveraging the historical ODE trajectory instead of only the most recent sample.
ARD modifies the transformer architecture using token-wise time embeddings and a block-wise causal attention mask, incorporating history mainly in lower layers for efficiency.
On ImageNet-256, ARD achieves a 5x reduction in FID degradation over baselines with minimal extra FLOPs, reaching a low FID score in a few steps.
The approach also shows improved prompt adherence in text-to-image synthesis compared to other distilled models with little FID degradation from the teacher.
Perception Encoder: The best visual embeddings are not at the output of the network
Overview:
Perception Encoder (PE) introduces a vision encoder for image and video understanding trained solely via scaled vision-language contrastive learning.
This work finds that the strongest, most general visual embeddings are located within intermediate network layers, rather than the final output.
Utilizing proposed language and spatial alignment methods to extract these representations, PE achieves state-of-the-art performance across diverse downstream tasks including zero-shot classification, retrieval, Q&A, and dense spatial prediction.
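The snippet below sketches the practical consequence: pull features from an intermediate block via a forward hook instead of the final output. The `blocks` attribute and mean-pooling are assumptions about a generic ViT-style encoder, and the paper's alignment tuning is not shown.

```python
import torch

def intermediate_embedding(vision_tower, images, layer_idx: int):
    feats = {}
    handle = vision_tower.blocks[layer_idx].register_forward_hook(
        lambda module, inputs, output: feats.update(h=output))
    with torch.no_grad():
        vision_tower(images)                 # run the full forward pass
    handle.remove()
    return feats["h"].mean(dim=1)            # pool patch tokens into one embedding
```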
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
Overview:
This paper introduces M1, a hybrid Mamba-based linear RNN reasoning model designed to overcome the quadratic scaling limitations of Transformers for complex reasoning tasks requiring long test-time computation.
M1 utilizes distillation from existing models and reinforcement learning for enhanced performance.
On benchmarks like AIME and MATH, M1 matches state-of-the-art distilled Transformer models while achieving over a 3x inference speedup.
This improved throughput allows M1 to attain higher accuracy under fixed time budgets via self-consistency, presenting a more scalable approach for test-time reasoning generation.
Overview:
The d1 framework adapts pre-trained masked diffusion LLMs (dLLMs) for reasoning using masked supervised finetuning and diffu-GRPO, a novel critic-free, policy-gradient reinforcement learning algorithm.
This approach significantly improves mathematical and logical reasoning performance on a state-of-the-art dLLM.
Overview:
Antidistillation sampling counters unwanted model distillation facilitated by the reasoning traces produced by frontier models.
This technique strategically modifies the model's next-token probability distribution during the generation process.
This poisons the reasoning traces, significantly diminishing their utility for distillation while preserving the original model's practical performance.
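Conceptually, the sampler trades a little teacher likelihood against how useful each token would be to a would-be student. The sketch below captures only that trade-off; `distill_utility` stands in for the paper's proxy-student-based estimate and is not part of any public API.

```python
import torch

def antidistill_sample(teacher_logits: torch.Tensor,
                       distill_utility: torch.Tensor, lam: float = 1.0):
    """Bias sampling away from tokens estimated to help a student's distillation
    objective, while staying close to the teacher distribution (conceptual only)."""
    adjusted = teacher_logits - lam * distill_utility
    probs = torch.softmax(adjusted, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```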
- Spurious Rewards
- FLUX.1 Kontext
- Learning to Reason without External Rewards
- Reasoning LLMs are Wandering Solution Explorers
- VLM-3R
- Silence is Not Consensus
- Beyond Markovian
- The Entropy Mechanism of RL for Reasoning LMs
- ATLAS
- Fractured CoT Reasoning
- On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
- Pixel Reasoner
- Fast-dLLM
- Accelerating Diffusion LM Inference via Efficient KV Caching and Guided Diffusion
- Reinforcing General Reasoning without Verifiers
- Hardware-Efficient Attention for Fast Decoding
- Temporal Sampling for Forgotten Reasoning in LLMs
- Long-Context State-Space Video World Models
- DeepTheorem
- Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
- Estimating the Effects of Sample Training Orders for LLMs without Retraining
- Darwin Godel Machine
- Grounded Reinforcement Learning for Visual Reasoning
Spurious Rewards: Rethinking Training Signals in RLVR
Overview:
RLVR enables strong mathematical reasoning in Qwen2.5-Math-7B even with spurious rewards, achieving up to 26.5% gains on MATH-500, nearly matching improvements from ground truth rewards.
These gains are driven by increased code reasoning behaviors, rising from 66.7% to over 90%, despite no code execution. However, similar spurious signals fail on models like Llama3 and OLMo2, demonstrating the need to test RLVR across diverse architectures and model families.
This suggests RLVR surfaces latent pretrained reasoning rather than optimizing for reward correctness.
Overview:
FLUX.1 Kontext is a generative flow matching model that unifies image generation and editing by using a sequence concatenation method to handle diverse in-context tasks, showing strong object and character preservation across iterations.
It outperforms existing models in multi-turn consistency and editing robustness, while delivering significantly faster generation speeds.
Evaluated on the new KontextBench benchmark spanning five editing categories, it demonstrates competitive single-turn quality and sets a new standard for unified image workflows.
- Absolute Zero
- RM-R1
- Seed-Coder
- Flow-GRPO
- ZeroSearch
- Ming-Lite-Uni
- A Survey on Large Multimodal Reasoning Models
- On Path to Multimodal Generalist
- ZeroSearch
- HunyuanCustom
- Unified Multimodal CoT Reward Model through Reinforcement Fine-Tuning
- Grokking in the Wild
- Voila
- Llama-Nemotron
- An Empirical Study of Qwen3 Quantization
- DiffVQA
- Scalable Chain of Thoughts via Elastic Reasoning
- R1-Reward
- Pangu Ultra MoE
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Overview:
Absolute Zero introduces a new RLVR (RL with verifiable rewards) framework where a model autonomously generates and solves tasks to optimize its learning, eliminating the need for human-curated data.
The Absolute Zero Reasoner (AZR) uses a code executor for task validation and answer verification, providing verifiable rewards for open-ended, self-supervised training.
AZR achieves SoTA performance on coding and mathematical reasoning benchmarks despite using no external data, outperforming zero-setting baselines trained on large human-curated datasets. The approach generalizes across model sizes and architectures.
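A bare-bones sketch of one propose-solve step with a Python executor as the verifier; the task representation, prompts, and reward shaping are simplified placeholders rather than AZR's actual pipeline.

```python
import subprocess
import sys

def run_python(code: str) -> str:
    r = subprocess.run([sys.executable, "-c", code],
                       capture_output=True, text=True, timeout=5)
    return r.stdout.strip()

def self_play_step(model, buffer):
    task = model.propose_task(examples=buffer)     # e.g. a program plus an input
    target = run_python(task.reference_program)    # the executor defines ground truth
    answer = model.solve(task.question)
    reward = float(answer.strip() == target)       # verifiable, label-free reward
    buffer.append((task, answer, reward))
    return reward                                  # fed into the RL update
```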
Overview:
This work introduces Reasoning Reward Models (ReasRMs) like RM-R1, which formulate reward modeling as a reasoning task for enhanced LLM alignment, trained via distilling reasoning chains and reinforcement learning with verifiable rewards.
These models self-generate reasoning traces or rubrics for evaluation, achieving SOTA or near-SOTA performance on reward benchmarks and outperforming larger models by up to 13.8%.
- Inference-Time Scaling for Generalist Reward Modeling
- Multi-Token Attention
- Why do LLMs attend to the first token?
- Command A
- LLMs Pass the Turing Test
- Advances and Challenges in Foundation Agents
- PaperBench
- Effectively Controlling Reasoning Models through Thinking Intervention
- TransMamba
- Open-Reasoner-Zero
- Scaling Tool-Integrated RL
- Scaling Language-Free Visual Representation Learning
- Output Constraints as Attack Surface
- Large (Vision) Language Models are Unsupervised In-Context Learners
- Memorizing is Not Enough
- ShortV
- MegaScale-Infer
- What the F*ck Is Artificial General Intelligence?
- Prompting Forgetting
- Enlightenment Period Improving DNN Performance
Inference-Time Scaling for Generalist Reward Modeling
Overview:
This work from DeepSeek investigates inference-time scalability for generalist reward modeling (RM) in LLMs, utilizing pointwise generative reward modeling (GRM) for flexibility.
It introduces Self-Principled Critique Tuning (SPCT), an online RL method, to train DeepSeek-GRM models that adaptively generate principles and critiques for improved reward accuracy.
To enhance inference-time scaling, the study employs parallel sampling guided by a meta RM, demonstrating significantly improved quality and scalability on various RM benchmarks compared to existing methods and potentially exceeding training-time scaling benefits.
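A small sketch of that inference-time recipe, assuming a generative RM that returns per-response scores and a meta RM that scores each judgment's trustworthiness; all method and attribute names here are placeholders.

```python
from collections import defaultdict

def scaled_reward(grm, meta_rm, query, responses, k=8, top=4):
    judgments = [grm.judge(query, responses) for _ in range(k)]   # principles + critiques
    trusted = sorted(judgments, key=meta_rm.score, reverse=True)[:top]
    totals = defaultdict(float)
    for j in trusted:
        for resp, s in j.scores.items():        # pointwise score per response
            totals[resp] += s
    return max(totals, key=totals.get)          # response with the best aggregated reward
```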
Overview:
Multi-Token Attention (MTA) enhances LLM attention by conditioning weights on multiple query and key vectors simultaneously through convolution operations over queries, keys, and heads.
This method allows for locating relevant context using richer information, leading to enhanced performance over Transformer baselines on language modeling and long-context search tasks.
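A simplified sketch of the key-query convolution piece: standard attention logits are convolved over the (query, key) grid before the softmax so that neighbouring tokens' scores can modulate one another. Head mixing and the paper's exact masking are omitted, and the kernel is a plain conv2d weight.

```python
import torch
import torch.nn.functional as F

def key_query_conv_attention(q, k, v, kernel):
    """q, k, v: (batch, heads, seq, dim); kernel: (1, 1, cq, ck) conv weights."""
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / q.shape[-1] ** 0.5
    b, h, s, _ = scores.shape
    scores = F.conv2d(scores.reshape(b * h, 1, s, s), kernel,
                      padding="same").reshape(b, h, s, s)       # mix nearby q/k scores
    causal = torch.triu(torch.full((s, s), float("-inf"), device=q.device), diagonal=1)
    return torch.softmax(scores + causal, dim=-1) @ v
```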
- Transformers without Normalization
- Block Diffusion
- Compute Optimal Scaling of Skills
- DAPO: An OS LLM RL System at Scale
- Teaching LLMs How to Learn with Contextual Fine-Tuning
- GR00T N1
- Why the Brain Cannot Be a Digital Computer
- RWKV-7 "Goose" with Expressive Dynamic State Evolution
- Why Do Multi-Agent LLM Systems Fail?
- Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
- Light-R1
- Where do Large Vision-Language Models Look at when Answering Questions?
- Improving Planning of Agents for Long-Horizon Tasks
- UniCombine
- How much do LLMs learn from negative examples?
- Tokenize Image as a Set
- Search-R1
- Measuring AI Ability to Complete Long Tasks
- Does Your VLM Get Lost in the Long Video Sampling Dilemma?
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Personalize Anything for Free with Diffusion Transformer
- The KoLMogorov Test: Compression by Code Generation
- Optimizing ML Training with Metagradient Descent
Overview:
Transformers can match or surpass the performance of their normalized counterparts using a simple technique called Dynamic Tanh (DyT), an element-wise operation that replaces normalization layers, inspired by the tanh-like input-output mappings observed in layer norm, and validated across a range of computer vision and LLM tasks.
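A minimal implementation of DyT as described, which can stand in for a LayerNorm/RMSNorm module (initialization details beyond the scalar alpha follow common defaults):

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """DyT(x) = weight * tanh(alpha * x) + bias, with a learnable scalar alpha."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.tanh(self.alpha * x) + self.bias
```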
Overview:
Block diffusion language models merge discrete denoising diffusion with autoregressive models, addressing fixed-length generation limitations and improving inference efficiency via KV caching and parallel token sampling.
The work introduces an efficient training algorithm, gradient variance estimators, and data-driven noise schedules that minimize variance, achieving state-of-the-art results among diffusion models on language modeling benchmarks and enabling flexible-length sequence generation.
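Generation under this scheme can be pictured as the sketch below: previously generated blocks stay fixed (and can be KV-cached), while each new block is produced by a short denoising loop conditioned on everything before it. `mask_id` and `denoise_step` are placeholders, not the released model's API.

```python
def generate_blockwise(model, prompt_ids, n_blocks, block_len, n_steps):
    seq = list(prompt_ids)
    for _ in range(n_blocks):
        block = [model.mask_id] * block_len              # start the block fully masked
        for t in reversed(range(n_steps)):
            block = model.denoise_step(seq, block, t)    # parallel within-block update
        seq += block                                     # the block becomes fixed context
    return seq
```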
- LLM Pretraining with Continuous Concepts
- Distillation Scaling Laws
- Can 1B LLM Surpass 405B LLM?
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- Emergent Response Planning in LLM
- Improving Existing Optimization Algorithms with LLMs
- Training Language Models for Social Deduction with Multi-Agent RL
- Multi-Head Latent Attention Is All You Need
- Generative Modeling with Bayesian Sample Inference
- Scaling Pre-training to One Hundred Billion Data for Vision Language Models
- NatureLM: Deciphering the Language of Nature for Scientific Discovery
- Competitive Programming with Large Reasoning Models
- Matryoshka Quantization
Overview:
CoCoMix, a pretraining framework, combines standard next-token prediction with continuous concepts learned from a pretrained sparse autoencoder, mixing them into the model's hidden state by interleaving them with token hidden representations.
This approach improves sample efficiency of LLMs and consistently surpasses next token prediction, knowledge distillation, and pause token insertion across language modeling and reasoning tasks.
The integration of concept learning also enhances model interpretability and steerability.
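The mixing step can be sketched as below, with a concept head that would be supervised against SAE concepts during training (supervision omitted here); layer placement, dimensions, and the interleaving rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConceptMixer(nn.Module):
    def __init__(self, d_model: int, n_concepts: int):
        super().__init__()
        self.to_concepts = nn.Linear(d_model, n_concepts)  # concept prediction head
        self.to_hidden = nn.Linear(n_concepts, d_model)    # continuous concept vector

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) hidden states from some layer
        concept = self.to_hidden(self.to_concepts(h))
        mixed = torch.stack([h, concept], dim=2)           # pair each token with its concept
        return mixed.reshape(h.shape[0], -1, h.shape[-1])  # interleaved (batch, 2*seq, d)
```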
Overview:
This work introduces a distillation scaling law that predicts student model performance based on compute allocation between the teacher and student, enabling optimal resource distribution.
It provides compute-optimal distillation strategies for cases where a teacher exists or needs training, showing that distillation outperforms supervised pretraining when multiple students are distilled.
However, when training both a teacher and a single student, supervised learning is preferable.
- Demystifying Long Chain-of-Thought Reasoning in LLMs
- OmniHuman-1
- LIMO
- s1: Simple test-time scaling
- Process Reinforcement through Implicit Rewards
- Iterate to Accelerate
- Efficient Reasoning with Hidden Thinking
- Fully Autonomous AI Agents Should Not be Developed
- DeepRAG
- Scalable-Softmax Is Superior for Attention
- The Differences Between Direct Alignment Algorithms are a Blur
- Preference Leakage
- SafeRAG
- Analyze Feature Flow to Enhance Interpretation and Steering in LMs
- Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
- ConceptAttention
- Weak-to-Strong Diffusion with Reflection
- Great Models Think Alike and this Undermines AI Oversight
- SmolLM2
- Inverse Bridge Matching Distillation
- Rethinking Mixture-of-Agents
Demystifying Long Chain-of-Thought Reasoning in LLMs
Overview:
Long Chain-of-Thought reasoning in LLMs, which enables strategies like backtracking and error correction, is significantly enhanced by scaling inference compute, and the process can be optimized through reinforcement learning.
Although supervised fine-tuning (SFT) simplifies training and improves efficiency, scaling verifiable reward signals, in particular by leveraging noisy, web-extracted solutions, proves critical for RL, especially on out-of-distribution tasks such as STEM reasoning.
While basic capabilities such as error correction are present in base models, incentivizing these skills for complex tasks via RL requires substantial compute, and measuring how they develop demands a careful methodology.
Overview:
OmniHuman, a Diffusion Transformer-based framework, scales up data by incorporating mixed motion-related conditions during training, achieving highly realistic human video generation across diverse scenarios.
The framework supports varied portrait content, handling talking, singing, human-object interaction, and diverse image styles, backed by dedicated training principles, model architecture, and inference strategy.
OmniHuman provides greater flexibility in inputs including audio-driven, video-driven, and combined driving signals, surpassing existing end-to-end audio-driven methods in realism and versatility.