Overview:
BitNet b1.58 2B4T is a native 1-bit LLM with 2 billion parameters trained on 4 trillion tokens, matching the performance of comparable full-precision LLMs on tasks like language understanding and reasoning.
This 1-bit architecture demonstrates substantial improvements in computational efficiency, with a smaller memory footprint, lower energy consumption, and lower decoding latency.
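As a rough sketch of the underlying idea (not the released implementation), a 1.58-bit linear layer constrains weights to {-1, 0, +1} with an absmean scale and keeps full-precision master weights during training via a straight-through estimator; the per-tensor scaling below is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def absmean_ternarize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize weights to {-1, 0, +1} using an absmean scale (BitNet b1.58-style)."""
    scale = w.abs().mean().clamp(min=eps)      # per-tensor absmean scale (assumption)
    w_q = (w / scale).round().clamp(-1, 1)     # ternary weights
    return w_q, scale

class BitLinear(nn.Module):
    """Linear layer with ternary weights; full-precision master weights are kept for training."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w_q, scale = absmean_ternarize(self.weight)
        # straight-through estimator: forward uses quantized weights, backward sees full precision
        w_eff = self.weight + (w_q * scale - self.weight).detach()
        return F.linear(x, w_eff)

layer = BitLinear(64, 32)
y = layer(torch.randn(4, 64))   # behaves like a regular linear layer at call time
```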
Reasoning Models Can Be Effective Without Thinking
Overview:
This research questions the necessity of explicit "Thinking" steps for LLM reasoning, demonstrating that bypassing this process via simple "NoThinking" prompting is effective.
Controlling for token budget, NoThinking substantially outperforms explicit Thinking on diverse reasoning tasks, including mathematical problem solving and coding, with notable gains in low-budget settings (e.g., 51.3 vs. 28.9 on AMC 23).
The paper introduces a parallel scaling approach where multiple independent NoThinking outputs are generated and aggregated using verifiers or best-of-N strategies.
This parallel method outperforms Thinking baselines at similar latency and matches Thinking results that require significantly more latency (up to 9x longer).
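A minimal sketch of the parallel scaling recipe, assuming hypothetical `generate`, `extract_answer`, and optional `score` helpers; the exact prefill string used to skip the thinking block is only illustrative.

```python
from collections import Counter

def nothinking_parallel(problem, generate, extract_answer, n=8, score=None):
    # "NoThinking" prompting: prefill a trivially short thinking block so the model
    # answers directly instead of producing a long reasoning trace.
    prefill = "<think>Okay, I have finished thinking.</think>"
    candidates = [extract_answer(generate(problem, prefill)) for _ in range(n)]
    if score is not None:
        # best-of-N with a task-specific verifier, when one is available
        return max(candidates, key=lambda ans: score(problem, ans))
    # otherwise aggregate the independent samples by majority vote
    return Counter(candidates).most_common(1)[0][0]
```

Because each NoThinking output is much shorter than a full reasoning trace, the N samples can be generated in parallel at close to the latency of a single pass.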
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Overview:
ReTool enhances LLMs for structured problem-solving by integrating real-time code execution with natural language reasoning through reinforcement learning.
The framework features dynamic interleaving of code and text, employing an automated RL paradigm where the model learns optimal tool invocation strategies from outcome feedback without human priors.
On the challenging math olympiad benchmark AIME, ReTool-32B achieves 67% accuracy, substantially outperforming text-based RL baselines, and reaches 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%.
The approach also leads to emergent behaviors like code self-correction during complex reasoning tasks.
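The interleaving itself can be pictured as a rollout loop like the one below; the tag names, `model_generate`, and the `execute_code` sandbox are illustrative stand-ins rather than ReTool's actual interface.

```python
import re

def rollout_with_tool(model_generate, execute_code, prompt, max_rounds=8):
    """Sketch of an interleaved rollout: the model writes text, optionally opens a code block,
    the sandboxed execution result is appended, and generation resumes until a final answer."""
    transcript = prompt
    for _ in range(max_rounds):
        chunk = model_generate(transcript, stop=["</code>", "</answer>"])
        transcript += chunk
        code = re.search(r"<code>(.*)$", chunk, flags=re.S)
        if code is None:
            break                                   # no tool call -> final answer reached
        result = execute_code(code.group(1))        # run the snippet in a sandbox
        transcript += f"</code>\n<interpreter>{result}</interpreter>\n"
    return transcript
```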
Sleep-time Compute: Beyond Inference Scaling at Test-time
Overview:
Sleep-time compute allows LLMs to perform computations offline by anticipating user queries, aiming to reduce the high latency and cost associated with scaling test-time inference.
On modified reasoning tasks like Stateful GSM-Symbolic and Stateful AIME, this method reduces necessary test-time compute by approximately 5x for equivalent accuracy.
Scaling sleep-time compute further boosts accuracy by up to 18% on these tasks, and amortizing this computation across related queries decreases average cost per query by 2.5x.
The effectiveness of sleep-time compute correlates with the predictability of user queries.
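One way to picture the two phases, with a hypothetical `llm(prompt)` helper standing in for the model and purely illustrative prompt wording:

```python
def sleep_time_precompute(llm, context: str) -> str:
    """Offline phase: anticipate likely questions about a context and pre-derive useful facts."""
    return llm(
        "Here is a context:\n" + context +
        "\nBefore any question arrives, list inferences, intermediate results, and "
        "restatements that would likely help answer future questions about this context."
    )

def answer_at_test_time(llm, context: str, precomputed: str, query: str) -> str:
    """Online phase: answer using the augmented context, so less reasoning is needed live."""
    return llm(
        f"Context:\n{context}\n\nPre-computed notes:\n{precomputed}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Because the notes are tied to the context rather than to a single question, their cost can be amortized over every query that later touches that context.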
Overview:
Nemotron-H presents a family of efficient 8B and 56B hybrid Mamba-Transformer models that replace most self-attention layers with Mamba layers, delivering up to 3x faster inference than comparable state-of-the-art Transformers at similar accuracy. The family also includes a 47B variant compressed via MiniPuzzle, providing an additional 20% inference speedup, and an FP8 training recipe that matches BF16 quality.
Overview:
This paper introduces Kimina-Prover Preview, an LLM trained with a large-scale reinforcement learning pipeline from Qwen2.5-72B, pioneering a reasoning-driven exploration paradigm for formal theorem proving in Lean 4.
Employing a novel formal reasoning pattern, the model mimics human problem-solving strategies to achieve state-of-the-art performance on the miniF2F benchmark, reaching 80.7% pass@8192.
Kimina-Prover shows high sample efficiency, delivering strong results with minimal sampling and effective scaling with computational budget.
Furthermore, the work demonstrates clear performance scaling with model size, a trend previously unobserved for neural theorem provers, and its learned reasoning style differs from traditional search algorithms.
Overview:
This paper introduces CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework to discover and refine optimal pre-training data mixtures from unlabeled corpora.
The method embeds and clusters large datasets in a semantic space, then iteratively searches for optimal mixtures using a smaller proxy model and a performance predictor.
Continuous training on 400B tokens with a CLIMB-optimized mixture allows a 1B parameter model to outperform Llama-3.2-1B by 2.0%, while optimizing for a specific domain boosts performance by 5% over random sampling.
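A toy sketch of one search iteration under stated assumptions: `train_proxy(weights)` trains the small proxy model on a candidate mixture and returns the target metric, and a gradient-boosted regressor plays the role of the performance predictor.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def climb_iteration(num_clusters, train_proxy, n_candidates=32, n_train=8, rng=None):
    rng = rng or np.random.default_rng(0)
    # sample candidate mixture weights over the data clusters (points on the simplex)
    candidates = rng.dirichlet(np.ones(num_clusters), size=n_candidates)
    # train the proxy model on only a few mixtures (the expensive step)
    scores = np.array([train_proxy(w) for w in candidates[:n_train]])
    # fit a cheap predictor and use it to rank all remaining candidates
    predictor = GradientBoostingRegressor().fit(candidates[:n_train], scores)
    predicted = predictor.predict(candidates)
    return candidates[int(np.argmax(predicted))]   # best mixture, refined again next iteration
```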
Overview:
Dynamic Cheatsheet (DC) introduces a framework endowing black-box LLMs with persistent, self-curated memory for test-time learning, allowing reuse of strategies and code snippets.
This method substantially improves performance without ground-truth labels, doubling accuracy on AIME math exams and increasing Game of 24 success from 10% to 99% by recalling effective solutions.
Additional gains were demonstrated on knowledge-intensive tasks like GPQA-Diamond (+9%) and MMLU-Pro (+8%), all achieved without modifying the underlying LLM parameters.
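The test-time learning loop can be sketched as follows, with a black-box `llm(prompt)` helper and prompt wording that is only illustrative:

```python
def solve_with_cheatsheet(llm, tasks):
    memory = ""                                    # persistent, self-curated "cheatsheet"
    answers = []
    for task in tasks:
        answer = llm(
            f"Cheatsheet of reusable strategies and code snippets:\n{memory}\n\n"
            f"Task: {task}\nSolve the task, reusing the cheatsheet where helpful."
        )
        answers.append(answer)
        # the model curates its own memory: keep what generalizes, drop what does not
        memory = llm(
            f"Current cheatsheet:\n{memory}\n\nNew task and solution:\n{task}\n{answer}\n\n"
            "Update the cheatsheet: add reusable strategies or snippets, remove stale entries."
        )
    return answers
```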
Overview:
This research investigates how new information integrates into LLMs, identifying a "priming" effect where learning specific facts leads to their inappropriate application in unrelated contexts.
Using the introduced "Outlandish" dataset, the study demonstrates that the extent of this priming can be predicted by analyzing key token probabilities before learning, a finding consistent across various model architectures and sizes.
Two novel methods, "stepping-stone" text augmentation and "ignore-k" update pruning, are proposed to modulate this knowledge permeation.
These techniques substantially reduce undesirable priming effects by 50-95% while preserving the LLM's ability to learn new information accurately.
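A minimal sketch of the "ignore-k" idea, assuming a plain gradient step and an illustrative pruning fraction: the largest-magnitude entries of each parameter update are discarded before it is applied.

```python
import torch

def ignore_topk_update(param: torch.Tensor, grad: torch.Tensor, lr: float, k_frac: float = 0.08):
    """Apply a gradient step but drop the largest-magnitude update entries
    ("ignore-k" pruning, sketched; k_frac is an illustrative choice)."""
    update = -lr * grad
    k = max(1, int(k_frac * update.numel()))
    threshold = torch.topk(update.abs().flatten(), k).values.min()   # cutoff for top-k entries
    mask = update.abs() < threshold                                  # keep only smaller updates
    param.data.add_(update * mask)
```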
Overview:
InternVL3 introduces a native multimodal pre-training paradigm, enabling the joint acquisition of multimodal and linguistic capabilities from diverse data sources within a single stage, circumventing typical MLLM alignment challenges.
This approach incorporates Variable Visual Position Encoding (V2PE) for extended contexts, advanced post-training techniques like SFT and MPO, and test-time scaling strategies.
The InternVL3-78B model achieves a state-of-the-art 72.2 score on the MMMU benchmark among open-source MLLMs, proving highly competitive with leading proprietary models while retaining strong language proficiency.
Overview:
This work introduces the Massive Image Embedding Benchmark (MIEB) for comprehensive evaluation of image and image-text models across 130 tasks in 38 languages, grouped into 8 high-level categories.
Benchmarking 50 models reveals that no single method dominates across all task categories: models show strong visual text representation but limited ability to handle interleaved encodings and to match images in the presence of confounders.
Encoders' performance on MIEB also correlates highly with their effectiveness when used inside multimodal LLMs.
Overview:
REPA-E enables effective end-to-end training of variational auto-encoders (VAEs) alongside latent diffusion transformers, addressing the limitations of standard diffusion loss for joint optimization.
By employing a representation-alignment (REPA) loss, REPA-E facilitates simultaneous tuning of both the VAE and the diffusion model.
This approach accelerates diffusion model training by over 17x and 45x relative to REPA and vanilla training recipes, respectively, and improves the VAE's latent structure, leading to state-of-the-art generation performance on ImageNet.
Overview:
This paper introduces Seedream 3.0, a significantly improved Chinese-English bilingual image foundation model, with advancements across the entire pipeline, from data construction (defect-aware training and dual-axis sampling) through to deployment.
Key pre-training techniques include mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling, complemented by post-training using diversified SFT captions and a VLM-based reward model.
A novel acceleration paradigm, employing consistent noise expectation and importance-aware timestep sampling, achieves a 4 to 8 times speedup without quality degradation.
Seedream 3.0 shows enhanced capabilities in complex prompt adherence, fine-grained text rendering, especially for Chinese characters, improved visual quality, and native high-resolution (up to 2K) generation.
Overview:
This paper introduces Trelawney, a technique addressing the limitations of standard causal language model training by rearranging training data sequences to better imitate human goal-oriented generation without requiring architectural changes.
This approach improves performance on several key benchmarks, including planning, algorithmic reasoning, and story generation.
Furthermore, Trelawney naturally enables the generation of long-term goals at no additional cost, which can be leveraged to further enhance planning and reasoning capabilities.
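The data rearrangement can be illustrated as below; the `<goal>`/`</goal>` delimiter tokens and span positions are assumptions made for the sketch.

```python
def rearrange_with_goal(tokens, goal_start, goal_len, insert_at):
    """Copy a later span (the "goal") forward and wrap it in delimiter tokens, so the model
    learns to state where the text is heading before generating the steps toward it."""
    goal = tokens[goal_start:goal_start + goal_len]
    return tokens[:insert_at] + ["<goal>"] + goal + ["</goal>"] + tokens[insert_at:]

seq = ["Alice", "went", "to", "the", "market", "and", "bought", "apples", "."]
print(rearrange_with_goal(seq, goal_start=6, goal_len=2, insert_at=2))
```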
Overview:
This paper investigates predicting optimal pretraining data for LLMs using small-scale experiments, conducting controlled pretraining across diverse corpora, sizes up to 1B parameters, and 100B tokens.
Ranking models at a small scale (150M) predicts relative performance at a larger scale (1B) with high accuracy (~80%), outperforming tested scaling law methods for data selection efficiency.
Furthermore, continuous likelihood metrics from small models predict target-scale performance on benchmarks like MMLU and HumanEval with over 80% accuracy, while using only about 0.01% of the target-scale compute.
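The core check is just a rank correlation between scales; the score arrays below are placeholders, one entry per candidate pretraining corpus.

```python
from scipy.stats import spearmanr

small_scale_scores = [0.31, 0.27, 0.35, 0.29, 0.33]   # hypothetical benchmark accuracy at 150M
large_scale_scores = [0.52, 0.47, 0.58, 0.50, 0.55]   # hypothetical benchmark accuracy at 1B

rho, _ = spearmanr(small_scale_scores, large_scale_scores)
print(f"rank correlation between scales: {rho:.2f}")   # high rho -> small runs predict data choice
```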
Autoregressive Distillation of Diffusion Transformers
Overview:
This paper introduces AutoRegressive Distillation (ARD) to mitigate exposure bias in diffusion transformer distillation by leveraging the historical ODE trajectory instead of only the most recent sample.
ARD modifies the transformer architecture using token-wise time embeddings and a block-wise causal attention mask, incorporating history mainly in lower layers for efficiency.
On ImageNet-256, ARD achieves a 5x reduction in FID degradation over baselines with minimal extra FLOPs, reaching a low FID score in a few steps.
The approach also shows improved prompt adherence in text-to-image synthesis compared to other distilled models with little FID degradation from the teacher.
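A sketch of the block-wise causal attention mask over the concatenated trajectory steps (token-wise time embeddings and the restriction of history to lower layers are omitted):

```python
import torch

def blockwise_causal_mask(num_steps: int, tokens_per_step: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): tokens of ODE-trajectory step i
    can attend to all tokens of steps <= i, but never to future steps."""
    step_ids = torch.arange(num_steps).repeat_interleave(tokens_per_step)
    return step_ids[:, None] >= step_ids[None, :]

mask = blockwise_causal_mask(num_steps=4, tokens_per_step=3)   # (12, 12) block lower-triangular
```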
Perception Encoder: The best visual embeddings are not at the output of the network
Overview:
Perception Encoder (PE) introduces a vision encoder for image and video understanding trained solely via scaled vision-language contrastive learning.
This work finds that the strongest, most general visual embeddings are located within intermediate network layers, rather than the final output.
Utilizing proposed language and spatial alignment methods to extract these representations, PE achieves state-of-the-art performance across diverse downstream tasks including zero-shot classification, retrieval, Q&A, and dense spatial prediction.
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
Overview:
This paper introduces M1, a hybrid Mamba-based linear RNN reasoning model designed to overcome the quadratic scaling limitations of Transformers for complex reasoning tasks requiring long test-time computation.
M1 utilizes distillation from existing models and reinforcement learning for enhanced performance.
On benchmarks like AIME and MATH, M1 matches state-of-the-art distilled Transformer models while achieving over a 3x inference speedup.
This improved throughput allows M1 to attain higher accuracy under fixed time budgets via self-consistency, presenting a more scalable approach for test-time reasoning generation.
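The fixed-time-budget comparison boils down to a loop like this, with a hypothetical `generate_answer(problem)` that samples one full solution and returns its parsed final answer; higher throughput simply means more votes fit inside the budget.

```python
import time
from collections import Counter

def self_consistency_within_budget(generate_answer, problem, budget_seconds: float):
    answers = []
    start = time.time()
    while time.time() - start < budget_seconds:
        answers.append(generate_answer(problem))   # one independent sample per iteration
    return Counter(answers).most_common(1)[0][0]   # majority vote over sampled answers
```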
Overview:
The d1 framework adapts pre-trained masked diffusion LLMs (dLLMs) for reasoning using masked supervised finetuning and diffu-GRPO, a novel critic-free, policy-gradient reinforcement learning algorithm.
This approach significantly improves mathematical and logical reasoning performance on a state-of-the-art dLLM.
Overview:
Antidistillation sampling counters unwanted model distillation facilitated by the reasoning traces produced by frontier models.
This technique strategically modifies the model's next-token probability distribution during the generation process.
The resulting reasoning traces are effectively poisoned, significantly diminishing their utility for distillation while preserving the original model's practical performance.
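At the sampling interface this amounts to perturbing the next-token logits before sampling, as sketched below; how the per-token `distill_penalty` is actually estimated is the paper's contribution and is treated as a given here.

```python
import torch

def antidistillation_step(logits: torch.Tensor, distill_penalty: torch.Tensor, lam: float = 1.0):
    """Sample the next token from a perturbed distribution. `distill_penalty[v]` is assumed to
    estimate how much emitting token v would help a would-be student model; lam trades
    teacher quality against how strongly the trace is poisoned."""
    adjusted = logits - lam * distill_penalty
    probs = torch.softmax(adjusted, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```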
- Inference-Time Scaling for Generalist Reward Modeling
- Multi-Token Attention
- Why do LLMs attend to the first token?
- Command A
- LLMs Pass the Turing Test
- Advances and Challenges in Foundation Agents
- PaperBench
- Effectively Controlling Reasoning Models through Thinking Intervention
- TransMamba
- Open-Reasoner-Zero
- Scaling Tool-Integrated RL
- Scaling Language-Free Visual Representation Learning
- Output Constraints as Attack Surface
- Large (Vision) Language Models are Unsupervised In-Context Learners
- Memorizing is Not Enough
- ShortV
- MegaScale-Infer
- What the F*ck Is Artificial General Intelligence?
- Prompting Forgetting
- Enlightenment Period Improving DNN Performance
overview for each + authors' explanations
read this in thread mode for the best experience
Inference-Time Scaling for Generalist Reward Modeling
Overview:
This work from DeepSeek investigates inference-time scalability for generalist reward modeling (RM) in LLMs, utilizing pointwise generative reward modeling (GRM) for flexibility.
It introduces Self-Principled Critique Tuning (SPCT), an online RL method, to train DeepSeek-GRM models that adaptively generate principles and critiques for improved reward accuracy.
To enhance inference-time scaling, the study employs parallel sampling guided by a meta RM, demonstrating significantly improved quality and scalability on various RM benchmarks compared to existing methods and potentially exceeding training-time scaling benefits.
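The inference-time scaling recipe can be sketched as follows, assuming hypothetical `grm` and `meta_rm` callables: sample several principle-plus-critique judgments in parallel, keep those the meta RM rates highly, and vote over their per-response scores.

```python
from collections import defaultdict

# Assumed interfaces:
#   grm(prompt, responses) -> (principles, critiques, per_response_scores)   # one GRM sample
#   meta_rm(principles, critiques) -> quality score for that sample
def scaled_reward(grm, meta_rm, prompt, responses, n_samples=8, top_k=4):
    samples = [grm(prompt, responses) for _ in range(n_samples)]        # parallel GRM sampling
    # keep only the judgments the meta reward model rates highly
    kept = sorted(samples, key=lambda s: meta_rm(s[0], s[1]), reverse=True)[:top_k]
    totals = defaultdict(float)
    for _, _, scores in kept:                                           # vote by summing scores
        for i, s in enumerate(scores):
            totals[i] += s
    return max(totals, key=totals.get)                                  # index of best response
```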
Overview:
Multi-Token Attention (MTA) enhances LLM attention by conditioning weights on multiple query and key vectors simultaneously through convolution operations over queries, keys, and heads.
This method allows for locating relevant context using richer information, leading to enhanced performance over Transformer baselines on language modeling and long-context search tasks.
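A toy sketch of the key-query convolution component: a small depthwise 2D convolution over the (query, key) dimensions of the pre-softmax scores, so each attention weight can depend on several neighboring queries and keys. The paper's head-mixing convolution and causal-masking details are omitted here.

```python
import torch
import torch.nn as nn

class KeyQueryConvAttention(nn.Module):
    """Simplified multi-token attention: convolve pre-softmax attention scores over
    the (query, key) dimensions so each weight sees nearby queries and keys."""
    def __init__(self, num_heads: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(num_heads, num_heads, kernel_size,
                              padding=kernel_size // 2, groups=num_heads)

    def forward(self, q, k, v):                      # q, k, v: (batch, heads, seq, dim)
        scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / q.shape[-1] ** 0.5
        scores = self.conv(scores)                   # condition weights on neighboring positions
        attn = scores.softmax(dim=-1)
        return torch.einsum("bhqk,bhkd->bhqd", attn, v)

m = KeyQueryConvAttention(num_heads=4)
q = k = v = torch.randn(2, 4, 16, 32)
out = m(q, k, v)                                     # (2, 4, 16, 32)
```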
- Transformers without Normalization
- Block Diffusion
- Compute Optimal Scaling of Skills
- DAPO: An OS LLM RL System at Scale
- Teaching LLMs How to Learn with Contextual Fine-Tuning
- GR00T N1
- Why the Brain Cannot Be a Digital Computer
- RWKV-7 "Goose" with Expressive Dynamic State Evolution
- Why Do Multi-Agent LLM Systems Fail?
- Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
- Light-R1
- Where do Large Vision-Language Models Look at when Answering Questions?
- Improving Planning of Agents for Long-Horizon Tasks
- UniCombine
- How much do LLMs learn from negative examples?
- Tokenize Image as a Set
- Search-R1
- Measuring AI Ability to Complete Long Tasks
- Does Your VLM Get Lost in the Long Video Sampling Dilemma?
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
- Personalize Anything for Free with Diffusion Transformer
- The KoLMogorov Test: Compression by Code Generation
- Optimizing ML Training with Metagradient Descent
overview for each + authors' explanations
read this in thread mode for the best experience
Overview:
Dynamic Tanh (DyT) replaces normalization layers with a simple element-wise operation inspired by the tanh-like input-output mappings observed in layer normalization. Transformers with DyT match or exceed the performance of their normalized counterparts, a result validated across a range of computer vision and LLM tasks.
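The operation itself is tiny; below is a sketch of a drop-in replacement for LayerNorm (the form follows the paper's description, but treat the initialization specifics as approximate).

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Dynamic Tanh (DyT): DyT(x) = gamma * tanh(alpha * x) + beta,
    with a learnable scalar alpha and per-channel gamma/beta."""
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

# usage: swap nn.LayerNorm(dim) for DynamicTanh(dim) inside a Transformer block
layer = DynamicTanh(768)
y = layer(torch.randn(2, 10, 768))
```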
Overview:
Block diffusion language models merge discrete denoising diffusion with autoregressive models, addressing fixed-length generation limitations and enabling more efficient inference via KV caching and parallel token sampling.
The work introduces an efficient training algorithm, gradient variance estimators, and data-driven noise schedules that minimize variance, achieving state-of-the-art results among diffusion models on language modeling benchmarks and enabling flexible-length sequence generation.
- LLM Pretraining with Continuous Concepts
- Distillation Scaling Laws
- Can 1B LLM Surpass 405B LLM?
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
- Emergent Response Planning in LLM
- Improving Existing Optimization Algorithms with LLMs
- Training Language Models for Social Deduction with Multi-Agent RL
- Multi-Head Latent Attention Is All You Need
- Generative Modeling with Bayesian Sample Inference
- Scaling Pre-training to One Hundred Billion Data for Vision Language Models
- NatureLM: Deciphering the Language of Nature for Scientific Discovery
- Competitive Programming with Large Reasoning Models
- Matryoshka Quantization
overview for each + authors' explanations
read this in thread mode for the best experience
Overview:
CoCoMix is a pretraining framework that combines standard next-token prediction with continuous concepts derived from a pretrained sparse autoencoder, mixing these concepts into the model's hidden state by interleaving them with token hidden representations.
This approach improves sample efficiency of LLMs and consistently surpasses next token prediction, knowledge distillation, and pause token insertion across language modeling and reasoning tasks.
The integration of concept learning also enhances model interpretability and steerability.
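A toy sketch of the mixing step under simplifying assumptions: a head predicts concept activations (trained against SAE-derived targets in the paper, omitted here), the activations are compressed into a single continuous vector, and that vector is interleaved with the token hidden states.

```python
import torch
import torch.nn as nn

class ContinuousConceptMixer(nn.Module):
    """Simplified CoCoMix-style mixer: predict concepts, compress them into a
    continuous vector, and interleave that vector with token hidden states."""
    def __init__(self, hidden_dim: int, num_concepts: int):
        super().__init__()
        self.concept_head = nn.Linear(hidden_dim, num_concepts)   # predict concept activations
        self.compress = nn.Linear(num_concepts, hidden_dim)       # concepts -> continuous vector

    def forward(self, h):                        # h: (batch, seq, hidden_dim)
        concepts = self.concept_head(h)          # supervised by SAE concept labels (omitted)
        mixed = self.compress(concepts)
        # interleave: token state, concept vector, token state, concept vector, ...
        out = torch.stack([h, mixed], dim=2).flatten(1, 2)        # (batch, 2*seq, hidden_dim)
        return out, concepts
```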
Overview:
This work introduces a distillation scaling law that predicts student model performance based on compute allocation between the teacher and student, enabling optimal resource distribution.
It provides compute-optimal distillation strategies for cases where a teacher exists or needs training, showing that distillation outperforms supervised pretraining when multiple students are distilled.
However, when training both a teacher and a single student, supervised learning is preferable.
- Demystifying Long Chain-of-Thought Reasoning in LLMs
- OmniHuman-1
- LIMO
- s1: Simple test-time scaling
- Process Reinforcement through Implicit Rewards
- Iterate to Accelerate
- Efficient Reasoning with Hidden Thinking
- Fully Autonomous AI Agents Should Not be Developed
- DeepRAG
- Scalable-Softmax Is Superior for Attention
- The Differences Between Direct Alignment Algorithms are a Blur
- Preference Leakage
- SafeRAG
- Analyze Feature Flow to Enhance Interpretation and Steering in LMs
- Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
- ConceptAttention
- Weak-to-Strong Diffusion with Reflection
- Great Models Think Alike and this Undermines AI Oversight
- SmolLM2
- Inverse Bridge Matching Distillation
- Rethinking Mixture-of-Agents
overview for each + authors' explanations
read this in thread mode for the best experience
Demystifying Long Chain-of-Thought Reasoning in LLMs
Overview:
Long Chain-of-Thought reasoning in LLMs, which enables strategies like backtracking and error correction, is significantly enhanced by scaling inference compute, and the process can be optimized through reinforcement learning.
Although supervised fine-tuning (SFT) simplifies training and improves efficiency, scaling verifiable reward signals proves critical for RL, and noisy, web-extracted solutions show strong potential, especially for out-of-distribution tasks such as STEM reasoning.
Basic capabilities such as error correction are already present in base models, but incentivizing these skills for complex tasks via RL requires substantial compute, and measuring how such capabilities develop calls for a careful methodology.
Overview:
OmniHuman, a Diffusion Transformer-based framework, scales up data by incorporating mixed motion-related conditions during training, achieving highly realistic human video generation across diverse scenarios.
This framework supports varied portrait contents, handles talking, singing, human-object interactions, and diverse image styles, with specialized training principles, model architecture, and inference strategy.
OmniHuman provides greater flexibility in inputs including audio-driven, video-driven, and combined driving signals, surpassing existing end-to-end audio-driven methods in realism and versatility.
> Do generative video models learn physical principles from watching videos?
> Transformer^2: Self-adaptive LLMs
> MiniMax-01
> The Lessons of Developing Process Reward Models in Mathematical Reasoning
> Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
> Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
> Critical Tokens Matter
> Distilling Multi-modal Large Language Models for Autonomous Driving
> OmniThink
> Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
> MangaNinja
> Diffusion Adversarial Post-Training for One-Step Video Generation
overview for each + authors' explanations
read this in thread mode for the best experience
Do generative video models learn physical principles from watching videos?
Overview:
This paper investigates whether generative video models learn physical principles by introducing Physics-IQ, a benchmark requiring an understanding of physics like fluid dynamics and magnetism.
Although some models such as Sora and VideoPoet can solve specific test cases, their overall physical understanding remains limited, suggesting that visual realism does not imply comprehension of physical laws.
Overview:
Transformer^2 introduces a self-adaptive framework that adapts LLMs to unseen tasks on the fly by adjusting only singular components of their weight matrices. A two-pass mechanism first identifies the task and then mixes task-specific "expert" vectors, outperforming methods like LoRA in parameter efficiency and versatility across various architectures and tasks.
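The weight adjustment can be sketched as rescaling singular values with a learned expert vector; the decomposition below is computed on the fly purely for illustration, and the expert vector shown is hypothetical.

```python
import torch

def adapt_with_expert_vector(weight: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Sketch of singular-value adaptation: decompose W = U diag(s) V^T and rescale the
    singular values with a learned expert vector z, leaving U and V fixed."""
    U, s, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U @ torch.diag(s * z) @ Vh

W = torch.randn(64, 64)
z = torch.ones(64) * 1.1            # hypothetical expert vector for this weight matrix
W_adapted = adapt_with_expert_vector(W, z)
```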
- OpenAI o1 System Card
- PaliGemma 2
- HunyuanVideo
- Densing Law of LLMs
- DeMo: Decoupled Momentum Optimization
- o1-Coder
- Reverse Thinking Makes LLMs Stronger Reasoners
- Efficient Track Anything
- NVILA: Efficient Frontier VLMs
- Agent Skill Acquisition for LLMs via CycleQD
- A Noise is Worth Diffusion Guidance
- VisionZip: Longer is Better but Not Necessary in VLMs
- Infinity: Scaling Bitwise AutoRegressive Modeling for High-Res Image Synthesis
- Evaluating Language Models as Synthetic Data Generators
- Critical Tokens Matter
- SNOOPI
- TokenFlow
- MALT: Improving Reasoning with Multi-Agent LLM Training
- X-Prompt
- Video Depth without Video Models
- GRAPE: Generalizing Robot Policy via Preference Alignment
- Beyond Examples
- Scaling Transformers for Low-Bitrate High-Quality Speech Coding
- Retrieval-Augmented Reasoning Enhancement for LLMs
- Best-of-N Jailbreaking
- Composition of Experts
- Mind the Gap: Examining the Self-Improvement Capabilities of LLMs
- Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation
- Large-Scale T2I Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
- JetFormer
- Proactive Agent
- Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
- Distillation-Based NAS for Inference-Optimized LLMs
- Navigation World Models
overview for each + authors' explanations
read this in thread mode for the best experience
OpenAI is currently hosting an event called 12 Days of OpenAI. openai.com/12-days/
On day 1, they released the full version of OpenAI o1, along with a $200/mo tier offering uncapped usage.
About the technical report:
The o1 model series employs large-scale reinforcement learning with chain-of-thought reasoning, achieving state-of-the-art safety performance by mitigating risks such as generating illicit advice and producing biased responses, and by better resisting jailbreaks.
By reasoning about safety policies in context, these models enhance robustness but also introduce risks tied to advanced intelligence, emphasizing the need for rigorous alignment, stress testing, and risk management.
Safety evaluations, external red teaming, and Preparedness Framework assessments are the main topics of the report's analysis.
Overview:
PaliGemma 2 enhances the versatility of vision-language models by pairing the SigLIP-So400m vision encoder with Gemma 2 models of various sizes, trained across multiple resolutions for improved transfer learning. It achieves state-of-the-art performance on an expanded set of tasks, including OCR, molecular structure recognition, and radiography report generation.