🚨This week's top AI/ML research papers:
- BitNet b1.58 2B4T Technical Report
- Reasoning Models Can Be Effective Without Thinking
- ReTool
- Sleep-time Compute
- Nemotron-H
- Kimina-Prover Preview
- CLIMB
- Dynamic Cheatsheet
- How new data permeates LLM knowledge and how to dilute it
- InternVL3
- MIEB
- REPA-E
- Seedream 3.0 Technical Report
- Looking beyond the next token
- DataDecide
- Autoregressive Distillation of Diffusion Transformers
- Perception Encoder
- M1: Mamba Reasoning Models
- d1: Reasoning in Diffusion LLMs
- Antidistillation Sampling
overview for each + authors' explanations
read this in thread mode for the best experience
BitNet b1.58 2B4T Technical Report
Author's Explanation:
x.com/realHongyu_Wan…
Overview:
BitNet b1.58 2B4T is a native 1-bit LLM with 2 billion parameters trained on 4 trillion tokens, matching the performance of comparable full-precision LLMs on tasks like language understanding and reasoning.
This 1-bit architecture demonstrates substantial improvements in computational efficiency, marked by reduced memory footprint, energy usage, and faster decoding latency.
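A rough back-of-the-envelope sketch of the memory claim, not a figure from the report. It assumes each ternary weight in {-1, 0, +1} costs log2(3) ≈ 1.58 bits (the "b1.58" in the name), uses 16-bit weights as the assumed full-precision baseline, and ignores embeddings, activations, KV cache, and packing overhead.

```python
import math

def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 2e9              # 2 billion parameters, as stated in the overview
TERNARY_BITS = math.log2(3)   # ~1.58 bits for a weight in {-1, 0, +1}
FP16_BITS = 16                # assumed baseline width, not taken from the report

ternary_gb = weight_memory_gb(NUM_PARAMS, TERNARY_BITS)
fp16_gb = weight_memory_gb(NUM_PARAMS, FP16_BITS)
print(f"ternary: {ternary_gb:.2f} GB, fp16: {fp16_gb:.2f} GB, "
      f"ratio: {fp16_gb / ternary_gb:.1f}x")
```

Real deployments pack weights into whole bits or bytes, so the achieved ratio will differ from this idealized estimate.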
Paper:
arxiv.org/abs/2504.12285
Reasoning Models Can Be Effective Without Thinking
Overview:
This research questions the necessity of explicit "Thinking" steps for LLM reasoning, demonstrating that bypassing this process via simple "NoThinking" prompting is effective.
Controlling for token budget, NoThinking substantially outperforms explicit Thinking on diverse reasoning tasks including mathematical problem solving and coding, showing notable gains in low-budget settings (e.g., 51.3 vs 28.9 on AMC 23).
The paper introduces a parallel scaling approach where multiple independent NoThinking outputs are generated and aggregated using verifiers or best-of-N strategies.
This parallel method outperforms Thinking baselines at similar latency and matches Thinking results that need significantly longer latency (up to 9x).
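The aggregation step can be sketched roughly like this. It is a minimal illustration and not the paper's code: `aggregate_answers` and the `verifier` callable are hypothetical names, and falling back to majority voting when no verifier is given is an assumption of this sketch.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

def aggregate_answers(
    answers: Sequence[str],
    verifier: Optional[Callable[[str], bool]] = None,
) -> str:
    """Pick one final answer from N independently sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    if verifier is not None:
        # Prefer the first sampled answer that the verifier accepts.
        for answer in answers:
            if verifier(answer):
                return answer
    # Otherwise fall back to majority voting; ties go to the earliest-seen answer.
    return Counter(answers).most_common(1)[0][0]

# Four hypothetical NoThinking samples for the same question.
samples = ["42", "41", "42", "40"]
print(aggregate_answers(samples))
```

Because the N samples are independent, they can be generated in parallel, which is where the latency advantage comes from.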
Paper:
arxiv.org/abs/2504.09858
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Overview:
ReTool enhances LLMs for structured problem-solving by integrating real-time code execution with natural language reasoning through reinforcement learning.
The framework features dynamic interleaving of code and text, employing an automated RL paradigm where the model learns optimal tool invocation strategies from outcome feedback without human priors.
On the challenging MATH Olympiad benchmark AIME, ReTool-32B achieves 67% accuracy, substantially outperforming text-based RL baselines, and reaches 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%.
The approach also leads to emergent behaviors like code self-correction during complex reasoning tasks.
Paper:
arxiv.org/abs/2504.11536
Sleep-time Compute: Beyond Inference Scaling at Test-time
Overview:
Sleep-time compute allows LLMs to perform computations offline by anticipating user queries, aiming to reduce the high latency and cost associated with scaling test-time inference.
On modified reasoning tasks like Stateful GSM-Symbolic and Stateful AIME, this method reduces necessary test-time compute by approximately 5x for equivalent accuracy.
Scaling sleep-time compute further boosts accuracy by up to 18% on these tasks, and amortizing this computation across related queries decreases average cost per query by 2.5x.
The effectiveness of sleep-time compute correlates with the predictability of user queries.
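The amortization argument comes down to simple arithmetic, sketched here with made-up cost units. The function name and every number below are illustrative assumptions; only the idea of paying the sleep-time cost once per context and sharing it across queries comes from the overview.

```python
def average_cost_per_query(sleep_cost: float, test_cost: float, num_queries: int) -> float:
    """One-off sleep-time cost shared across queries, plus per-query test-time cost."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return sleep_cost / num_queries + test_cost

BASELINE_TEST_COST = 10.0   # hypothetical cost of answering with no sleep-time work
SLEEP_COST = 30.0           # hypothetical one-off offline cost per context
REDUCED_TEST_COST = 2.0     # hypothetical per-query cost after sleep-time work

for n in (1, 5, 10):
    with_sleep = average_cost_per_query(SLEEP_COST, REDUCED_TEST_COST, n)
    print(f"{n} queries: baseline {BASELINE_TEST_COST:.1f}, with sleep-time {with_sleep:.1f}")
```

With a single query the offline work does not pay off under these numbers; the benefit appears once several related queries share the same context.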
Paper:
arxiv.org/abs/2504.13171
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
Author's Explanation:
x.com/PavloMolchanov…
Overview:
Nemotron-H presents a family of efficient 8B and 56B hybrid Mamba-Transformer models that replace most self-attention with Mamba layers, delivering up to 3x faster inference than comparable state-of-the-art Transformers at similar accuracy levels.
A 47B variant compressed with MiniPuzzle provides an additional 20% speedup, and an FP8 training recipe achieves parity with BF16.
Paper:
arxiv.org/abs/2504.03624
Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning
Author's Explanation:
x.com/JiaLi52524397/…
Overview:
This paper introduces Kimina-Prover Preview, an LLM trained with a large-scale reinforcement learning pipeline from Qwen2.5-72B, pioneering a reasoning-driven exploration paradigm for formal theorem proving in Lean 4.
Employing a novel formal reasoning pattern, the model mimics human problem-solving strategies to achieve state-of-the-art performance on the miniF2F benchmark, reaching 80.7% pass@8192.
Kimina-Prover shows high sample efficiency, delivering strong results with minimal sampling and effective scaling with computational budget.
Furthermore, the work demonstrates clear performance scaling with model size, a trend previously unobserved for neural theorem provers, and its learned reasoning style differs from traditional search algorithms.
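For readers unfamiliar with pass@k: the thread does not say how pass@8192 was computed, so the sketch below shows the commonly used unbiased estimator as an assumption, not the authors' evaluation code. Given n sampled proofs of which c pass the Lean checker, it estimates the chance that at least one of k samples is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled proofs, c of which were verified correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # 1 minus the probability that a random size-k subset has no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 100 samples drawn, 3 verified.
print(pass_at_k(n=100, c=3, k=10))
```

The benchmark score is then the average of this quantity over all problems.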
Paper:
arxiv.org/abs/2504.11354
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Author's Explanation:
x.com/shizhediao/sta…
Overview:
This paper introduces CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework to discover and refine optimal pre-training data mixtures from unlabeled corpora.
The method embeds and clusters large datasets in a semantic space, then iteratively searches for optimal mixtures using a smaller proxy model and a performance predictor.
Continued training on 400B tokens with a CLIMB-optimized mixture allows a 1B parameter model to outperform Llama-3.2-1B by 2.0%, while optimizing for a specific domain boosts performance by 5% over random sampling.
Paper:
arxiv.org/abs/2504.13161
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
Author's Explanation:
x.com/james_y_zou/st…
Overview:
Dynamic Cheatsheet (DC) introduces a framework endowing black-box LLMs with persistent, self-curated memory for test-time learning, allowing reuse of strategies and code snippets.
This method substantially improves performance without ground-truth labels, doubling accuracy on AIME math exams and increasing Game of 24 success from 10% to 99% by recalling effective solutions.
Additional gains were demonstrated on knowledge-intensive tasks like GPQA-Diamond (+9%) and MMLU-Pro (+8%), all achieved without modifying the underlying LLM parameters.
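A minimal sketch of what a persistent, self-curated memory around a black-box model could look like. Everything here is hypothetical: the `CheatsheetSolver` class, the prompt wording, and the `llm` callable (any function mapping a prompt string to a completion string) are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, List

class CheatsheetSolver:
    """Wraps a black-box LLM with a small memory of reusable notes."""

    def __init__(self, llm: Callable[[str], str], max_notes: int = 20):
        self.llm = llm              # hypothetical prompt -> completion function
        self.max_notes = max_notes  # cap so the memory stays short
        self.notes: List[str] = []

    def solve(self, question: str) -> str:
        memory = "\n".join(f"- {note}" for note in self.notes) or "(empty)"
        answer = self.llm(f"Cheatsheet:\n{memory}\n\nQuestion: {question}\nAnswer:")
        # Ask the same model to distill one reusable insight; no labels are used.
        note = self.llm(f"Question: {question}\nAnswer: {answer}\nOne reusable insight:")
        if note and note not in self.notes:
            self.notes.append(note)
            self.notes = self.notes[-self.max_notes:]  # keep only the newest notes
        return answer
```

The model weights are never touched; only the prompt changes between queries, which matches the "without modifying the underlying LLM parameters" claim above.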
Paper:
arxiv.org/abs/2504.07952
How new data permeates LLM knowledge and how to dilute it
Author's Explanation:
x.com/ChenSun92/stat…
Overview:
This research investigates how new information integrates into LLMs, identifying a "priming" effect where learning specific facts leads to their inappropriate application in unrelated contexts.
Using the introduced "Outlandish" dataset, the study demonstrates that the extent of this priming can be predicted by analyzing key token probabilities before learning, a finding consistent across various model architectures and sizes.
Two novel methods, "stepping-stone" text augmentation and "ignore-k" update pruning, are proposed to modulate this knowledge permeation.
These techniques substantially reduce undesirable priming effects by 50-95% while preserving the LLM's ability to learn new information accurately.
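The thread does not spell out the pruning rule, so the sketch below rests on an assumption: that "ignore-k" means discarding the k% largest-magnitude entries of a parameter update and applying the rest. The function name and the flat-list representation of an update are hypothetical.

```python
from typing import List

def ignore_top_k(update: List[float], k_percent: float) -> List[float]:
    """Zero out the k% largest-magnitude entries of an update (assumed rule)."""
    if not 0.0 <= k_percent <= 100.0:
        raise ValueError("k_percent must be between 0 and 100")
    num_dropped = int(len(update) * k_percent / 100.0)
    if num_dropped == 0:
        return list(update)
    # Indices sorted from largest to smallest absolute value.
    order = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    dropped = set(order[:num_dropped])
    return [0.0 if i in dropped else value for i, value in enumerate(update)]

# Hypothetical update vector: the two largest entries are ignored.
print(ignore_top_k([0.1, -5.0, 0.2, 3.0], k_percent=50))
```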
Paper:
arxiv.org/abs/2504.09522
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Author's Explanation:
x.com/opengvlab/stat…
Overview:
InternVL3 introduces a native multimodal pre-training paradigm, enabling the joint acquisition of multimodal and linguistic capabilities from diverse data sources within a single stage, circumventing typical MLLM alignment challenges.
This approach incorporates Variable Visual Position Encoding (V2PE) for extended contexts, advanced post-training techniques like SFT and MPO, and test-time scaling strategies.
The InternVL3-78B model achieves a state-of-the-art 72.2 score on the MMMU benchmark among open-source MLLMs, proving highly competitive with leading proprietary models while retaining strong language proficiency.
Paper:
arxiv.org/abs/2504.10479
MIEB: Massive Image Embedding Benchmark
Overview:
This work introduces the Massive Image Embedding Benchmark (MIEB) for comprehensive evaluation of image and image-text models across 130 tasks in 38 languages, grouped into 8 high-level categories.
Benchmarking 50 models reveals no single dominant method across all task categories: models represent text in images well but remain limited on interleaved encodings and when confounders are present.
Encoder performance on MIEB correlates highly with their effectiveness in multimodal LLMs.
Paper:
arxiv.org/abs/2504.10471
REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
Author's Explanation:
x.com/1jaskiratsingh…
Overview:
REPA-E enables effective end-to-end training of variational auto-encoders (VAEs) alongside latent diffusion transformers, addressing the limitations of standard diffusion loss for joint optimization.
By employing a representation-alignment (REPA) loss, REPA-E facilitates simultaneous tuning of both the VAE and the diffusion model.
This approach accelerates diffusion model training by over 17x and 45x compared to the REPA and vanilla training recipes, respectively, and improves the VAE's latent structure, leading to state-of-the-art generation performance on ImageNet.
Paper:
arxiv.org/abs/2504.10483
Seedream 3.0 Technical Report
Overview:
This paper introduces Seedream 3.0, a significantly improved Chinese-English bilingual image foundation model featuring pipeline advancements from data construction using defect-aware training and dual-axis sampling to deployment.
Key pre-training techniques include mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling, complemented by post-training using diversified SFT captions and a VLM-based reward model.
A novel acceleration paradigm, employing consistent noise expectation and importance-aware timestep sampling, achieves a 4 to 8 times speedup without quality degradation.
Seedream 3.0 shows enhanced capabilities in complex prompt adherence, fine-grained text rendering, especially for Chinese characters, improved visual quality, and native high-resolution (up to 2K) generation.
Paper:
arxiv.org/abs/2504.11346
Looking beyond the next token
Overview:
This paper introduces Trelawney, a technique addressing the limitations of standard causal language model training by rearranging training data sequences to better imitate human goal-oriented generation without requiring architectural changes.
This approach improves performance on several key benchmarks, including planning, algorithmic reasoning, and story generation.
Furthermore, Trelawney naturally enables the generation of long-term goals at no additional cost, which can be leveraged to further enhance planning and reasoning capabilities.
Paper:
arxiv.org/abs/2504.11336
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Author's Explanation:
x.com/allen_ai/statu…
Overview:
This paper investigates predicting optimal pretraining data for LLMs using small-scale experiments, conducting controlled pretraining across diverse corpora, sizes up to 1B parameters, and 100B tokens.
Ranking models at a small scale (150M) predicts relative performance at a larger scale (1B) with high accuracy (~80%), outperforming tested scaling law methods for data selection efficiency.
Furthermore, using continuous likelihood metrics from small models as proxies makes benchmarks like MMLU and HumanEval more than 80% predictable at the target scale with just 0.01% of the compute.
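One way to read the ~80% figure is as a pairwise agreement rate: for each pair of candidate datasets, does the small-scale winner match the large-scale winner? The sketch below computes that under this assumption. The function name, dataset names, and scores are all hypothetical.

```python
from itertools import combinations
from typing import Dict

def decision_accuracy(small: Dict[str, float], large: Dict[str, float]) -> float:
    """Fraction of dataset pairs ranked the same way at small and large scale."""
    if set(small) != set(large):
        raise ValueError("both scales must score the same datasets")
    pairs = list(combinations(sorted(small), 2))
    if not pairs:
        raise ValueError("need at least two datasets")
    agree = sum((small[a] > small[b]) == (large[a] > large[b]) for a, b in pairs)
    return agree / len(pairs)

# Hypothetical benchmark scores for three pretraining corpora.
small_scores = {"corpus_a": 0.30, "corpus_b": 0.35, "corpus_c": 0.33}
large_scores = {"corpus_a": 0.50, "corpus_b": 0.60, "corpus_c": 0.45}
print(decision_accuracy(small_scores, large_scores))
```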
Paper:
arxiv.org/abs/2504.11393
Autoregressive Distillation of Diffusion Transformers
Overview:
This paper introduces AutoRegressive Distillation (ARD) to mitigate exposure bias in diffusion transformer distillation by leveraging the historical ODE trajectory instead of only the most recent sample.
ARD modifies the transformer architecture using token-wise time embeddings and a block-wise causal attention mask, incorporating history mainly in lower layers for efficiency.
On ImageNet-256, ARD achieves a 5x reduction in FID degradation over baselines with minimal extra FLOPs, reaching a low FID score in a few steps.
The approach also shows improved prompt adherence in text-to-image synthesis compared to other distilled models with little FID degradation from the teacher.
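A block-wise causal attention mask can be sketched as follows. This is a generic illustration under an assumption: each block holds the tokens of one step of the trajectory, and a token may attend to its own block and all earlier blocks. The function name and the list-of-lists layout are hypothetical, not the paper's code.

```python
from typing import List

def block_causal_mask(num_blocks: int, block_size: int) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    if num_blocks < 1 or block_size < 1:
        raise ValueError("num_blocks and block_size must be positive")
    total = num_blocks * block_size
    return [
        # Allowed when the key's block is the query's block or an earlier one.
        [(k // block_size) <= (q // block_size) for k in range(total)]
        for q in range(total)
    ]

# Two blocks of two tokens each; 1 = attend, 0 = masked.
for row in block_causal_mask(num_blocks=2, block_size=2):
    print("".join("1" if allowed else "0" for allowed in row))
```

Unlike a token-level causal mask, tokens inside the same block see each other in both directions.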
Paper:
arxiv.org/abs/2504.11295
Perception Encoder: The best visual embeddings are not at the output of the network
Overview:
Perception Encoder (PE) introduces a vision encoder for image and video understanding trained solely via scaled vision-language contrastive learning.
This work finds that the strongest, most general visual embeddings are located within intermediate network layers, rather than the final output.
Utilizing proposed language and spatial alignment methods to extract these representations, PE achieves state-of-the-art performance across diverse downstream tasks including zero-shot classification, retrieval, Q&A, and dense spatial prediction.
Paper:
arxiv.org/abs/2504.13181
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
Overview:
This paper introduces M1, a hybrid Mamba-based linear RNN reasoning model designed to overcome the quadratic scaling limitations of Transformers for complex reasoning tasks requiring long test-time computation.
M1 utilizes distillation from existing models and reinforcement learning for enhanced performance.
On benchmarks like AIME and MATH, M1 matches state-of-the-art distilled Transformer models while achieving over a 3x inference speedup.
This improved throughput allows M1 to attain higher accuracy under fixed time budgets via self-consistency, presenting a more scalable approach for test-time reasoning generation.
Paper:
arxiv.org/abs/2504.10449
d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Author's Explanation:
x.com/siyan_zhao/sta…
Overview:
The d1 framework adapts pre-trained masked diffusion LLMs (dLLMs) for reasoning using masked supervised finetuning and diffu-GRPO, a novel critic-free, policy-gradient reinforcement learning algorithm.
This approach significantly improves mathematical and logical reasoning performance on a state-of-the-art dLLM.
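"Critic-free" in GRPO-style methods usually means the baseline comes from a group of sampled completions for the same prompt, not from a learned value model. The sketch below shows that group-relative advantage step only. It is an assumption about the general recipe, uses the population standard deviation as a choice of this sketch, and leaves out everything specific to diffu-GRPO, such as how likelihoods are estimated for a masked diffusion model.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation within the group
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four completions of one math prompt (1 = correct).
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```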
Paper:
arxiv.org/abs/2504.12216
Antidistillation Sampling
Author's Explanation:
x.com/zicokolter/sta…
Overview:
Antidistillation sampling counters unwanted model distillation facilitated by the reasoning traces produced by frontier models.
This technique strategically modifies the model's next-token probability distribution during the generation process.
The result poisons these reasoning traces, significantly diminishing their utility for distillation tasks while preserving the original model's practical performance.
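A toy sketch of what "modifying the next-token distribution" can mean. It is not the paper's method: `distill_gain` is a hypothetical per-token score for how useful each token would be to a student model, `lam` is a hypothetical strength parameter, and how such a score would actually be estimated is left out entirely.

```python
import math
from typing import List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adjusted_distribution(
    logits: Sequence[float],
    distill_gain: Sequence[float],
    lam: float,
    temperature: float = 1.0,
) -> List[float]:
    """Down-weight tokens with a high (hypothetical) distillation gain."""
    if len(logits) != len(distill_gain):
        raise ValueError("logits and distill_gain must have the same length")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    adjusted = [l / temperature - lam * g for l, g in zip(logits, distill_gain)]
    return softmax(adjusted)

# Two equally likely tokens; the first is assumed more useful to a student.
print(adjusted_distribution([0.0, 0.0], distill_gain=[1.0, 0.0], lam=1.0))
```

Setting `lam` to 0 recovers ordinary temperature sampling, which makes the trade-off between trace poisoning and the original model's behavior explicit.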
Paper:
arxiv.org/abs/2504.13146
That's a wrap for last week, thanks for reading!
Remember to drop a follow @TheAITimeline & rt if you like it!
You can see a few in-depth explanations in my next few issues, stay tuned here:
mail.bycloud.ai/subscribe
Have a great start to your week!
x.com/TheAITimeline/…