The AI Timeline · Apr 21
🚨This week's top AI/ML research papers:

- BitNet b1.58 2B4T Technical Report
- Reasoning Models Can Be Effective Without Thinking
- ReTool
- Sleep-time Compute
- Nemotron-H
- Kimina-Prover Preview
- CLIMB
- Dynamic Cheatsheet
- How new data permeates LLM knowledge and how to dilute it
- InternVL3
- MIEB
- REPA-E
- Seedream 3.0 Technical Report
- Looking beyond the next token
- DataDecide
- Autoregressive Distillation of Diffusion Transformers
- Perception Encoder
- M1: Mamba Reasoning Models
- d1: Reasoning in Diffusion LLMs
- Antidistillation Sampling

overview for each + authors' explanations
read this in thread mode for the best experience
BitNet b1.58 2B4T Technical Report

Author's Explanation:
x.com/realHongyu_Wan…

Overview:
BitNet b1.58 2B4T is a native 1-bit LLM with 2 billion parameters trained on 4 trillion tokens, matching the performance of comparable full-precision LLMs on tasks like language understanding and reasoning.

This 1-bit architecture demonstrates substantial gains in computational efficiency, including a reduced memory footprint, lower energy usage, and lower decoding latency.
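
As a rough illustration of what "1.58-bit" means in practice, here is a minimal PyTorch sketch of absmean ternary quantization, the kind of scheme BitNet-style layers use to constrain weights to {-1, 0, +1}; the function name and shapes are mine, not taken from the paper's code.

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor):
    # Scale weights by their mean absolute value, then round each entry to {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=1e-8)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

w = torch.randn(256, 256)                 # hypothetical weight matrix
w_q, scale = absmean_ternary_quantize(w)
w_approx = w_q * scale                    # dequantized approximation used at matmul time
```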

Paper:
arxiv.org/abs/2504.12285
Reasoning Models Can Be Effective Without Thinking

Overview:
This research questions the necessity of explicit "Thinking" steps for LLM reasoning, demonstrating that bypassing this process via simple "NoThinking" prompting is effective.

Controlling for token budget, NoThinking substantially outperforms explicit Thinking on diverse reasoning tasks, including mathematical problem solving and coding, with especially notable gains in low-budget settings (e.g., 51.3 vs. 28.9 on AMC 23).

The paper introduces a parallel scaling approach where multiple independent NoThinking outputs are generated and aggregated using verifiers or best-of-N strategies.

This parallel method achieves better performance than Thinking baselines at similar latency, and matches the results of Thinking runs that require significantly more latency (up to 9x).
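
A minimal sketch of the parallel-scaling idea, assuming a generic `generate` call to any LLM sampling endpoint (the stand-in below just returns canned answers) and using majority voting in place of a task-specific verifier; the prefilled "finished thinking" text is an approximation of the paper's NoThinking prompt, not a quote of it.

```python
from collections import Counter

def no_thinking_prompt(question: str) -> str:
    # Prefill an (empty) thinking block so the model skips straight to the answer.
    return f"{question}\n<think>\nOkay, I have finished thinking.\n</think>\n"

def generate(prompt: str, seed: int) -> str:
    # Stand-in for a sampled LLM completion; replace with a real API call.
    return ["42", "42", "41"][seed % 3]

def parallel_no_thinking(question: str, n: int = 8) -> str:
    answers = [generate(no_thinking_prompt(question), seed=i) for i in range(n)]
    # Best-of-N via majority voting; the paper also uses verifiers where available.
    return Counter(answers).most_common(1)[0][0]

print(parallel_no_thinking("What is 6 * 7?"))
```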

Paper:
arxiv.org/abs/2504.09858
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Overview:
ReTool enhances LLMs for structured problem-solving by integrating real-time code execution with natural language reasoning through reinforcement learning.

The framework features dynamic interleaving of code and text, employing an automated RL paradigm where the model learns optimal tool invocation strategies from outcome feedback without human priors.
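
The interleaving itself boils down to a generate-execute-append loop. Here is a hedged sketch of that rollout pattern, assuming a hypothetical `generate` function that returns the model's next chunk of text; this is not ReTool's training code, only the inference-time loop the RL training optimizes over, with sandboxing omitted.

```python
import contextlib
import io
import re

CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_snippet(code: str) -> str:
    # Execute a generated snippet and capture its stdout (no sandboxing in this sketch).
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def rollout(generate, prompt: str, max_rounds: int = 4) -> str:
    # Alternate between model text and interpreter feedback until no code is emitted.
    context = prompt
    for _ in range(max_rounds):
        chunk = generate(context)          # hypothetical LLM call
        context += chunk
        match = CODE_BLOCK.search(chunk)
        if not match:
            break
        context += f"\n[interpreter output]\n{run_snippet(match.group(1))}\n"
    return context
```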

On the challenging math olympiad benchmark AIME 2024, ReTool-32B achieves 67% accuracy, substantially outperforming text-based RL baselines, and reaches 72.5% accuracy in an extended setting, surpassing OpenAI's o1-preview by 27.9%.

The approach also leads to emergent behaviors like code self-correction during complex reasoning tasks.

Paper:
arxiv.org/abs/2504.11536
Sleep-time Compute: Beyond Inference Scaling at Test-time

Overview:
Sleep-time compute allows LLMs to perform computations offline by anticipating user queries, aiming to reduce the high latency and cost associated with scaling test-time inference.
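
Conceptually this is a two-phase pipeline: spend tokens on the standing context before any query arrives, then answer cheaply against the precomputed notes. A minimal sketch, with `llm` standing in for any chat-completion call; the prompts are illustrative, not the paper's.

```python
def sleep_time_pass(llm, context: str) -> str:
    # Offline phase: anticipate likely questions and pre-derive useful intermediate facts.
    return llm(
        f"Context:\n{context}\n\n"
        "List the key facts and derived quantities a user is likely to ask about."
    )

def test_time_pass(llm, precomputed: str, query: str) -> str:
    # Online phase: condition on the precomputed notes instead of re-deriving them.
    return llm(f"Notes:\n{precomputed}\n\nQuestion: {query}\nAnswer concisely.")
```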

On modified reasoning tasks like Stateful GSM-Symbolic and Stateful AIME, this method reduces necessary test-time compute by approximately 5x for equivalent accuracy.

Scaling sleep-time compute further boosts accuracy by up to 18% on these tasks, and amortizing this computation across related queries decreases average cost per query by 2.5x.

The effectiveness of sleep-time compute correlates with the predictability of user queries.

Paper:
arxiv.org/abs/2504.13171
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

Author's Explanation:
x.com/PavloMolchanov…

Overview:
Nemotron-H presents a family of efficient 8B and 56B hybrid Mamba-Transformer models that replace most self-attention layers with Mamba layers, delivering up to 3x faster inference than comparable state-of-the-art Transformers at similar accuracy. The family also includes a 47B variant compressed with MiniPuzzle, providing an additional 20% inference speedup, and an FP8 training recipe that matches BF16 quality.

Paper:
arxiv.org/abs/2504.03624
Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Author's Explanation:
x.com/JiaLi52524397/…

Overview:
This paper introduces Kimina-Prover Preview, an LLM trained with a large-scale reinforcement learning pipeline from Qwen2.5-72B, pioneering a reasoning-driven exploration paradigm for formal theorem proving in Lean 4.

Employing a novel formal reasoning pattern, the model mimics human problem-solving strategies to achieve state-of-the-art performance on the miniF2F benchmark, reaching 80.7% pass@8192.

Kimina-Prover shows high sample efficiency, delivering strong results with minimal sampling and effective scaling with computational budget.

Furthermore, the work demonstrates clear performance scaling with model size, a trend previously unobserved for neural theorem provers, and its learned reasoning style differs from traditional search algorithms.

Paper:
arxiv.org/abs/2504.11354
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Author's Explanation:
x.com/shizhediao/sta…

Overview:
This paper introduces CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework to discover and refine optimal pre-training data mixtures from unlabeled corpora.

The method embeds and clusters large datasets in a semantic space, then iteratively searches for optimal mixtures using a smaller proxy model and a performance predictor.
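
A rough sketch of the two mechanical pieces, clustering documents in embedding space and drawing a training subset under candidate mixture weights, using scikit-learn and NumPy; the proxy-model training and predictor-guided search that drive CLIMB's iteration are omitted, and all names and sizes below are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_corpus(doc_embeddings: np.ndarray, n_clusters: int = 16) -> np.ndarray:
    # Group documents in embedding space; each cluster becomes a mixture component.
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(doc_embeddings)

def sample_under_mixture(weights: np.ndarray, cluster_ids: np.ndarray,
                         n_docs: int, rng: np.random.Generator) -> np.ndarray:
    # Draw document indices so cluster proportions follow the candidate mixture weights.
    probs = weights[cluster_ids].astype(float)
    probs /= probs.sum()
    return rng.choice(len(cluster_ids), size=n_docs, replace=False, p=probs)

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 32))                        # hypothetical document embeddings
ids = cluster_corpus(emb)
subset = sample_under_mixture(rng.dirichlet(np.ones(16)), ids, n_docs=200, rng=rng)
```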

Continued training on 400B tokens with a CLIMB-optimized mixture allows a 1B-parameter model to outperform Llama-3.2-1B by 2.0%, while optimizing the mixture for a specific domain boosts performance by 5% over random sampling.

Paper:
arxiv.org/abs/2504.13161
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

Author's Explanation:
x.com/james_y_zou/st…

Overview:
Dynamic Cheatsheet (DC) introduces a framework endowing black-box LLMs with persistent, self-curated memory for test-time learning, allowing reuse of strategies and code snippets.
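
A minimal sketch of that memory loop, with `llm` standing in for any black-box model call; the prompts and the "worth remembering" check are my simplifications, not the paper's curation prompts.

```python
class Cheatsheet:
    """Minimal persistent memory: keep strategies that worked and reuse them later."""

    def __init__(self):
        self.entries: list[str] = []

    def render(self, k: int = 5) -> str:
        return "\n".join(self.entries[-k:]) or "(empty)"

    def add(self, problem: str, strategy: str) -> None:
        self.entries.append(f"Problem: {problem}\nStrategy that worked: {strategy}")

def solve(llm, sheet: Cheatsheet, problem: str) -> str:
    # Condition generation on the current cheatsheet, then ask the model itself
    # whether anything is worth remembering (no ground-truth labels involved).
    answer = llm(f"Cheatsheet:\n{sheet.render()}\n\nProblem: {problem}")
    note = llm(
        f"Problem: {problem}\nAnswer: {answer}\n"
        "If a reusable strategy helped, state it in one line, otherwise say 'none'."
    )
    if note.strip().lower() != "none":
        sheet.add(problem, note.strip())
    return answer
```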

This method substantially improves performance without ground-truth labels, doubling accuracy on AIME math exams and increasing Game of 24 success from 10% to 99% by recalling effective solutions.

Additional gains were demonstrated on knowledge-intensive tasks like GPQA-Diamond (+9%) and MMLU-Pro (+8%), all achieved without modifying the underlying LLM parameters.

Paper:
arxiv.org/abs/2504.07952
How new data permeates LLM knowledge and how to dilute it

Author's Explanation:
x.com/ChenSun92/stat…

Overview:
This research investigates how new information integrates into LLMs, identifying a "priming" effect where learning specific facts leads to their inappropriate application in unrelated contexts.

Using the introduced "Outlandish" dataset, the study demonstrates that the extent of this priming can be predicted by analyzing key token probabilities before learning, a finding consistent across various model architectures and sizes.

Two novel methods, "stepping-stone" text augmentation and "ignore-k" update pruning, are proposed to modulate this knowledge permeation.
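
As I read it, "ignore-k" means dropping the largest-magnitude entries of a parameter update before applying it. A hedged PyTorch sketch of that single step under a plain-SGD framing (the function name, threshold handling, and optimizer choice are mine, not the paper's):

```python
import torch

def ignore_topk_step(param: torch.Tensor, grad: torch.Tensor, lr: float, k: int) -> None:
    # Compute the update, zero out its k largest-magnitude entries, then apply it.
    update = lr * grad
    if 0 < k < update.numel():
        threshold = update.abs().flatten().topk(k).values.min()
        update = torch.where(update.abs() >= threshold, torch.zeros_like(update), update)
    param.data -= update
```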

These techniques substantially reduce undesirable priming effects by 50-95% while preserving the LLM's ability to learn new information accurately.

Paper:
arxiv.org/abs/2504.09522
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Author's Explanation:
x.com/opengvlab/stat…

Overview:
InternVL3 introduces a native multimodal pre-training paradigm, enabling the joint acquisition of multimodal and linguistic capabilities from diverse data sources within a single stage, circumventing typical MLLM alignment challenges.

This approach incorporates Variable Visual Position Encoding (V2PE) for extended contexts, advanced post-training techniques like SFT and MPO, and test-time scaling strategies.

The InternVL3-78B model achieves a state-of-the-art 72.2 score on the MMMU benchmark among open-source MLLMs, proving highly competitive with leading proprietary models while retaining strong language proficiency.

Paper:
arxiv.org/abs/2504.10479
MIEB: Massive Image Embedding Benchmark

Overview:
This work introduces the Massive Image Embedding Benchmark (MIEB) for comprehensive evaluation of image and image-text models across 130 tasks in 38 languages, grouped into 8 high-level categories.

Benchmarking 50 models reveals no single dominant method: models represent visual text well but show limited capability on interleaved encodings and on matching images and text in the presence of confounders.

Encoders' performance on MIEB correlates highly with their effectiveness when used inside multimodal LLMs.

Paper:
arxiv.org/abs/2504.10471
REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Author's Explanation:
x.com/1jaskiratsingh…

Overview:
REPA-E enables effective end-to-end training of variational auto-encoders (VAEs) alongside latent diffusion transformers, addressing the limitations of standard diffusion loss for joint optimization.

By employing a representation-alignment (REPA) loss, REPA-E facilitates simultaneous tuning of both the VAE and the diffusion model.

This approach significantly accelerates diffusion model training, by over 17x and 45x compared with REPA and vanilla training recipes respectively, and improves the VAE's latent structure, leading to state-of-the-art generation performance on ImageNet.

Paper:
arxiv.org/abs/2504.10483
Seedream 3.0 Technical Report

Overview:
This paper introduces Seedream 3.0, a significantly improved Chinese-English bilingual image foundation model with advancements across the entire pipeline, from data construction (a defect-aware training paradigm and dual-axis collaborative data sampling) through to deployment.

Key pre-training techniques include mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling, complemented by post-training using diversified SFT captions and a VLM-based reward model.

A novel acceleration paradigm, employing consistent noise expectation and importance-aware timestep sampling, achieves a 4 to 8 times speedup without quality degradation.

Seedream 3.0 shows enhanced capabilities in complex prompt adherence, fine-grained text rendering (especially for Chinese characters), improved visual quality, and native high-resolution generation (up to 2K).

Paper:
arxiv.org/abs/2504.11346
Looking beyond the next token

Overview:
This paper introduces Trelawney, a technique addressing the limitations of standard causal language model training by rearranging training data sequences to better imitate human goal-oriented generation without requiring architectural changes.

This approach improves performance on several key benchmarks, including planning, algorithmic reasoning, and story generation.

Furthermore, Trelawney naturally enables the generation of long-term goals at no additional cost, which can be leveraged to further enhance planning and reasoning capabilities.

Paper:
arxiv.org/abs/2504.11336
DataDecide: How to Predict Best Pretraining Data with Small Experiments

Author's Explanation:
x.com/allen_ai/statu…

Overview:
This paper investigates predicting optimal pretraining data for LLMs using small-scale experiments, conducting controlled pretraining across diverse corpora, sizes up to 1B parameters, and 100B tokens.

Ranking models at a small scale (150M parameters) predicts relative performance at the larger 1B scale with high accuracy (~80%), outperforming the scaling-law extrapolation methods tested for data selection.
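
That ~80% figure is, in essence, pairwise decision accuracy: how often the small-scale ranking of two data recipes agrees with the target-scale ranking. A small sketch with made-up numbers (the recipe names and scores below are illustrative, not the paper's):

```python
from itertools import combinations

def pairwise_decision_accuracy(small_scores: dict, large_scores: dict) -> float:
    # Fraction of recipe pairs where the small-scale ranking agrees with the
    # large-scale ranking (ties counted as disagreements for simplicity).
    pairs = list(combinations(small_scores, 2))
    agree = sum(
        (small_scores[a] > small_scores[b]) == (large_scores[a] > large_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)

small = {"c4": 0.41, "dclm": 0.47, "fineweb": 0.45}   # hypothetical 150M results
large = {"c4": 0.52, "dclm": 0.61, "fineweb": 0.58}   # hypothetical 1B results
print(pairwise_decision_accuracy(small, large))
```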

Furthermore, continuous likelihood metrics from small models accurately predict target-scale performance (>80%) on benchmarks like MMLU and HumanEval using only about 0.01% of the target-scale compute.

Paper:
arxiv.org/abs/2504.11393
Autoregressive Distillation of Diffusion Transformers

Overview:
This paper introduces AutoRegressive Distillation (ARD) to mitigate exposure bias in diffusion transformer distillation by leveraging the historical ODE trajectory instead of only the most recent sample.

ARD modifies the transformer architecture using token-wise time embeddings and a block-wise causal attention mask, incorporating history mainly in lower layers for efficiency.
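
A small sketch of what a block-wise causal attention mask looks like: tokens attend freely within their own block and to all earlier blocks, but never to later ones. This is a generic construction of the masking pattern, not ARD's exact layer placement or time-embedding scheme.

```python
import torch

def blockwise_causal_mask(num_blocks: int, block_size: int) -> torch.Tensor:
    # True where a query token may attend a key token: same block or any earlier block.
    n = num_blocks * block_size
    block_id = torch.arange(n) // block_size
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)   # shape (n, n), queries x keys

mask = blockwise_causal_mask(num_blocks=3, block_size=4)
```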

On ImageNet-256, ARD achieves a 5x reduction in FID degradation over baselines with minimal extra FLOPs, reaching a low FID score in a few steps.

The approach also shows improved prompt adherence in text-to-image synthesis compared to other distilled models with little FID degradation from the teacher.

Paper:
arxiv.org/abs/2504.11295
Perception Encoder: The best visual embeddings are not at the output of the network

Overview:
Perception Encoder (PE) introduces a vision encoder for image and video understanding trained solely via scaled vision-language contrastive learning.

This work finds that the strongest, most general visual embeddings are located within intermediate network layers, rather than the final output.
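
Pulling embeddings from an intermediate layer rather than the head is straightforward with a forward hook; a generic PyTorch sketch of that extraction step (not PE's alignment methods), where the caller supplies the model, the chosen layer, and the input batch:

```python
import torch
import torch.nn as nn

def embed_from_layer(model: nn.Module, layer: nn.Module, x: torch.Tensor):
    """Capture the output of an intermediate layer during a forward pass."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["h"] = output   # whatever the chosen layer returns

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(x)
    finally:
        handle.remove()
    return captured["h"]
```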

Utilizing proposed language and spatial alignment methods to extract these representations, PE achieves state-of-the-art performance across diverse downstream tasks including zero-shot classification, retrieval, Q&A, and dense spatial prediction.

Paper:
arxiv.org/abs/2504.13181
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

Overview:
This paper introduces M1, a hybrid Mamba-based linear RNN reasoning model designed to overcome the quadratic scaling limitations of Transformers for complex reasoning tasks requiring long test-time computation.

M1 utilizes distillation from existing models and reinforcement learning for enhanced performance.

On benchmarks like AIME and MATH, M1 matches state-of-the-art distilled Transformer models while achieving over a 3x inference speedup.

This improved throughput allows M1 to attain higher accuracy under fixed time budgets via self-consistency, presenting a more scalable approach for test-time reasoning generation.

Paper:
arxiv.org/abs/2504.10449
d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

Author's Explanation:
x.com/siyan_zhao/sta…

Overview:
The d1 framework adapts pre-trained masked diffusion LLMs (dLLMs) for reasoning using masked supervised finetuning and diffu-GRPO, a novel critic-free, policy-gradient reinforcement learning algorithm.

This approach significantly improves mathematical and logical reasoning performance on a state-of-the-art dLLM.

Paper:
arxiv.org/abs/2504.12216
Antidistillation Sampling

Author's Explanation:
x.com/zicokolter/sta…

Overview:
Antidistillation sampling counters unwanted model distillation facilitated by the reasoning traces produced by frontier models.

This technique strategically modifies the model's next-token probability distribution during the generation process.
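
In spirit, the sampler draws from an adjusted next-token distribution in which tokens judged useful to a would-be student are down-weighted. A heavily simplified sketch of sampling from such an adjusted distribution; the paper derives the per-token adjustment from a proxy student's gradients, which is replaced here by a given penalty vector and a scale `alpha`, both hypothetical.

```python
import torch

def antidistillation_sample(logits: torch.Tensor, penalty: torch.Tensor,
                            alpha: float = 1.0) -> int:
    # Down-weight tokens with a high distillation-utility penalty, then sample.
    adjusted = logits - alpha * penalty
    probs = torch.softmax(adjusted, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.randn(32000)    # hypothetical vocabulary logits from the teacher
penalty = torch.rand(32000)    # hypothetical per-token penalty stands in for the real term
token_id = antidistillation_sample(logits, penalty, alpha=0.5)
```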

This poisons the resulting reasoning traces, significantly diminishing their utility for distillation while preserving the original model's practical performance.

Paper:
arxiv.org/abs/2504.13146
That's a wrap for last week, thanks for reading!

Remember to drop a follow @TheAITimeline & rt if you like it!

You can see a few in-depth explanations in my next few issues, stay tuned here:
mail.bycloud.ai/subscribe

Have a great start to your week!
x.com/TheAITimeline/…
