TuringPost
Newsletter exploring AI & ML - AI 101 - ML techniques - AI Business insights - Global dynamics - ML History. Led by @kseniase_. Save hours of research 👇🏼
Feb 18 5 tweets 2 min read
3 models to pay attention to:

▪️ LM2: Large Memory Models

- Uses a Transformer architecture with a memory module to improve long-context reasoning.
- Outperforms RMT by 37.1% and excels in multi-hop inference.

▪️ NatureLM:

- Is trained across scientific domains.
- Enhances tasks like SMILES-to-IUPAC translation and CRISPR RNA design for cross-domain applications.

▪️ Goedel-Prover:

- Advances formal proof generation
- Achieves 57.6% Pass@32 on miniF2F using expert iteration and statement formalizers.

Find the links below👇
1. LM2: Large Memory Models by Convergence Labs Ltd.

huggingface.co/papers/2502.06…
Feb 18 24 tweets 14 min read
The freshest AI/ML research of the week:

Our top 7
▪️ Matryoshka Quantization
▪️ LLM Pretraining with Continuous Concepts
▪️ LLMs can easily learn to reason from demonstrations
▪️ Forget what you know about LLMs evaluations – LLMs are like a chameleon
▪️ Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
▪️ Hephaestus
▪️ SynthDetoxM Dataset

▪️ The Curse of Depth in LLMs
▪️ InfiniteHiP
▪️ Distillation Scaling Laws
▪️ TransMLA: Multi-Head Latent Attention
▪️ Logical reasoning in LLMs: A survey
▪️ ReasonFlux
▪️ How Stanford’s s1 surpasses DeepSeek-R1
▪️ The Stochastic Parrot on LLM’s Shoulder
▪️ Training LMs for Social Deduction with Multi-Agent RL
▪️ Towards Internet-scale training for agents
▪️ WorldGUI
▪️ CoSER: Coordinating LLM-Based Persona Simulation
▪️ Scaling Pre-training to One Hundred Billion Data for VLMs
▪️ Adapting Language-Specific LLMs to Reasoning Models

🧵
1. Matryoshka Quantization from @GoogleDeepMind

Introduces MatQuant, a multi-scale quantization method that mixes int2, int4, and int8 layers for efficient model deployment

huggingface.co/papers/2502.06… x.com/12714828789589…
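Based on the one-line summary above, here is a toy numpy sketch of the nested-bits idea behind Matryoshka-style quantization: the int4 and int2 codes are slices of the int8 code's most significant bits, so one set of weights serves all three precisions. The function name and per-tensor scaling are my own simplifications; MatQuant itself trains the nested precisions jointly.

```python
import numpy as np

def matryoshka_quantize(w: np.ndarray):
    """Toy sketch of nested integer quantization (not the paper's code)."""
    scale = np.abs(w).max() / 127.0                      # symmetric per-tensor scale
    q8 = np.clip(np.round(w / scale), -128, 127).astype(np.int8)

    # Nested codes: keep only the most significant bits of the int8 code.
    q4 = q8 >> 4        # top 4 bits -> int4 range [-8, 7]
    q2 = q8 >> 6        # top 2 bits -> int2 range [-2, 1]

    # Dequantize each precision with a scale that compensates the shift.
    w8 = q8 * scale
    w4 = q4 * (scale * 16)   # 2**4
    w2 = q2 * (scale * 64)   # 2**6
    return w8, w4, w2

w = np.random.randn(4, 4).astype(np.float32)
w8, w4, w2 = matryoshka_quantize(w)   # three deployable precisions, one tensor
```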
Feb 16 8 tweets 4 min read
Free useful guides on model distillations:

1. Model Distillation guide from @OpenAI
2. Knowledge Distillation tutorial by @PyTorch
3. Jetson Introduction to Knowledge Distillation by @nvidia
4. Tutorial on Knowledge Distillation with @kerasteam
5. @huggingface's guides:
- Knowledge Distillation
- Knowledge Distillation for Computer Vision

Save the link and check out the links below 👇

1. Model Distillation guide from @OpenAI

Explains this process step by step (sketched below), including:
- storing outputs from a large model
- evaluating both the large and small models
- creating training data for the small model
- assessing the fine-tuned small model

platform.openai.com/docs/guides/di…
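As a hedged illustration of the first step (storing the large model's outputs), the sketch below uses the OpenAI Python SDK's `store` flag, which the guide pairs with `metadata` for filtering stored completions later; exact parameter names may have changed since.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Store the "teacher" model's completion so it can later become training
# data for a smaller model. `store` / `metadata` follow OpenAI's
# distillation guide; treat field names as an assumption, not gospel.
response = client.chat.completions.create(
    model="gpt-4o",  # the large teacher model
    messages=[{"role": "user", "content": "Explain model distillation briefly."}],
    store=True,                                   # persist for distillation
    metadata={"purpose": "distillation-demo"},    # easy to filter stored runs
)
print(response.choices[0].message.content)
```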
Feb 15 10 tweets 3 min read
Distillation involves using a large teacher model to train a smaller student one.

But can we predict a distilled model’s performance based on teacher quality, student size, data volume, etc.?

@Apple and @UniofOxford explored this and developed distillation scaling laws.

Here are the key takeaways👇

1. A good teacher doesn’t always mean a better student:

If a teacher is too strong, the student might struggle to learn from it, leading to worse performance.
This is called the capacity gap — when the student isn’t powerful enough to properly mimic the teacher.
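For context, here is a minimal PyTorch sketch of the standard knowledge-distillation objective the paper's scaling laws describe: the student matches the teacher's temperature-softened distribution. This is the classic KD loss, not the paper's scaling-law formula; a too-strong teacher produces targets a small student struggles to match.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Classic KD objective: KL between softened teacher/student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 to keep gradients comparable
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

student = torch.randn(8, 1000)   # fake student logits (batch, vocab)
teacher = torch.randn(8, 1000)   # fake teacher logits
loss = distillation_loss(student, teacher)
```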
Feb 10 23 tweets 13 min read
The freshest AI/ML research of the week:

Our top 4
▪️ AlphaGeometry2
▪️ ZebraLogic
▪️ Limo: Less is More for Reasoning
▪️ Great Models Think Alike and this Undermines AI Oversight

▪️ Activation-Informed Merging of LLMs
▪️ Content-Format Integrated Prompt Optimization (CFPO)
▪️ BOLT: Bootstrapping Long Chain-of-Thought
▪️ Token Assorted: Mixing Latent & Text Tokens
▪️ ScoreFlow
▪️ The Jumping Reasoning Curve?
▪️ Demystifying Long Chain-of-Thought Reasoning in LLMs
▪️ MAGA
▪️ ParetoQ: Scaling Laws in Extremely Low-Bit LLM Quantization
▪️ Analyze Feature Flow to Enhance Interpretation and Steering in LMs
▪️ PILAF
▪️ DuoGuard
▪️ Limitations of LLMs in Clinical Problem-Solving
▪️ AI and Legal Analysis
▪️ HackerRank-ASTRA
▪️ The Open-Source Advantage in LLMs
▪️ UltraIF: Advancing Instruction-Following

🧵
1. AlphaGeometry2 (Olympiad Geometry Solver) from @GoogleDeepMind

Enhances AlphaGeometry to solve IMO-level geometry problems with a broader formal language

huggingface.co/papers/2502.03… x.com/12714828789589…
Feb 10 6 tweets 3 min read
Sliding Tile Attention (STA) speeds up video generation by up to 3.53x.

It focuses only on small, relevant regions at a time and moves across the video in a sliding pattern.

STA processes larger chunks (tiles) at once, making it faster and more hardware-efficient.

Here's how it works:
Firstly, what's wrong with current methods?

3D attention, which is generally used in Diffusion Transformers (DiTs), processes all video frames at once, treating every pixel separately. This consumes a huge amount of computing power, about 70% of the total effort.

The problem with traditional Sliding Window Attention (SWA) is that it creates "mixed blocks," which are inefficient for GPUs.

That's why researchers proposed the Sliding Tile Attention (STA) method.
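Here is a toy 1D numpy sketch of why tiles help: every (tile x tile) block of the attention mask comes out either fully on or fully off, so there are no GPU-unfriendly mixed blocks. STA itself slides over 3D (time x height x width) video tiles; the names and shapes here are illustrative.

```python
import numpy as np

def tile_attention_mask(num_tokens, tile, window_tiles):
    """Toy 1D sketch of tile-level sliding attention (not the paper's kernel)."""
    n_tiles = num_tokens // tile
    mask = np.zeros((num_tokens, num_tokens), dtype=bool)
    for qt in range(n_tiles):                      # each query tile...
        lo = max(0, qt - window_tiles)
        hi = min(n_tiles, qt + window_tiles + 1)
        # ...attends to whole neighboring tiles, so every (tile x tile)
        # block of the mask is all-ones or all-zeros -- no mixed blocks.
        mask[qt * tile:(qt + 1) * tile, lo * tile:hi * tile] = True
    return mask

mask = tile_attention_mask(num_tokens=16, tile=4, window_tiles=1)
```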
Feb 7 5 tweets 3 min read
Flow Q-learning (FQL) from @berkeley_ai is a new offline reinforcement learning method that improves how AI learns from past data.

Instead of training a step-by-step flow policy, FQL:

• Trains a flow policy only to mimic past actions using behavior cloning (BC).
• Trains a one-step policy with RL, which learns the best actions while using the flow policy for guidance.

Here are the details:

What problem does FQL solve?

Offline RL balances between sticking to past behaviors and choosing the best possible actions.

Doing this directly with a flow policy makes training unstable and expensive because it requires backpropagation through time.

So here comes FQL👇
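Before that, here is a minimal PyTorch sketch of the two-policy split described above. The networks, shapes, and loss terms are my own simplifications of the idea, not the paper's implementation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
# Flow policy: trained ONLY by behavior cloning (flow matching) on the data.
flow_policy = nn.Sequential(nn.Linear(obs_dim + act_dim + 1, 64), nn.ReLU(),
                            nn.Linear(64, act_dim))
# One-step policy: noise -> action in a single step, trained with RL.
onestep_policy = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                               nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

obs = torch.randn(32, obs_dim)
data_act = torch.randn(32, act_dim)   # actions from the offline dataset

# 1) BC via flow matching: predict the velocity that carries noisy actions
#    x_t = (1-t)*noise + t*action toward the dataset action.
noise = torch.randn_like(data_act)
t = torch.rand(32, 1)
x_t = (1 - t) * noise + t * data_act
v_pred = flow_policy(torch.cat([obs, x_t, t], dim=-1))
bc_loss = ((v_pred - (data_act - noise)) ** 2).mean()

# 2) RL on the one-step policy: maximize Q. FQL also adds a distillation
#    term pulling these actions toward the flow policy's samples (omitted).
z = torch.randn(32, act_dim)
a_onestep = onestep_policy(torch.cat([obs, z], dim=-1))
q_loss = -critic(torch.cat([obs, a_onestep], dim=-1)).mean()
```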
Feb 3 6 tweets 3 min read
s1 is a new, simple open-source test-time scaling approach from @Stanford.

With s1 researchers found the simplest way to improve reasoning through test-time scaling.

s1's innovations:

• A small s1K dataset with 1,000 tough and diverse questions, each with detailed reasoning steps for training.
• Budget forcing, which controls how long the model thinks (sketched at the end of this thread).

What about the results of s1?

- It gets up to 27% higher scores on math problems compared to OpenAI’s o1-preview model.
- Performance jumped from 50% to 57% on a math competition even without extra test-time optimizations.

More details:

s1K dataset:

The researchers started with a big pool of 59,000 questions from many different sources. But instead of using all of them, they carefully filtered the dataset down to the 1,000 best questions, focusing on quality, difficulty, and diversity.
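The budget-forcing trick itself is easy to sketch. Below, `model.generate` and its return fields are a hypothetical helper, not a real API: to make the model think longer, the end-of-thinking delimiter is suppressed and "Wait" is appended so the model re-examines its reasoning.

```python
# Hedged sketch of budget forcing as described in the s1 paper.
# `model.generate`, `stop`, `.num_tokens`, `.text` are illustrative names.

def budget_forced_generate(model, prompt, min_thinking_tokens):
    trace = model.generate(prompt, stop="</think>")        # first reasoning pass
    while trace.num_tokens < min_thinking_tokens:
        # Model tried to stop early: swap the delimiter for "Wait" so it
        # continues reasoning (thinking LONGER). Forcing the delimiter
        # early would make it think LESS.
        trace = model.generate(prompt + trace.text + "Wait", stop="</think>")
    return model.generate(prompt + trace.text + "</think>")  # final answer
```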
Feb 3 7 tweets 3 min read
Over-Tokenized Transformers framework changes how models handle tokens.

Normally, input and output tokens come from the same vocabulary, but here @ByteDanceOSS separates them, using larger input vocabularies with multi-word tokens instead of just smaller pieces like single words or subwords.

Their key finding is:

Using more complex input tokens makes the model learn better, no matter its size.

Here are the details:
The Over-Tokenized Transformers (OT) approach separates input and output tokenization, allowing each to be optimized separately. Two main techniques are used to obtain OT:

- Over-Encoding (OE) (improves input tokenization)
- Over-Decoding (OD) (improves output tokenization)
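Here is a toy PyTorch sketch of the Over-Encoding side: the input embedding sums a normal token embedding with a hashed multi-token (here, bigram) embedding, effectively enlarging the input vocabulary without touching the output one. The hash function and table sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class OverEncodedEmbedding(nn.Module):
    """Toy sketch of Over-Encoding: token embedding + hashed bigram embedding."""

    def __init__(self, vocab_size, ngram_buckets, dim):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.bigram = nn.Embedding(ngram_buckets, dim)   # hashed 2-gram table
        self.ngram_buckets = ngram_buckets

    def forward(self, ids):                               # ids: (batch, seq)
        prev = torch.roll(ids, shifts=1, dims=1)
        prev[:, 0] = 0                                    # no bigram for first token
        # Cheap illustrative hash mapping (prev, current) pairs to buckets.
        bigram_id = (ids * 1_000_003 + prev) % self.ngram_buckets
        return self.tok(ids) + self.bigram(bigram_id)

emb = OverEncodedEmbedding(vocab_size=32_000, ngram_buckets=500_000, dim=64)
x = emb(torch.randint(0, 32_000, (2, 16)))                # (2, 16, 64)
```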
Jan 30 9 tweets 4 min read
Mixture-of-Mamba (MoM) from @Stanford, @CarnegieMellon and @AIatMeta expands the benefits of Transformers to State Space Models (SSMs), making them better for multimodal tasks.

MoM selects the best processing pathways for text, images, or speech dynamically, using modality-aware sparsity (it's like a router in MoE).

Benefits:

• MoM only needs 25-40% of the processing power (FLOPs) of traditional methods.

• It performs well across various multimodal settings, including:

- Transfusion (text + continuous images),
- Chameleon (text + discrete images)
- a new three-modality setup (text + images + speech).

Here's how it works:

The core idea:

By separating modality-specific components from shared components of the architecture, MoM efficiently processes multimodal data while keeping computational costs low.
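A minimal PyTorch sketch of modality-aware sparsity: each token is processed by the weights of its own modality, like an MoE router whose routing key is simply the known modality id. Real MoM applies this inside Mamba's projections; this standalone layer is my simplification.

```python
import torch
import torch.nn as nn

class ModalityAwareProjection(nn.Module):
    """Toy sketch: one projection per modality, routed by modality id."""

    def __init__(self, dim, num_modalities=3):   # 0=text, 1=image, 2=speech
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim)
                                     for _ in range(num_modalities))

    def forward(self, x, modality_ids):           # x: (tokens, dim)
        out = torch.empty_like(x)
        for m, expert in enumerate(self.experts):
            sel = modality_ids == m
            if sel.any():
                out[sel] = expert(x[sel])         # only modality-m weights run
        return out

layer = ModalityAwareProjection(dim=32)
x = torch.randn(10, 32)
mods = torch.randint(0, 3, (10,))
y = layer(x, mods)
```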
Jan 29 8 tweets 3 min read
.@Microsoft introduced a smarter retrieval system: Chain-of-Retrieval Augmented Generation, or CoRAG.

It leverages a dynamic retrieval process, which means that the model can:

• Retrieve information step by step and adjust it.
• Reformulate queries if needed.
• Decide how many retrieval steps to take.

Here are the details on how CoRAG works and how it is trained:

The CoRAG framework consists of 3 main parts:

• Generating retrieval chains
• Training the model on these enhanced datasets
• Adjusting how much computing power is used at test time
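A hedged sketch of the retrieval chain at inference time; `llm` and `retriever` are hypothetical callables, and the real system also trains on generated retrieval chains and budgets test-time steps.

```python
# Sketch of CoRAG-style chain-of-retrieval (my simplification).

def corag_answer(question, llm, retriever, max_steps=4):
    chain = []   # (sub_query, sub_answer) pairs built step by step
    for _ in range(max_steps):
        sub_query = llm(f"Question: {question}\nChain so far: {chain}\n"
                        "Next sub-query (or STOP if ready to answer):")
        if sub_query.strip() == "STOP":       # model decides how many steps
            break
        docs = retriever(sub_query)           # step-wise retrieval
        sub_answer = llm(f"Answer '{sub_query}' using: {docs}")
        chain.append((sub_query, sub_answer)) # adjusted, reformulated queries
    return llm(f"Question: {question}\nEvidence chain: {chain}\nFinal answer:")
```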
Jan 28 22 tweets 13 min read
The freshest AI/ML research of the week:

Our top 10:
▪️ Demons in the Detail
▪️ Autonomy-of-Experts Models
▪️ Evolving Deeper LLM Thinking
▪️ Agent-R
▪️ Reasoning Language Models: A Blueprint
▪️ SRMT: Shared Memory for Multi-Agent Lifelong Pathfinding
▪️ UI-TARS
▪️ Trading Inference-Time Compute for Adversarial Robustness
▪️ LLMs Can Plan Only If We Tell Them
▪️ Good Things Come in Small Packages

▪️ O1-Pruner
▪️ Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
▪️ Test-Time Preference Optimization
▪️ Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
▪️ Chain-of-Retrieval Augmented Generation
▪️ Can We Generate Images with CoT?
▪️ InternLM-XComposer2.5-Reward
▪️ Evolution and the Knightian Blindspot of Machine Learning
▪️ Debate Helps Weak-to-Strong Generalization
▪️ Multiple Predictions of Others’ Actions in the Human Brain

🧵
1. Demons in the Detail by @Alibaba_Qwen

Introduces load-balancing loss for training Mixture-of-Experts models.

huggingface.co/papers/2501.11…
Jan 25 9 tweets 3 min read
.@GoogleAI proposed a Chain-of-Agents (CoA) framework that uses multiple AI agents working together to reason through long texts.

It outperforms RAG and full-context processing by up to 10%!

Here's how CoA works:

1. Worker agents handle different parts of the text and pass their insights to the next agent.
2. A manager agent combines these insights into a final, coherent output.

Details below:

1. Worker agents:

Each worker agent processes one chunk, combines it with the previous agent’s findings, and passes the result, called a "communication unit", to the next worker.

• For question answering, the workers extract evidence from their chunks.
• For summarization, they summarize their assigned chunks of the text.
• For code completion, they create summaries of the code, including function or class details.
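A hedged sketch of the worker/manager pipeline; `llm` is a hypothetical text-in/text-out callable, and the chunking and prompts are simplified.

```python
# Sketch of a Chain-of-Agents pass over a long text (my simplification).

def chain_of_agents(long_text, question, llm, chunk_size=4000):
    chunks = [long_text[i:i + chunk_size]
              for i in range(0, len(long_text), chunk_size)]
    cu = ""  # the "communication unit" passed from worker to worker
    for chunk in chunks:
        cu = llm(f"Previous findings: {cu}\n"
                 f"Read this chunk and update the findings relevant to "
                 f"'{question}':\n{chunk}")
    # Manager agent combines the accumulated insights into a final output.
    return llm(f"Using these findings: {cu}\nAnswer: {question}")
```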
Jan 12 7 tweets 3 min read
Search-o1 integrates large reasoning models (LRMs) like OpenAI's o1 with agentic search workflow.

It combines reasoning with an agentic RAG mechanism and a special knowledge refinement module to improve accuracy and reliability of LRMs.

Here's how it works: Image 1. Agentic RAG:

When the model identifies a knowledge gap, it creates search queries to retrieve relevant external documents from a knowledge base or web corpus dynamically.

These documents are added to the reasoning chain to fill in missing knowledge.
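A hedged sketch of the loop; `lrm`, `search`, and `refine` are hypothetical callables, and the search-tag format is illustrative rather than the paper's exact tokens.

```python
# Sketch of a Search-o1-style reason-search-refine loop (my simplification).

def search_o1(question, lrm, search, refine, max_rounds=5):
    chain = f"Question: {question}\n"
    for _ in range(max_rounds):
        step = lrm(chain)                     # continue the reasoning chain
        chain += step
        if "<|search|>" not in step:          # no knowledge gap -> done
            break
        query = step.split("<|search|>")[1].split("<|/search|>")[0]
        docs = search(query)                  # agentic retrieval
        chain += refine(docs, chain)          # knowledge refinement module
    return chain
```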
Jan 9 10 tweets 4 min read
Agent Laboratory is a new tool designed to help researchers implement their own ideas, rather than replacing them.

@AMD and @JohnsHopkins made it open-source and adaptable to different levels of computing power.

What are the benefits?

- Researchers can input their idea, and Agent Laboratory uses AI agents to generate a research report and code repository.
- Users can choose their feedback level to suit their needs.
- A Co-pilot mode lets researchers work alongside the AI for higher-quality outputs.

Here's how it works in detail:

1. Literature review

The PhD agent gathers and analyzes research papers relevant to the user's idea, using online tools like the arXiv API.

It summarizes abstracts, retrieves full texts if needed, and selects and organizes the most useful papers into a curated review.

This process is iterative - the PhD agent refines its search and selection until it builds a comprehensive review.
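As a small concrete example of the retrieval part, the sketch below queries arXiv's public Atom API (a real endpoint) for titles and abstracts, the raw material such an agent would summarize and filter; the ranking and iteration around it are omitted.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def arxiv_search(idea, max_results=5):
    """Fetch (title, abstract) pairs from arXiv's export API for an idea."""
    url = ("http://export.arxiv.org/api/query?search_query=all:"
           + urllib.parse.quote(idea) + f"&max_results={max_results}")
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    ns = {"a": "http://www.w3.org/2005/Atom"}
    return [(e.findtext("a:title", namespaces=ns),
             e.findtext("a:summary", namespaces=ns))
            for e in root.findall("a:entry", ns)]

for title, abstract in arxiv_search("knowledge distillation scaling laws"):
    print(title)
```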
Dec 29, 2024 7 tweets 3 min read
Wonderful Matrices is a new foundation model architecture designed to make ML models more efficient and versatile.

It combines techniques, such as:

▪️ Rotary position embedding: Helps the model understand the order and position of words or elements in data.

▪️ Dynamic mask attention: Helps focus on relevant parts of the data and ignore unnecessary ones.

▪️ Cross-domain Mixture of Experts: Improves how the model learns and stores general and specific knowledge.

Here are the details:

1. Rotary position embedding (RPE):

It uses a rotary matrix to encode positional information based on the position index. By combining absolute and relative encoding, it allows efficient position-aware computations without altering large score matrices.

This is perfect for fast, linear attention and algorithms like Quadratic Causal Self-Attention and State Space Duality.
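Here is a compact PyTorch sketch of standard rotary embedding, rotating each channel pair by an angle proportional to the position index so dot products depend only on relative offsets; this is the textbook formulation, not code from the paper.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply standard RoPE to x of shape (seq, dim) with even dim."""
    seq, dim = x.shape
    pos = torch.arange(seq, dtype=torch.float32).unsqueeze(1)          # (seq, 1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos * freqs                                               # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # 2D rotation of each channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rotary_embed(torch.randn(16, 64))
```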
Dec 22, 2024 9 tweets 3 min read
Tree-of-Code (ToC) is a new way to help LLM-based agents perform better decision-making and execution.

It combines 2 powerful ideas:

- Tree-of-Thought for structured problem-solving
- CodeAct, which generates Python code for actions, for task-planning efficiency

ToC treats code as a way of thinking and builds a tree-like system.

▪️ Tests show that ToC is more reliable than Tree-of-Thought and more accurate than CodeAct.
▪️ It works with different AI models without needing extra training.

So, how does ToC work? 🧵

1. Tree-of-Code (ToC) creates a complete pipeline where the AI plans and solves tasks step-by-step, turning its reasoning into clear, executable code.

It uses a tree-like system to explore multiple ways of generating code and solving problems, making the process more reliable and accurate.
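A hedged sketch of the core loop: run the generated code, and let failures spawn revised children, so exploration forms a tree. `llm` is a hypothetical callable and the error handling is deliberately simplified.

```python
# Sketch of a Tree-of-Code-style search (my simplification).

def tree_of_code(task, llm, branching=2, depth=3):
    frontier = [llm(f"Write Python code to solve: {task}")]
    for _ in range(depth):
        next_frontier = []
        for code in frontier:
            try:
                exec(code, {})          # "code as thought": execute the node
                return code             # first executable solution wins
            except Exception as err:    # failed node expands into children
                for _ in range(branching):
                    next_frontier.append(llm(
                        f"Task: {task}\nCode:\n{code}\nError: {err}\nFix it:"))
        frontier = next_frontier
    return None                         # no executable solution found
```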
Dec 1, 2024 13 tweets 5 min read
Top 10 GitHub Repositories to master ML, AI and Data Science:

• 100 Days of ML Code
• Data Science For Beginners
• Awesome Data Science
• Data Science Masters
• Homemade Machine Learning
• 500+ AI Projects List with Code
• Awesome Artificial Intelligence
• Machine Learning Design Interview
• Data Science Interviews
• Data Science Best Resources
+ Our twitter library

Don't forget to save the list!

Check out the links below 👇

1. 100 Days of ML Code - 45.6k stars

A plan for studying machine learning topics such as data preprocessing, simple and multiple linear regression, logistic regression, the math behind ML, and much more.

github.com/Avik-Jain/100-…
Nov 29, 2024 7 tweets 3 min read
Natural Language Reinforcement Learning (NLRL) redefines Reinforcement Learning (RL).

The main idea:
In NLRL, the core parts of RL like goals, strategies, and evaluation methods are reimagined using natural language instead of rigid math.

What are the benefits?

- NLRL uses not only single numbers but also detailed feedback
- Interpretable and easier to understand
- Human-like decision-making

Let's explore this approach more closely🧵

1. Text-based MDP (Markov Decision Process):

In NLRL, the states, actions, and feedback from the environment are described using natural language. For example, NLRL starts with a language goal, like "reach the goal" or "open the door."
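A tiny illustration of what "reimagined using natural language" means in code: every RL ingredient is a string rather than a number. This is my own toy example, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class LanguageStep:
    """One step of a text-based MDP: state, action, and feedback are text."""
    state: str
    action: str
    feedback: str   # rich, interpretable feedback instead of a scalar reward

goal = "open the door"
step = LanguageStep(
    state="You see a key on the table and a locked door.",
    action="Take the key, then unlock the door.",
    feedback="Good move: the key fits, and the door is now open.",
)
```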
Nov 9, 2024 7 tweets 2 min read
In one of our first "A path towards AGI" posts we discussed Neuro-symbolic systems.

Here's a new example of their implementation👇

Neuro-Symbolic Predicates (NSPs) are smart rules that help robots think by combining visual perception (neural) with logical rules (symbolic). With NSPs, robots can more easily plan and tackle complex tasks.

NSPs use programming basics (conditions, loops) and can connect with VLMs that understand images and text.

Here are the details about:
- 2 types of NSPs
- selecting NSPs
- task planning with learning High-Level Actions (HLAs)

🧵

2 types of NSPs:

• Primitive NSPs: Directly interact with what the robot can see or feel. A primitive NSP might ask the VLM if the robot is holding something or if the gripper is open.

• Derived NSPs: Depend on other NSPs rather than direct observations. For example, they determine if an object is on a plate by checking if it's on another object that is on the plate.
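A hedged sketch of the two types; `vlm` is a hypothetical callable that answers yes/no questions about the current camera image, and the predicate names are illustrative.

```python
# Sketch of primitive vs. derived NSPs (my simplification).

def holding(obj, image, vlm):
    """Primitive NSP: queries the VLM about direct perception."""
    return vlm(image, f"Is the robot holding the {obj}?") == "yes"

def on(obj, surface, image, vlm):
    """Primitive NSP: direct spatial check via the VLM."""
    return vlm(image, f"Is the {obj} directly on the {surface}?") == "yes"

def on_plate(obj, candidates, image, vlm):
    """Derived NSP: built from other NSPs, not raw perception. True if obj
    sits on some intermediate object that is itself on the plate."""
    return any(on(obj, mid, image, vlm) and on(mid, "plate", image, vlm)
               for mid in candidates)
```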
Nov 8, 2024 12 tweets 4 min read
Do LoRA and full fine-tuning actually change the model in the same way?

@MIT_CSAIL identified key differences between LoRA and full fine-tuning:

- Various adaptation to task
- Performance
- LoRA's intruder dimensions

What are these intruder dimensions and what impacts them?👇 Image 1. Intruder dimensions:

LoRA fine-tuning introduces new directions in the model’s weights, called intruder dimensions. These don’t appear in full fine-tuning and show up as low-similarity or “outlier” directions in the model’s weights.
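One way to look for them, sketched below with my own simplified criterion: compare the right-singular vectors of the tuned and base weight matrices and flag tuned directions that match no base direction.

```python
import torch

def intruder_dimensions(w_base, w_tuned, sim_threshold=0.5):
    """Flag tuned singular directions with low similarity to all base ones
    (a simplified stand-in for the paper's analysis)."""
    _, _, v_base = torch.linalg.svd(w_base, full_matrices=False)
    _, _, v_tuned = torch.linalg.svd(w_tuned, full_matrices=False)
    # Max |cosine| of each tuned right-singular vector vs. every base one.
    sims = (v_tuned @ v_base.T).abs().max(dim=1).values
    return (sims < sim_threshold).nonzero().flatten()   # intruder indices

base = torch.randn(64, 64)
tuned = base + 0.5 * torch.randn(64, 1) @ torch.randn(1, 64)  # low-rank update
print(intruder_dimensions(base, tuned))
```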