elvis
Feb 20, 14 tweets, 5 min read
NEW: Sakana AI introduces The AI CUDA Engineer.

It's an end-to-end agentic system that can produce highly optimized CUDA kernels.

This is wild! They used AI to discover ways to make AI run faster!

Let's break it down:
The Backstory

Sakana AI's mission is to build more advanced and efficient AI using AI.

Their previous work includes The AI Scientist, LLMs that produce more efficient methods to train LLMs, and automation of new AI foundation models.

And now they have launched The AI CUDA Engineer.
Why is this research a big deal?

Writing efficient CUDA kernels is challenging for humans.

The AI CUDA Engineer is an end-to-end agent built to automatically produce and optimize CUDA kernels.
What's up with CUDA?

Writing CUDA kernels can help achieve high-performing AI algorithms.

However, this requires GPU knowledge, and most AI algorithms today are written in a higher-level abstraction layer such as PyTorch.
An Agentic Pipeline

The agent translates PyTorch code into working CUDA kernels (Stages 1 & 2), then applies evolutionary optimization with techniques such as crossover prompting (Stage 3), and maintains an Innovation Archive (Stage 4) whose “stepping stone” kernels are reused for further gains.

Components:
Stage 1: PyTorch Modules to Functions

The AI CUDA Engineer first converts a PyTorch nn.Module to Functional PyTorch using an LLM.

The code is also validated for correctness.
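To make Stage 1 concrete, here is a minimal sketch of what "module to function" means, written in plain Python rather than PyTorch so it runs anywhere. TinyLinear and tiny_linear_fn are hypothetical names for illustration; the real system uses an LLM to do this conversion on actual nn.Module code.

```python
# Plain-Python stand-in for the module-to-function step (no PyTorch needed).
# All names here are hypothetical and only illustrate the idea.

class TinyLinear:
    """Stateful 'module': the parameters live on the object."""

    def __init__(self, weight, bias):
        self.weight = weight  # list of rows, one row per output feature
        self.bias = bias      # one value per output feature

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


def tiny_linear_fn(x, weight, bias):
    """Functional form: same math, but parameters are passed in explicitly."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


module = TinyLinear(weight=[[1.0, 2.0], [0.5, -1.0]], bias=[0.0, 1.0])
x = [3.0, 4.0]

module_out = module.forward(x)
functional_out = tiny_linear_fn(x, module.weight, module.bias)

# Validation step: the functional version must reproduce the module's output.
outputs_match = module_out == functional_out
```

The functional form has no hidden state, so every input the computation depends on is visible in the signature, which is a plausible reason it is an easier starting point for translation.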
Stage 2: Functional PyTorch to Working CUDA

The agent translates the functional PyTorch code into a working CUDA kernel using an LLM.

The kernel is loaded and assessed for numerical correctness.
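A numerical correctness check like the one in Stage 2 can be sketched as follows. This is an assumption about the general shape of such a check, not the authors' harness: candidate_softmax stands in for a generated kernel, reference_softmax for the PyTorch baseline, and the tolerances are illustrative.

```python
import math
import random


def reference_softmax(xs):
    """Trusted baseline implementation."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_softmax(xs):
    """Stand-in for a generated kernel: same result, different arithmetic order."""
    m = max(xs)
    log_total = math.log(sum(math.exp(x - m) for x in xs))
    return [math.exp(x - m - log_total) for x in xs]


def broken_softmax(xs):
    """Deliberately wrong candidate: forgets to normalize."""
    return [math.exp(x) for x in xs]


def is_numerically_correct(candidate, reference, trials=5, size=16,
                           rtol=1e-5, atol=1e-8, seed=0):
    """Compare candidate and reference on random inputs within a tolerance."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-10, 10) for _ in range(size)]
        got, want = candidate(xs), reference(xs)
        if len(got) != len(want):
            return False
        if not all(math.isclose(g, w, rel_tol=rtol, abs_tol=atol)
                   for g, w in zip(got, want)):
            return False
    return True


check_good = is_numerically_correct(candidate_softmax, reference_softmax)
check_bad = is_numerically_correct(broken_softmax, reference_softmax)
```

Exact equality is too strict here, since a reordered floating-point computation can differ in the last bits while still being correct, so the comparison uses a tolerance.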
Stage 3: Evolutionary CUDA Runtime Optimization

They use an evolutionary optimization process (including advanced prompting strategies, standard LLMs, and reasoning models like o3-mini & DeepSeek-R1) to select only the best-performing CUDA kernels.
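Here is a toy sketch of the evolutionary loop, under heavy assumptions. In the real system an LLM proposes new kernels and runtimes are measured on a GPU; below, a "kernel" is just a pair of hypothetical tuning parameters, crossover and mutation are random, and fake_runtime_ms is a synthetic cost function, not a measurement.

```python
import random

BLOCK_SIZES = [32, 64, 128, 256, 512]  # hypothetical search space
UNROLLS = [1, 2, 4, 8]


def fake_runtime_ms(cand):
    """Synthetic cost, lowest at block_size=256 and unroll=4. Not real data."""
    return (1.0
            + abs(cand["block_size"] - 256) / 256
            + abs(cand["unroll"] - 4) / 4)


def crossover(a, b, rng):
    """Combine two parents by picking each parameter from one of them."""
    return {key: rng.choice([a[key], b[key]]) for key in a}


def mutate(cand, rng):
    """Randomly re-draw one parameter."""
    child = dict(cand)
    if rng.random() < 0.5:
        child["block_size"] = rng.choice(BLOCK_SIZES)
    else:
        child["unroll"] = rng.choice(UNROLLS)
    return child


def evolve(generations=20, population_size=8, keep=4, seed=0):
    rng = random.Random(seed)
    population = [{"block_size": rng.choice(BLOCK_SIZES),
                   "unroll": rng.choice(UNROLLS)}
                  for _ in range(population_size)]
    history = []  # best runtime seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fake_runtime_ms)
        parents = population[:keep]  # selection: keep the fastest candidates
        history.append(fake_runtime_ms(parents[0]))
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    best = min(population, key=fake_runtime_ms)
    return best, history


best_candidate, best_per_generation = evolve()
```

Because the fastest candidates always survive into the next generation, the best runtime can only stay the same or improve over time.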
Stage 4: Innovation Archive

RAG is used to obtain high-performing kernels from related tasks; these are provided as context (stepping stones) to achieve further translation and performance gains.

Newly discovered CUDA kernels can also be added to the archive in the process.
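The retrieval idea can be sketched with a tiny in-memory archive. Everything here is an assumption for illustration: KernelArchive is a hypothetical class, similarity is a simple bag-of-words cosine rather than learned embeddings, and the entries and speedup values are placeholders, not real kernels or results.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts of the task description."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class KernelArchive:
    """Hypothetical archive of verified kernels keyed by task description."""

    def __init__(self):
        self.entries = []

    def add(self, task, kernel, speedup):
        self.entries.append({"task": task, "kernel": kernel, "speedup": speedup})

    def retrieve(self, query, k=2):
        """Return the k entries most similar to the query (ties: higher speedup)."""
        q = embed(query)
        ranked = sorted(self.entries,
                        key=lambda e: (cosine(q, embed(e["task"])), e["speedup"]),
                        reverse=True)
        return ranked[:k]


archive = KernelArchive()
# Placeholder entries; descriptions and speedups are made up for the example.
archive.add("row-wise softmax over a 2D tensor", "// kernel A (placeholder)", 1.8)
archive.add("layer normalization over the last dimension", "// kernel B (placeholder)", 1.3)
archive.add("matrix multiplication with square inputs", "// kernel C (placeholder)", 1.1)

stepping_stones = archive.retrieve("log softmax over a 2D tensor", k=2)

# The retrieved kernels would be pasted into the LLM prompt as context.
prompt_context = "\n\n".join(e["kernel"] for e in stepping_stones)
```

A new verified kernel would go back in through archive.add, which is how the archive grows as more tasks are solved.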
Kernel Runtime Speedups

The AI CUDA Engineer discovers CUDA kernels with speedups of up to 10-100x over native and compiled kernels in PyTorch.

It can also convert entire ML architectures into optimized CUDA kernels.
Performance:

The AI CUDA Engineer robustly translates PyTorch Code to CUDA Kernels.

It achieves more than a 90% translation success rate!
Highlighted AI CUDA Engineer-Discovered Kernels

The AI CUDA Engineer can robustly improve CUDA runtime.

> Outperforms PyTorch native runtimes on 81% of the 229 tasks considered
> 20% of all discovered CUDA kernels are at least twice as fast as their PyTorch implementations
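For intuition on how numbers like these are derived, here is a small sketch that turns paired runtimes into the same kind of summary. The runtimes below are made up for the example and are not from the paper; speedup is assumed to mean baseline runtime divided by kernel runtime.

```python
import statistics


def summarize_speedups(runtimes):
    """runtimes: list of (baseline_ms, kernel_ms) pairs, one per task."""
    speedups = [baseline / kernel for baseline, kernel in runtimes]
    n = len(speedups)
    return {
        "faster_than_baseline": sum(s > 1.0 for s in speedups) / n,
        "at_least_2x": sum(s >= 2.0 for s in speedups) / n,
        "median_speedup": statistics.median(speedups),
    }


# Illustrative (baseline_ms, kernel_ms) pairs, not real measurements.
example_runtimes = [(2.0, 1.0), (1.5, 1.0), (1.0, 1.25), (3.0, 1.0), (1.2, 1.0)]
summary = summarize_speedups(example_runtimes)
```

In this toy data, 4 of 5 kernels beat the baseline and 2 of 5 are at least twice as fast. Note that the thread's first figure is counted per task and the second per discovered kernel, so the two percentages have different denominators.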
The AI CUDA Engineer Archive

The team has made available an archive of more than 17,000 verified CUDA kernels.

These can be used for downstream fine-tuning of LLMs.

There is also a website to explore verified CUDA kernels.