Apple just changed the game with AI. But it's not what you think.
They used AI agents to cut software testing time by 85% and improve accuracy by 45%.
Apple proved the ROI of AI.
This is not what I expected from them.
What else are they cooking in Cupertino?
Here's how it works:
- Map: Feeds all project documentation into a hybrid knowledge base using a vector database for semantic search and a graph database (TigerGraph). It maps the relationships between business processes that AI often misses.
- Delegate: Assigns specific jobs to a team of specialized agents. A 'legacy analysis' agent, a 'compliance validator,' and a 'test case generator' work together. They used Gemini Pro for complex reasoning.
- Automate: Generates 25,000 test cases with full contextual awareness, achieving 98.7% functional coverage and complete requirement traceability from end to end.
Result: This Agentic RAG system improved accuracy by 45% (from 65.2% to 94.8%), crushing the baseline of older AI methods.
This isn't a lab experiment. It was validated on a real-world SAP S/4HANA migration with over 100 external system integrations.
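To make the Map → Delegate → Automate flow above concrete, here's a minimal Python sketch of the pattern: a hybrid store that combines vector similarity with a business-process graph, feeding retrieved context into a chain of specialized agents. Every class, function, and the llm() stub below is an illustrative assumption, not Apple's code; the paper pairs a vector database with TigerGraph and uses Gemini Pro for reasoning.

```python
# Hypothetical sketch of the pipeline shape: hybrid retrieval (vector similarity
# + business-process graph) feeding a chain of specialized agents.
# All names and the llm() stub are illustrative, not Apple's implementation.
from dataclasses import dataclass, field

def embed(text: str) -> list[float]:
    # Stand-in bag-of-letters embedding; a real system calls an embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(x * x for x in b)) ** 0.5
    return dot / (norm + 1e-9)

@dataclass
class HybridKnowledgeBase:
    docs: dict = field(default_factory=dict)    # doc_id -> (text, embedding)
    graph: dict = field(default_factory=dict)   # process -> downstream processes

    def add_doc(self, doc_id: str, text: str):
        self.docs[doc_id] = (text, embed(text))

    def link(self, upstream: str, downstream: str):
        self.graph.setdefault(upstream, []).append(downstream)

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        # 1) semantic search over the vector index
        q = embed(query)
        seeds = sorted(self.docs, key=lambda d: -cosine(q, self.docs[d][1]))[:top_k]
        # 2) graph expansion to pull in related business processes
        related = {n for s in seeds for n in self.graph.get(s, [])}
        return [self.docs[d][0] for d in seeds] + [
            self.docs[n][0] if n in self.docs else n for n in sorted(related)
        ]

def llm(role: str, prompt: str) -> str:
    # Placeholder for a call to a reasoning model (the paper uses Gemini Pro).
    return f"[{role}] output for: {prompt[:40]}..."

def run_pipeline(kb: HybridKnowledgeBase, requirement: str) -> str:
    context = "\n".join(kb.retrieve(requirement))
    legacy = llm("legacy-analysis", f"Analyze legacy impact of: {requirement}\n{context}")
    checked = llm("compliance-validator", f"Validate compliance of:\n{legacy}")
    return llm("test-case-generator", f"Generate traceable test cases from:\n{checked}")

kb = HybridKnowledgeBase()
kb.add_doc("order-to-cash", "Order-to-cash process spec for the S/4HANA migration.")
kb.add_doc("billing", "Billing interface contract with the external tax engine.")
kb.link("order-to-cash", "billing")
print(run_pipeline(kb, "Validate order-to-cash billing integration"))
```

The real system scales this shape to full project documentation and 25,000 generated test cases; the point here is just the retrieve-then-delegate structure.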
Why this matters:
- Business Leaders: An 85% timeline reduction and a projected 35% cost saving add up to a massive competitive advantage. This de-risks projects and changes the economics of enterprise software deployment.
- Practitioners: This provides a production-grade blueprint to use AI for creating test artifacts. The hybrid vector-graph architecture solves the context-loss problem that plagues most enterprise AI automation.
- Researchers: This paper provides a real-world validation of multi-agent systems. It demonstrates that moving from monolithic RAG to orchestrated, specialized agents is necessary for solving complex, context-dependent enterprise problems.
Apple's multi-agent software testing cut the time per task from 18 minutes to just 1.5 minutes.
They were able to deliver 85% faster, slashing YEARS off the project timeline.
Apple's Agentic RAG system beat every other method in accuracy, completeness, consistency, and traceability.
Apple gave the blueprint for enterprise-grade, AI-powered software testing.
Agentic RAG for Software Testing with Hybrid Vector-Graph and Multi-Agent Orchestration by @Apple
Princeton's new method: 95% win rate.
Old methods: 0% win rate.
They did it by replacing complex rewards with a single goal, unlocking self-taught cooperation.
Here's how it works:
The technique, Independent Contrastive RL (ICRL), teaches teamwork by changing the AI's objective.
1. Define the Goal, Not the Path. Forget complex rewards. You provide the system with a single example of the final "win state." This is the only guidance the agents get.
2. Learn by Comparison. The system's critic learns to distinguish between actions that lead toward the goal and those that don't. This is done via a contrastive loss (InfoNCE), which trains it to identify the true future state from a batch of negative samples.
3. Create a Shared Map. This contrastive process forces the system to build an internal map where every agent can understand its "distance" to the goal. This turns a single sparse win/loss signal into a dense, continuous learning signal.
4. Teamwork Emerges. Each agent is an independent learner with a decentralized policy. But because they all use this same shared map to navigate, cooperative strategies emerge without a central commander.
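Here's a minimal sketch of the contrastive-critic idea in step 2, assuming a standard InfoNCE setup; the encoder names, toy dimensions, and the single training step are illustrative, not the paper's implementation.

```python
# Minimal sketch (assumed, not the paper's code) of the contrastive critic:
# embed (observation, action) pairs and goal/future states into one latent
# space, then train with InfoNCE so embeddings that lead to the goal align.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS, ACT, GOAL, LATENT = 8, 2, 8, 32   # toy dimensions

class SAEncoder(nn.Module):
    """Maps an agent's local observation + action into the shared latent space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS + ACT, 64), nn.ReLU(),
                                 nn.Linear(64, LATENT))
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class GoalEncoder(nn.Module):
    """Maps a future (or goal) state into the same latent space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(GOAL, 64), nn.ReLU(),
                                 nn.Linear(64, LATENT))
    def forward(self, state):
        return self.net(state)

def infonce_loss(sa_z, goal_z, temperature=0.1):
    # Row i's positive is column i (the state actually reached later on the
    # same trajectory); every other column in the batch serves as a negative.
    logits = sa_z @ goal_z.t() / temperature
    labels = torch.arange(sa_z.size(0))
    return F.cross_entropy(logits, labels)

sa_enc, g_enc = SAEncoder(), GoalEncoder()
opt = torch.optim.Adam(list(sa_enc.parameters()) + list(g_enc.parameters()), lr=3e-4)

# Fake batch standing in for replay data: (obs, act) paired with a state that
# was actually reached afterwards, or with the single example "win state".
obs, act, future = torch.randn(64, OBS), torch.randn(64, ACT), torch.randn(64, GOAL)
opt.zero_grad()
loss = infonce_loss(sa_enc(obs, act), g_enc(future))
loss.backward()
opt.step()

# The dot product sa_z . goal_z now acts as the shared "distance to goal" map
# that each agent's decentralized policy can maximize independently.
```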
Why this matters:
This changes how we can build and deploy cooperative AI agents.
- Business Leaders: You can now tackle complex coordination problems without the time and cost of designing a perfect reward system. This makes many advanced automation projects feasible.
- Practitioners: This is a solution for those impossible sparse-reward MARL problems. Just define the win state and let the agents learn.
- Researchers: This paper challenges the long-held belief that explicit rewards are necessary for complex exploration. It shows that a simple goal is enough to drive emergent, intelligent cooperation.
Princeton's new method (blue) quickly learns how to win complex battles.
A prior top method, IPPO (red), flatlines and doesn't win a single game.
Self-Supervised Goal-Reaching Results in Multi-Agent Cooperation and Exploration
Meta Superintelligence Labs just made LLMs handle 16x more context and unlocked up to a 31x speedup. 🤯
Their new REFRAG framework rethinks RAG from the ground up to achieve this, all with zero drop in accuracy.
Here's how it works:
The core problem with long context is simple: making a document 2x longer can make your AI 4x slower.
This is because an LLM's attention mechanism is expensive. Its cost and memory usage grow quadratically (N²) with the length of the text.
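A quick back-of-envelope check of that scaling (a toy illustration, not REFRAG code):

```python
# Attention compares every token with every other token, so cost grows as N^2.
def attention_cost(n_tokens: int) -> int:
    return n_tokens ** 2

print(attention_cost(8_000) / attention_cost(4_000))  # doubling context -> 4.0x cost
```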
REFRAG sidesteps this.
Compress: A small, lightweight encoder first reads the retrieved documents. It compresses every 16-token chunk of text into a single, dense vector called a "chunk embedding," which captures the semantic essence.
Shorten: The main LLM is then fed a sequence of these embeddings instead of the raw tokens. The input it has to process is now 16x shorter.
Accelerate: Because the input sequence is so short, the quadratic attention calculation is cheaper, and the KV cache (the primary memory hog in LLMs) is smaller. This is what unlocks the 30.85x speedup.
Select: To guarantee accuracy, a Reinforcement Learning (RL) policy acts as a quality control supervisor. It identifies the most critical, information-dense chunks and tells the system not to compress them, ensuring key details are preserved.
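Here's a hedged sketch of the compress-and-shorten path, assuming 16-token chunks summarized by a small encoder and a stubbed keep-policy; the module names are illustrative, not Meta's implementation.

```python
# Illustrative sketch: compress most retrieved chunks into single vectors,
# keep the chunks the (stubbed) selection policy flags as critical.
import torch
import torch.nn as nn

CHUNK, D_MODEL, VOCAB = 16, 512, 32_000   # toy sizes

token_emb = nn.Embedding(VOCAB, D_MODEL)                     # decoder token embeddings
chunk_encoder = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
project = nn.Linear(D_MODEL, D_MODEL)                        # map summary into decoder space

def compress(doc_tokens: torch.Tensor, keep: list[bool]) -> torch.Tensor:
    """doc_tokens: (n_chunks * CHUNK,) token ids -> shortened decoder input:
    one vector per compressed chunk, CHUNK vectors per kept chunk."""
    pieces = []
    for i, chunk in enumerate(doc_tokens.split(CHUNK)):
        embs = token_emb(chunk).unsqueeze(0)                 # (1, CHUNK, D_MODEL)
        if keep[i]:
            pieces.append(embs.squeeze(0))                   # critical chunk: raw tokens
        else:
            summary = chunk_encoder(embs).mean(dim=1)        # (1, D_MODEL)
            pieces.append(project(summary))                  # 16 tokens -> 1 vector
    return torch.cat(pieces, dim=0)

doc = torch.randint(0, VOCAB, (4 * CHUNK,))   # 4 retrieved chunks of 16 tokens
keep = [False, True, False, False]            # stand-in for the RL policy's decision
shortened = compress(doc, keep)
print(doc.shape[0], "tokens ->", shortened.shape[0], "decoder inputs")   # 64 -> 19
```

The decoder now attends over 19 inputs instead of 64 tokens, which is where the cheaper attention and the smaller KV cache come from.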
Why this matters:
REFRAG makes the promise of large-context RAG a production reality.
- Business Leaders: This is how you scale AI applications profitably. Deliver more powerful answers to users, analyzing entire reports, not just pages, all while being faster and cheaper.
- Practitioners: You no longer need to choose between large contexts and reasonable memory budgets. REFRAG lets you have both. It's an architectural win without architectural changes.
- Researchers: This work shows that co-designing decoding strategies with application-specific data patterns (like RAG's attention sparsity) yields results beyond generic, brute-force solutions.
Jet-Nemotron doesn't require training a new model from scratch; it upgrades your existing ones for hyper-speed while matching or beating SOTA accuracy.
Here's how it works:
The technique is called Post Neural Architecture Search (PostNAS). It's a revolutionary process for retrofitting pre-trained models.
Freeze the Knowledge: It starts with a powerful model (like Qwen2.5) and locks down its core MLP layers, preserving its intelligence.
Surgical Replacement: It then uses a hardware-aware search to replace most of the slow, O(n²) full-attention layers with a new, hyper-efficient linear attention design called JetBlock.
Optimize for Throughput: The search keeps a few key full-attention layers in the exact positions needed for complex reasoning, creating a hybrid model optimized for speed on H100 GPUs.
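Here's a hedged sketch of that layer surgery on a toy model; LinearAttention below is a generic linear-attention stand-in rather than JetBlock, and the kept-layer positions are placeholders for what the hardware-aware search would actually pick.

```python
# Illustrative PostNAS-style retrofit: freeze MLPs, swap most attention layers
# for a linear-attention block, keep full attention only at searched positions.
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """O(n) attention stand-in: positive feature maps instead of softmax over N x N."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
    def forward(self, x):                                    # x: (B, N, D)
        q = torch.relu(self.q(x)) + 1e-6
        k = torch.relu(self.k(x)) + 1e-6
        v = self.v(x)
        kv = torch.einsum("bnd,bne->bde", k, v)              # accumulate k^T v once
        z = 1.0 / torch.einsum("bnd,bd->bn", q, k.sum(dim=1))
        return torch.einsum("bnd,bde,bn->bne", q, kv, z)

class ToyBlock(nn.Module):
    def __init__(self, d_model=64, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

class ToyModel(nn.Module):
    def __init__(self, n_layers=12):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock() for _ in range(n_layers))

def retrofit(model, keep_full_attention=(2, 11)):
    """Freeze MLP weights (the 'knowledge'), replace attention everywhere except
    the positions the search decided still need full attention."""
    for i, block in enumerate(model.blocks):
        for p in block.mlp.parameters():
            p.requires_grad = False
        if i not in keep_full_attention:
            block.attn = LinearAttention(block.attn.embed_dim)
    return model

model = retrofit(ToyModel())
kept = sum(isinstance(b.attn, nn.MultiheadAttention) for b in model.blocks)
print(f"{kept} full-attention layers kept, {12 - kept} swapped for linear attention")
print(model.blocks[0].attn(torch.randn(1, 10, 64)).shape)    # torch.Size([1, 10, 64])
```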
The result is Jet-Nemotron: an AI delivering 2,885 tokens per second with top-tier model performance and a 47x smaller KV cache.
Why this matters to your AI strategy:
- Business Leaders: A 53x speedup translates to a ~98% cost reduction for inference at scale. This fundamentally changes the ROI calculation for deploying high-performance AI.
- Practitioners: This isn't just for data centers. The massive efficiency gains and tiny memory footprint (154MB cache) make it possible to deploy SOTA-level models on memory-constrained and edge hardware.
- Researchers: PostNAS offers a new, capital-efficient paradigm. Instead of spending millions on pre-training, you can now innovate on architecture by modifying existing models, dramatically lowering the barrier to entry for creating novel, efficient LMs.
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search