Post

More from @rohanpaul_ai

Rohan Paul

@rohanpaul_ai

Feb 12

An open-source 9B model with a 1M-token context and an Apache 2.0 license has just been released on Hugging Face. It’s designed to run on a single consumer-class GPU, such as the NVIDIA RTX 5090.

This model breaks the "Compute Wall" and the "Memory Wall," achieving 3.5× faster inference and significantly lower KV-cache overhead compared to dense baselines.

This is no longer an "either-or" choice between performance and efficiency.

How?
Full Attention mechanism's computational complexity grows quadratically with length, making edge-side long-text inference "slow and memory-intensive."

Solution: MiniCPM-SALA adopts a golden ratio of 75% Linear Attention + 25% Sparse Attention.

MiniCPM-SALA (9B) is OpenBMB’s long-context model aimed at running 1M to 2M tokens on a single GPU without the memory spikes and OOM failures common with dense full attention. The main idea is a sparse plus linear hybrid that keeps long-range recall accurate while keeping cost manageable as context grows.

- Architecturally, about 25% of layers use InfLLM-V2 style sparse attention for high-fidelity long-range retrieval, while about 75% use Lightning linear attention, so compute scales close to linearly with sequence length. Instead of a uniform interleave, the sparse layers are placed via a 1:3 layer-selection pattern.

- For positional handling and stability, SALA uses hybrid positional encoding (HyPE): RoPE stays in the linear layers but is removed in sparse layers to avoid long-range decay, and it adds QK-normalization plus output gating to improve stability and reduce attention-sink behavior.

- Training is done by converting a pretrained Transformer, not training from scratch. It starts from a MiniCPM-4.0 intermediate checkpoint trained on 7T tokens, then applies HALO conversion, keeping the 1st and last layers unconverted and initially training only the converted linear layers.

Conversion plus post-training totals about 2T tokens, framed as about a 75% cost reduction versus an 8T scratch run, with context ramping from 512 to 4K, then to 32K, 160K, and 520K, followed by SFT at 64K and 140K.

Reported results keep standard performance strong (76.53 average, HumanEval 95.12, AIME24 83.75, AIME25 78.33)

While improving long-context behavior (RULER 92.65 at 64K, 89.37 at 128K). It also reports single-GPU 1M-token inference where Qwen3-8B OOMs, 256K TTFT, improving from 180.8s to 51.6s, and RULER holding at 86.3 at 1M and 81.6 at 2M without YaRN.

Go to Hugging Face/GitHub to test the model capabilities yourself.

🧵 2. The diagram compares a standard Transformer attention block on the right with the “hybrid” replacement block on the left.

On the right, softmax attention needs to keep a big key value cache for every past token, so as the context gets huge, the GPU runs out of memory and also slows down.

On the left, most layers swap that attention for an RNN-style “mixer” that keeps a running state S_t, so the model carries a compressed summary forward instead of storing per-token history, which makes very long context much cheaper in memory and compute.

The numbered marks show small but important fixes they apply during their HALO conversion, mainly hybrid positional encoding (HyPE) plus a few stability tweaks so the hybrid layers behave like the original Transformer at short context but do not fall apart at long context.

MiniCPM-SALA applies the same core idea at scale, keeping only 25% heavier attention style layers and making 75% of layers use cheaper attention variants, and the project claims this makes 1M token inference practical on a single RTX 5090 because KV cache pressure drops hard.

🧵 3. “Hybridizing attention” can keep quality while cutting long context memory and latency.

MiniCPM-SALA is the productized version of that same idea

In the paper, the researchers take a dense Transformer family (Qwen3) and convert it into a hybrid model they call HypeNet using a distillation recipe called HALO (Hybrid Attention via Layer Optimization), then they show HypeNet keeps performance while using less memory and avoiding the long context slowdown and out-of-memory failure you see in dense attention.

Also, the hybrid model can push higher throughput at a given quality level, meaning it generates tokens faster for the same kind of task, while the dense baseline slows down.

The right plot shows that, as context grows toward 1M, the dense Qwen3 version runs out of GPU memory, but the hybrid version still runs and keeps time per output token much lower.

The key architectural reason is that most layers stop using full softmax attention that needs a large key value cache for every past token, and instead use a cheaper hybrid or linear style mixer plus positional encoding changes like HyPE, so long context does not break.

This is the same general idea MiniCPM-SALA is selling: keep only a smaller fraction of heavier attention layers and make most layers cheaper, which is why they claim 1M token inference on a single RTX 5090.

Read 9 tweets

Rohan Paul

@rohanpaul_ai

Jan 14

DeepSeek's innovation level is really at another level.

Its new paper just uncovered a new U-shaped scaling law.

Shows that N-grams still matter. Instead of dropping them in favor of neural networks, they hybridize the 2. This clears up the dimensionality problem and removes a big source of inefficiency in modern LLMs.

Uncovers a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram).

Right now, even “smart” LLMs waste a bunch of their early layers re-building common phrases and names from scratch, because they do not have a simple built-in “lookup table” feature.

Mixture-of-Experts already saves compute by only running a few expert blocks per token, but it still forces the model to spend compute to recall static stuff like named entities and formula-style text.

Engram is basically a giant memory table that gets queried using the last few tokens, so when the model sees a familiar short pattern it can fetch a stored vector quickly instead of rebuilding it through many layers.

They implement that query using hashed 2-gram and 3-gram patterns, which means the model always does the same small amount of lookup work per token even if the table is huge.

The big benefit is that if early layers stop burning time on “static reconstruction,” the rest of the network has more depth left for real reasoning, and that is why reasoning scores go up even though this sounds like “just memory.”

The long-context benefit is also solid, because offloading local phrase glue to memory frees attention to focus on far-away relationships, and Multi-Query Needle-in-a-Haystack goes from 84.2 to 97.0 in their matched comparison.

The system-level big deal is cost and scaling, because they show you can offload a 100B memory table to CPU memory and the throughput drop stays under 3%, so you can add a lot more “stored stuff” without needing to fit it all on GPU memory.

🧩 The core problem

The paper splits language modeling into 2 jobs, deep reasoning that needs real computation, and local stereotyped patterns that are basically fast recall.

Transformers do not have a native lookup block, so they burn early attention and feed-forward layers to rebuild static stuff like multi-token entities and formulaic phrases.

That rebuild is expensive mainly because it eats sequential depth, meaning the model spends layers on trivia-like reconstruction before it even starts the harder reasoning steps.

Classical N-gram models already handle a lot of this local dependency work with cheap table access, so forcing a Transformer to relearn it through compute is a design mismatch.

Engram is their way of turning “lookup” into a first-class primitive that lives next to MoE, instead of being faked by extra neural layers.

Engram adds a huge hashed N-gram memory table that gets queried with a fixed amount of work per token, so early layers stop wasting compute rebuilding names and stock phrases.

They show the best results when about 20% to 25% of the sparse budget moves from experts into this memory, while total compute stays matched.

Engram hits 97.0 on Multi-Query Needle-in-a-Haystack, while the matched MoE baseline hits 84.2.

Read 9 tweets

Rohan Paul

@rohanpaul_ai

Jan 10

Anthropic has launched improved safety classifiers aimed at stopping AI jailbreaks.

The key idea is to add a cheap “early warning” safety check that runs all the time, and only turn on a stronger safety check when something looks suspicious, while also judging the user prompt and the model’s reply together so hidden or coded jailbreak requests do not slip through.

A new “Constitutional Classifiers++” setup stops universal jailbreaks without making the model expensive or annoying to use, by using a cheap internal probe to screen everything and only escalating suspicious chats to a stronger context-aware classifier.

40x less compute than a full exchange classifier, and 0.05% refusals on production traffic.

They proposed an “ensemble cascade” where an exchange classifier reads the prompt and the model’s partial answer together to beat obfuscation tricks, but it runs rarely because a near-free linear probe, built off the model’s own internal activations, decides when escalation is needed.

🧠 The idea

A “Constitutional Classifier” is a separate model whose whole job is to read a conversation and decide whether the assistant should continue or refuse.

“Constitutional” here means the classifier is trained against a written rule set that says what is allowed and what is disallowed, like helping with normal chemistry learning but not with chemical weapon steps.

The core deployment problem is that defenses need to be cheap and rarely block normal users, because even a tiny refusal rate becomes real friction at scale.

This paper treats compute cost and refusal rate as first-class constraints, not afterthoughts, and the whole design is built around that.

🚨 What broke in the last system

The earlier setup split the job into 2 separate checks, one model looked only at the user input, and another looked only at the assistant output.

That split sounds clean, but it creates a blind spot when the harmful meaning is distributed across the prompt and the reply, so neither side looks bad alone.

Reconstruction-style attacks exploit this by hiding a harmful request inside a larger harmless-looking blob, then asking the model to piece it back together while answering.

Obfuscation-style attacks exploit it by pushing the model to speak in code words, riddles, or substitutions that look safe if the output is judged without the prompt context.

Some of these attack styles also damage normal model capability, and the paper shows GPQA Diamond accuracy dropping from 74.2% to 32.3% under 1 such jailbreak pattern, which signals the attack is “expensive” but still not something to rely on.

Read 7 tweets

Rohan Paul

@rohanpaul_ai

Jan 1

🚨 BREAKING: DeepSeek dropped a core Transformer architecture improvement.

A traditional transformer is basically a long stack of blocks, and each block has a “main work path” plus a “shortcut path” called the residual connection that carries the input around the block and adds it back at the end.

Each block in this original transformer architecture does some work (self attention or a small feed forward network), then it adds the block’s input back onto the block’s output, which is why people describe it as a “main path” plus a “shortcut path.”

Hyper-Connections is a drop-in change to that shortcut path, because instead of carrying 1 stream of activations through the stack, the model carries a small bundle of parallel streams, then it learns how to mix them before a block and after a block.

Standard Transformers pass information through 1 residual stream. Hyper-Connections turn that into n parallel streams, like n lanes on a highway. Small learned matrices decide how much of each lane should mix into the others at every layer.

In a normal residual connection, each layer takes the current hidden state, runs a transformation, then adds the original back, so information can flow forward without getting stuck.

In this new Hyper-Connections, the layer does not see just 1 hidden state, it sees a small bundle of them, and before the layer it learns how to mix that bundle into the input it will process.

So in a traditional transformer block, wherever you normally do “output equals input plus block(input),” Hyper-Connections turns that into “output bundle equals a learned mix of the input bundle plus the block applied to a learned mix,” so the shortcut becomes more flexible than a plain add.

After this learned layer, the "Hyper-Connections" mechanism again learns how to mix the transformed result back into the bundle, so different lanes can carry different kinds of information, and the model can route signal through the shortcut in a more flexible way.

The catch is that if those learned mixing weights are unconstrained, stacking many blocks can make signals gradually blow up or fade out, and training becomes unstable in big models.

This paper proposes mHC, which keeps Hyper-Connections but forces every mixing step to behave like a safe averaging operation, so the shortcut stays stable while the transformer still gets the extra flexibility from multiple lanes.

---

The paper shows this stays stable at 27B scale and beats both a baseline and unconstrained Hyper-Connections on common benchmarks.

HC can hit about 3000x residual amplification, mHC keeps it around 1.6x.

This image compares 3 ways to build the shortcut path that carries information around a layer in a transformer.

The left panel is the normal residual connection, where the model adds the layer output back to the original input so training stays steady as depth grows.

The middle panel is Hyper-Connections, where the model keeps several parallel shortcut streams and learns how to mix them before the layer, around the layer, and after the layer, which can help quality but can also make the shortcut accidentally amplify or shrink signals when many layers stack.

The right panel is mHC, which keeps the same Hyper-Connections idea but forces those mixing steps to stay in a constrained safe shape every time, so the shortcut behaves like a controlled blend and stays stable at large scale.

What “hyper-connection” means here.

You widen the residual from size C to n×C, treat it as n streams, and learn 3 tiny mixing pieces per layer.

One mixes the residual streams with each other, this is the crucial one. One gathers from the streams into the layer. One writes results back to the streams.

The paper’s contribution is to keep the first one in the safe “doubly stochastic” set, so it mixes without amplifying.

Read 10 tweets

Rohan Paul

@rohanpaul_ai

Dec 25, 2025

A MASSIVE 303 page study from the very best Chinese Labs.

The paper explains how code focused language models are built, trained, and turned into software agents that help run parts of development.

These models read natural language instructions, like a bug report or feature request, and try to output working code that matches the intent.

The authors first walk through the training pipeline, from collecting and cleaning large code datasets to pretraining, meaning letting the model absorb coding patterns at scale.

They then describe supervised fine tuning and reinforcement learning, which are extra training stages that reward the model for following instructions, passing tests, and avoiding obvious mistakes.

On top of these models, the paper surveys software engineering agents, which wrap a model in a loop that reads issues, plans steps, edits files, runs tests, and retries when things fail.

Across the survey, they point out gaps like handling huge repositories, keeping generated code secure, and evaluating agents reliably, and they share practical tricks that current teams can reuse.

Overview of the evolution of code large language models (Code-LLMs) and related ecosystems from 2021 to 2025.

Evolution of programming development and research landscapes in AI-powered code generation.

Read 17 tweets

Rohan Paul

@rohanpaul_ai

Dec 3, 2025

Overview of the evolution of code large language models (Code-LLMs) and related ecosystems from 2021 to 2025.

Evolution of programming development and research landscapes in AI-powered code generation.

Read 17 tweets

Share this page!

Enter URL or ID to Unroll

Rohan Paul

Try unrolling a thread yourself!

More from @rohanpaul_ai

Rohan Paul

Rohan Paul

Rohan Paul

Rohan Paul

Rohan Paul

Rohan Paul

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!