KL divergence has its origins in information theory. The primary goal of information theory is to quantify how much information is in data, and its most important metric is called entropy.
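As a quick refresher, here are the standard textbook definitions (general facts, not tied to any specific paper below):

```latex
% Entropy of a discrete distribution p: the average information content.
H(p) = -\sum_{x} p(x)\,\log p(x)

% KL divergence: the extra bits paid for modeling p with an approximation q.
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log \frac{p(x)}{q(x)}
```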
DeepSeek's pace of innovation is on another level.
Its latest paper uncovers a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram).
It shows that N-grams still matter. Instead of dropping them in favor of neural networks, DeepSeek hybridizes the two, which sidesteps the dimensionality problem of classical N-grams and removes a big source of inefficiency in modern LLMs.
Right now, even “smart” LLMs waste a bunch of their early layers re-building common phrases and names from scratch, because they do not have a simple built-in “lookup table” feature.
Mixture-of-Experts already saves compute by only running a few expert blocks per token, but it still forces the model to spend compute to recall static stuff like named entities and formula-style text.
Engram is basically a giant memory table that gets queried using the last few tokens, so when the model sees a familiar short pattern it can fetch a stored vector quickly instead of rebuilding it through many layers.
They implement that query using hashed 2-gram and 3-gram patterns, which means the model always does the same small amount of lookup work per token even if the table is huge.
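A minimal sketch of that lookup idea (the table size, hash scheme, and merge gate here are illustrative assumptions, not DeepSeek's exact design): hash the last 2 or 3 token ids into a fixed-size embedding table, fetch the stored vector, and add it into the hidden state, so the lookup cost per token is constant no matter how big the table grows.

```python
import torch
import torch.nn as nn

class NGramEngram(nn.Module):
    """Sketch of a hashed N-gram memory with constant-time lookup per token.
    Sizes, the hash function, and the gated merge are assumptions for
    illustration only."""

    def __init__(self, num_slots: int = 200_000, d_model: int = 512):
        super().__init__()
        self.num_slots = num_slots
        self.table = nn.Embedding(num_slots, d_model)  # the "memory table"
        self.gate = nn.Linear(d_model, d_model)        # learned gate before merging

    def hash_ngram(self, token_ids: torch.Tensor, n: int) -> torch.Tensor:
        # token_ids: (batch, seq). Hash the last n token ids at each position
        # into [0, num_slots) with a simple multiplicative rolling hash.
        h = torch.zeros_like(token_ids)
        for k in range(n):
            shifted = torch.roll(token_ids, shifts=k, dims=1)
            shifted[:, :k] = 0  # positions without a full n-gram map to slot 0
            h = (h * 1_000_003 + shifted) % self.num_slots
        return h

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # Fetch stored vectors for hashed 2-grams and 3-grams, gate, and add.
        mem = self.table(self.hash_ngram(token_ids, 2)) + self.table(self.hash_ngram(token_ids, 3))
        return hidden + torch.sigmoid(self.gate(hidden)) * mem
```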
The big benefit is that if early layers stop burning time on “static reconstruction,” the rest of the network has more depth left for real reasoning, and that is why reasoning scores go up even though this sounds like “just memory.”
The long-context benefit is also solid, because offloading local phrase glue to memory frees attention to focus on far-away relationships, and Multi-Query Needle-in-a-Haystack goes from 84.2 to 97.0 in their matched comparison.
The system-level big deal is cost and scaling, because they show you can offload a 100B memory table to CPU memory and the throughput drop stays under 3%, so you can add a lot more “stored stuff” without needing to fit it all on GPU memory.
🧩 The core problem
The paper splits language modeling into 2 jobs: deep reasoning that needs real computation, and local stereotyped patterns that are basically fast recall.
Transformers do not have a native lookup block, so they burn early attention and feed-forward layers to rebuild static stuff like multi-token entities and formulaic phrases.
That rebuild is expensive mainly because it eats sequential depth, meaning the model spends layers on trivia-like reconstruction before it even starts the harder reasoning steps.
Classical N-gram models already handle a lot of this local dependency work with cheap table access, so forcing a Transformer to relearn it through compute is a design mismatch.
Engram is their way of turning “lookup” into a first-class primitive that lives next to MoE, instead of being faked by extra neural layers.
Engram adds a huge hashed N-gram memory table that gets queried with a fixed amount of work per token, so early layers stop wasting compute rebuilding names and stock phrases.
They show the best results when about 20% to 25% of the sparse budget moves from experts into this memory, while total compute stays matched.
Engram hits 97.0 on Multi-Query Needle-in-a-Haystack, while the matched MoE baseline hits 84.2.
Anthropic has launched improved safety classifiers aimed at stopping AI jailbreaks.
The key idea is to add a cheap “early warning” safety check that runs all the time and only turn on a stronger safety check when something looks suspicious. The stronger check judges the user prompt and the model’s reply together, so hidden or coded jailbreak requests do not slip through.
A new “Constitutional Classifiers++” setup stops universal jailbreaks without making the model expensive or annoying to use, by using a cheap internal probe to screen everything and only escalating suspicious chats to a stronger context-aware classifier.
40x less compute than a full exchange classifier, and a 0.05% refusal rate on production traffic.
They proposed an “ensemble cascade” where an exchange classifier reads the prompt and the model’s partial answer together to beat obfuscation tricks, but it runs rarely because a near-free linear probe, built off the model’s own internal activations, decides when escalation is needed.
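A minimal sketch of that cascade logic (the probe, classifier, and threshold here are hypothetical stand-ins, not Anthropic's actual components):

```python
def moderate_exchange(prompt, partial_reply, activations,
                      probe, exchange_classifier, threshold=0.5):
    """Sketch of a probe-gated cascade: a near-free linear probe over the
    model's internal activations screens every exchange, and only suspicious
    ones are escalated to the expensive context-aware exchange classifier.
    `probe`, `exchange_classifier`, and `threshold` are illustrative."""
    # Stage 1: cheap linear probe on activations, run on every exchange.
    suspicion = probe.score(activations)  # e.g. a dot product plus a sigmoid
    if suspicion < threshold:
        return "allow"  # the vast majority of traffic stops here

    # Stage 2: the stronger classifier reads prompt and partial reply together,
    # so harm split across the two sides (reconstruction, code words) is visible.
    verdict = exchange_classifier.judge(prompt=prompt, reply=partial_reply)
    return "block" if verdict.is_harmful else "allow"
```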
🧠 The idea
A “Constitutional Classifier” is a separate model whose whole job is to read a conversation and decide whether the assistant should continue or refuse.
“Constitutional” here means the classifier is trained against a written rule set that says what is allowed and what is disallowed, like helping with normal chemistry learning but not with chemical weapon steps.
The core deployment problem is that defenses need to be cheap and rarely block normal users, because even a tiny refusal rate becomes real friction at scale.
This paper treats compute cost and refusal rate as first-class constraints, not afterthoughts, and the whole design is built around that.
🚨 What broke in the last system
The earlier setup split the job into 2 separate checks: one model looked only at the user input, and another looked only at the assistant output.
That split sounds clean, but it creates a blind spot when the harmful meaning is distributed across the prompt and the reply, so neither side looks bad alone.
Reconstruction-style attacks exploit this by hiding a harmful request inside a larger harmless-looking blob, then asking the model to piece it back together while answering.
Obfuscation-style attacks exploit it by pushing the model to speak in code words, riddles, or substitutions that look safe if the output is judged without the prompt context.
Some of these attack styles also damage normal model capability, and the paper shows GPQA Diamond accuracy dropping from 74.2% to 32.3% under 1 such jailbreak pattern, which signals the attack is “expensive” for the attacker, but that side effect is not something to rely on as a defense.
🚨 BREAKING: DeepSeek dropped a core Transformer architecture improvement.
A traditional transformer is basically a long stack of blocks, and each block has a “main work path” plus a “shortcut path” called the residual connection that carries the input around the block and adds it back at the end.
Each block in this original transformer architecture does some work (self attention or a small feed forward network), then it adds the block’s input back onto the block’s output, which is why people describe it as a “main path” plus a “shortcut path.”
Hyper-Connections is a drop-in change to that shortcut path, because instead of carrying 1 stream of activations through the stack, the model carries a small bundle of parallel streams, then it learns how to mix them before a block and after a block.
Standard Transformers pass information through 1 residual stream. Hyper-Connections turn that into n parallel streams, like n lanes on a highway. Small learned matrices decide how much of each lane should mix into the others at every layer.
In a normal residual connection, each layer takes the current hidden state, runs a transformation, then adds the original back, so information can flow forward without getting stuck.
With Hyper-Connections, the layer does not see just 1 hidden state; it sees a small bundle of them, and before the layer runs the model learns how to mix that bundle into the input it will process.
So in a traditional transformer block, wherever you normally do “output equals input plus block(input),” Hyper-Connections turns that into “output bundle equals a learned mix of the input bundle plus the block applied to a learned mix,” so the shortcut becomes more flexible than a plain add.
After the layer runs, the Hyper-Connections mechanism learns a second mix that writes the transformed result back into the bundle, so different lanes can carry different kinds of information and the model can route signal through the shortcut more flexibly.
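A minimal sketch of that bundle-of-streams residual (shapes, initialization, and mixer names are my assumptions; the actual papers use more careful parameterizations):

```python
import torch
import torch.nn as nn

class HyperConnectionBlock(nn.Module):
    """Sketch of Hyper-Connections around one transformer sub-block.
    Instead of `x = x + block(x)` on a single residual stream, we keep
    n streams and learn three small mixers: stream-to-stream, streams-to-block
    input, and block output-to-streams."""

    def __init__(self, block: nn.Module, n_streams: int = 4):
        super().__init__()
        self.block = block
        # Mixes the n residual streams with each other (the crucial one).
        self.stream_mix = nn.Parameter(torch.eye(n_streams))
        # Gathers the n streams into the single input the block actually sees.
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        # Writes the block's output back into each of the n streams.
        self.write = nn.Parameter(torch.ones(n_streams))

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq, d_model)
        mixed = torch.einsum("ij,jbtd->ibtd", self.stream_mix, streams)
        block_in = torch.einsum("j,jbtd->btd", self.read, mixed)
        block_out = self.block(block_in)
        # Each stream keeps its carried-over mix plus its share of the output.
        return mixed + self.write[:, None, None, None] * block_out[None]
```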
The catch is that if those learned mixing weights are unconstrained, stacking many blocks can make signals gradually blow up or fade out, and training becomes unstable in big models.
This paper proposes mHC, which keeps Hyper-Connections but forces every mixing step to behave like a safe averaging operation, so the shortcut stays stable while the transformer still gets the extra flexibility from multiple lanes.
---
The paper shows this stays stable at 27B scale and beats both a baseline and unconstrained Hyper-Connections on common benchmarks.
HC can hit about 3000x residual amplification; mHC keeps it around 1.6x.
This image compares 3 ways to build the shortcut path that carries information around a layer in a transformer.
The left panel is the normal residual connection, where the model adds the layer output back to the original input so training stays steady as depth grows.
The middle panel is Hyper-Connections, where the model keeps several parallel shortcut streams and learns how to mix them before the layer, around the layer, and after the layer, which can help quality but can also make the shortcut accidentally amplify or shrink signals when many layers stack.
The right panel is mHC, which keeps the same Hyper-Connections idea but forces those mixing steps to stay in a constrained safe shape every time, so the shortcut behaves like a controlled blend and stays stable at large scale.
What “hyper-connection” means here.
You widen the residual from size C to n×C, treat it as n streams, and learn 3 tiny mixing pieces per layer.
One mixes the residual streams with each other (this is the crucial one). One gathers from the streams into the layer's input. One writes the layer's results back to the streams.
The paper’s contribution is to keep the first one in the safe “doubly stochastic” set, so it mixes without amplifying.
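A minimal sketch of how a mixing matrix can be kept approximately doubly stochastic, here via standard Sinkhorn-style row/column normalization; whether mHC uses exactly this parameterization is an assumption on my part.

```python
import torch

def sinkhorn_doubly_stochastic(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Map an n x n matrix of unconstrained logits to an (approximately)
    doubly stochastic matrix: nonnegative entries, every row and every column
    summing to 1. Such a matrix can only average the streams, never amplify
    them, which is the stability property being targeted. The paper's exact
    projection may differ; this is one standard way to do it."""
    mat = torch.softmax(logits, dim=-1)  # nonnegative, rows sum to 1
    for _ in range(n_iters):
        mat = mat / mat.sum(dim=0, keepdim=True)  # normalize columns
        mat = mat / mat.sum(dim=1, keepdim=True)  # normalize rows
    return mat

# Example: a 4-stream mixer whose rows and columns each sum to roughly 1.
mix = sinkhorn_doubly_stochastic(torch.randn(4, 4))
print(mix.sum(dim=0), mix.sum(dim=1))  # both close to tensor([1., 1., 1., 1.])
```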
A MASSIVE 303 page study from the very best Chinese Labs.
The paper explains how code focused language models are built, trained, and turned into software agents that help run parts of development.
These models read natural language instructions, like a bug report or feature request, and try to output working code that matches the intent.
The authors first walk through the training pipeline, from collecting and cleaning large code datasets to pretraining, meaning letting the model absorb coding patterns at scale.
They then describe supervised fine tuning and reinforcement learning, which are extra training stages that reward the model for following instructions, passing tests, and avoiding obvious mistakes.
On top of these models, the paper surveys software engineering agents, which wrap a model in a loop that reads issues, plans steps, edits files, runs tests, and retries when things fail.
Across the survey, they point out gaps like handling huge repositories, keeping generated code secure, and evaluating agents reliably, and they share practical tricks that current teams can reuse.
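A minimal sketch of the read-plan-edit-test-retry loop those agents run (the `model.generate` interface, test command, and patch helper are hypothetical placeholders, not a specific framework's API):

```python
import subprocess

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and return (passed, log).
    The pytest command is a placeholder for the repo's own test runner."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def apply_patch(patch: str) -> None:
    """Placeholder: write the diff to disk and try to apply it with git."""
    with open("agent_patch.diff", "w") as f:
        f.write(patch)
    subprocess.run(["git", "apply", "agent_patch.diff"], check=False)

def software_agent(issue: str, model, max_attempts: int = 5) -> bool:
    """Sketch of an issue-to-patch agent loop: plan, edit, run tests, retry.
    `model` is assumed to be any LLM wrapper with a `.generate(prompt)` method."""
    plan = model.generate(f"Issue:\n{issue}\n\nWrite a step-by-step fix plan.")
    feedback = ""
    for _ in range(max_attempts):
        patch = model.generate(
            f"Issue:\n{issue}\nPlan:\n{plan}\nPrevious test output:\n{feedback}\n"
            "Produce a unified diff that fixes the issue."
        )
        apply_patch(patch)              # edit files in the working tree
        passed, feedback = run_tests()  # ground the loop in real test results
        if passed:
            return True                 # tests pass, the agent is done
    return False                        # give up after max_attempts
```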
Overview of the evolution of code large language models (Code-LLMs) and related ecosystems from 2021 to 2025.
Evolution of programming development and research landscapes in AI-powered code generation.
Agents, robots, and us: Skill partnerships in the age of AI
- Today’s technologies could theoretically automate more than half of current US work hours. This reflects how profoundly work may change.
- By 2030, about $2.9 trillion of economic value could be unlocked in the United States.
- Demand for AI fluency—the ability to use and manage AI tools—has grown 7X in two years, faster than for any other skill in US job postings. The surge is visible across industries and likely marks the beginning of much bigger changes ahead.
Two-thirds of US work hours require only nonphysical capabilities.