Grigory Sapunov

@che_shr_cat

Jun 15 • 11 tweets • 3 min read

1/
Backprop is the engine of deep learning, but neuroscientists have insisted for decades that the brain can't do it. There are no dedicated "error" neurons or backward wiring.

What if the brain doesn't compute error in space, but in time? 🧵

2/
In "This is how the Neocortex Learns," Randall C. O'Reilly presents a unified theory showing how the mammalian brain approximates backpropagation.

It uses temporal differences across a 200 ms cycle to bypass the need for explicit error-representing cells.

3/
The math relies on an implicit error state:
Error ≈ Activation(plus) - Activation(minus)

Instead of separate error neurons, the same cortical cells represent predictions (minus phase) and outcomes (plus phase) at different moments, driven by bidirectional pathways.
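To make the implicit error concrete, here is a minimal sketch of a local update driven by the plus/minus difference. It is a toy, not O'Reilly's model: the single linear unit, the function names, the learning rate, and the input values are all assumptions made for illustration.

```python
# Toy sketch (assumed setup, not the paper's model): one linear unit whose
# weights change only from locally available quantities.

def minus_phase(weights, inputs):
    """Prediction: activation produced by the current weights alone."""
    return sum(w * x for w, x in zip(weights, inputs))

def plus_phase(target):
    """Outcome: the same unit is driven to the observed value."""
    return target

def local_update(weights, inputs, target, lr=0.1):
    """Apply delta_w = lr * input * (plus - minus); return new weights and error."""
    y_minus = minus_phase(weights, inputs)
    y_plus = plus_phase(target)
    error = y_plus - y_minus  # implicit error: same unit, two moments in time
    new_weights = [w + lr * x * error for w, x in zip(weights, inputs)]
    return new_weights, error

weights = [0.0, 0.0]   # hypothetical starting weights
inputs = [1.0, 0.5]    # hypothetical presynaptic activity
target = 1.0           # hypothetical outcome
errors = []
for _ in range(50):
    weights, err = local_update(weights, inputs, target)
    errors.append(abs(err))

print(f"first error: {errors[0]:.3f}, last error: {errors[-1]:.5f}")
```

No separate error unit appears anywhere: the update uses only the input and the unit's own activation at two different times.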

4/
This temporal phasing is coordinated by the corticothalamic loop over a 200 ms theta cycle.

Phase 1 (100 ms): Top-down layer 6 predictions settle.
Phase 2 (100 ms): Strong, focal layer 5b driver inputs override predictions with the actual sensory outcome.
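The two-phase schedule can be sketched as a simple timeline. This is only an illustration under stated assumptions: the leaky-integrator unit, the 20 ms time constant, the 1 ms step, and the name run_theta_cycle are hypothetical choices, not details from the paper.

```python
# Toy sketch (assumed dynamics): one unit settles on a top-down prediction
# for 100 ms, then is overridden by the driver input for 100 ms.

def run_theta_cycle(prediction, outcome, tau_ms=20.0):
    """Simulate one 200 ms cycle in 1 ms steps; return (minus, plus, error)."""
    act = 0.0
    minus = plus = 0.0
    for t_ms in range(1, 201):
        # Phase 1: prediction drives the unit. Phase 2: the outcome overrides it.
        drive = prediction if t_ms <= 100 else outcome
        act += (drive - act) / tau_ms  # leaky integration toward the drive
        if t_ms == 100:
            minus = act  # snapshot at the end of the prediction phase
        if t_ms == 200:
            plus = act   # snapshot at the end of the outcome phase
    return minus, plus, plus - minus

minus, plus, error = run_theta_cycle(prediction=0.3, outcome=0.8)
print(f"minus={minus:.3f} plus={plus:.3f} error={error:.3f}")
```

When the prediction matches the outcome, the two snapshots nearly coincide and the error signal in this toy is close to zero.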

5/
How does a physical synapse compute this?

Through a competitive, double-kinase pathway (CaMKII vs DAPK1) that integrates post-synaptic calcium.

If calcium influx rises rapidly (positive temporal derivative), CaMKII dominates, driving LTP.

6/
Recent in vitro tests support this over classical Hebbian learning.

A flat 50-50 Hz stimulation profile yields zero net plasticity. But a 25-50 Hz transition triggers robust LTP.

The synapse computes the derivative of activity, not just raw co-activity.
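A toy comparison shows why a derivative-sensitive rule and a raw co-activity rule disagree on these two protocols. Everything here is an assumption for illustration: the fast and slow traces, their time constants, and the use of mean rate as a Hebbian proxy. The traces are not meant to map onto CaMKII or DAPK1 specifically.

```python
# Toy sketch (assumed mechanism): the gap between a fast and a slow running
# average of the firing rate approximates the rate's temporal derivative.

def plasticity_signal(early_hz, late_hz, steps_per_half=100,
                      fast_tau=10.0, slow_tau=50.0):
    """Integrate (fast trace - slow trace) over a two-part stimulation."""
    fast = slow = float(early_hz)  # both traces start settled at the early rate
    total = 0.0
    for step in range(2 * steps_per_half):
        rate = early_hz if step < steps_per_half else late_hz
        fast += (rate - fast) / fast_tau
        slow += (rate - slow) / slow_tau
        total += fast - slow  # positive while activity is rising
    return total

def hebbian_signal(early_hz, late_hz):
    """Co-activity proxy: depends on mean rate only, not on its change."""
    return (early_hz + late_hz) / 2.0

flat = plasticity_signal(50, 50)  # flat 50-50 Hz profile
ramp = plasticity_signal(25, 50)  # 25-50 Hz transition
print(f"derivative rule: flat={flat:.1f}, ramp={ramp:.1f}")
print(f"hebbian proxy:   flat={hebbian_signal(50, 50)}, ramp={hebbian_signal(25, 50)}")
```

In this toy the derivative rule gives zero for the flat profile and a positive value for the transition, while the co-activity proxy ranks them the other way around.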

7/
The bottlenecks?

We still haven't fully mapped the exact driving targets of deep layer 5 cortical output neurons.

More importantly, while this runs in WebGPU-based spiking networks, we haven't seen it scale to massive, modern deep learning benchmarks yet.

8/
For neuromorphic hardware, this is a goldmine.

It offers a mathematically rigorous, local learning rule that completely eliminates the memory-heavy global backward pass.

We can build ultra-low-power, on-chip continuous learning systems using physical silicon.

9/
I think this work bridges the gap between biological plausibility and deep learning performance. It suggests gradient descent isn't just an artificial trick: it's likely how the brain actually optimizes its representations.

10/
Read my full breakdown of O'Reilly's paper:
arxiviq.substack.com/p/this-is-how-…

Original paper here:
arxiv.org/abs/2606.08720

How do you think biological learning rules will impact future AI hardware? Let's discuss below.

11/
Visual summary of the corticothalamic temporal loop mechanism:


