Grigory Sapunov
Jun 15 · 11 tweets · 3 min read
1/
Backprop is the engine of deep learning, but neuroscientists have insisted for decades that the brain can't do it. There are no dedicated "error" neurons or backward wiring.

What if the brain doesn't compute error in space, but in time? 🧵
2/
In "This is how the Neocortex Learns," Randall C. O'Reilly presents a unified theory showing how the mammalian brain approximates backpropagation.

It uses temporal differences across a 200 ms cycle to bypass the need for explicit error-representing cells.
3/
The math relies on an implicit error state:
Error ≈ Activation(plus) - Activation(minus)

Instead of separate error neurons, the same cortical cells represent predictions (minus phase) and outcomes (plus phase) at different moments, driven by bidirectional pathways.
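A minimal Python sketch of the implicit error state above. It assumes a single sigmoid unit and a delta-rule-style update (weight change proportional to input activity times the plus-minus difference). The function names, the learning rate, and the choice to clamp the plus phase directly to the outcome are illustrative assumptions, not details from the paper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def minus_phase(weights, inputs):
    """Prediction: the unit's activation when driven by its inputs alone."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))


def temporal_difference_update(weights, inputs, outcome, lr=0.5):
    """One step using Error ~= Activation(plus) - Activation(minus).

    Assumption: in the plus phase the unit is simply clamped to `outcome`,
    so the same unit carries the prediction first and the outcome second.
    """
    act_minus = minus_phase(weights, inputs)
    act_plus = outcome
    error = act_plus - act_minus
    # Local update: only this unit's two activations and its inputs are used.
    new_weights = [w + lr * error * x for w, x in zip(weights, inputs)]
    return new_weights, error


if __name__ == "__main__":
    weights, inputs = [0.0, 0.0], [1.0, 0.5]
    for _ in range(200):
        weights, error = temporal_difference_update(weights, inputs, outcome=0.9)
    print("prediction after training:", round(minus_phase(weights, inputs), 3))
```

No separate error variable is ever propagated between units here; the error exists only as the difference between two activations of the same unit at two moments.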
4/
This temporal phasing is coordinated by the corticothalamic loop over a 200 ms theta cycle.

Phase 1 (100 ms): Top-down layer 6 predictions settle.
Phase 2 (100 ms): Strong, focal layer 5b driver inputs override predictions with the actual sensory outcome.
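The two-phase schedule can be sketched as a toy simulation. Only the 200 ms cycle and the 100 ms / 100 ms split come from the thread; the leaky-integrator dynamics, the 20 ms time constant, and the names `phase_at` and `run_cycle` are hypothetical choices made for illustration.

```python
CYCLE_MS = 200   # one full cycle, as stated in the thread
MINUS_MS = 100   # first half: prediction settles; second half: outcome drives


def phase_at(t_ms: float) -> str:
    """Return which phase a given time falls in."""
    return "minus" if (t_ms % CYCLE_MS) < MINUS_MS else "plus"


def run_cycle(prediction: float, outcome: float, tau_ms: float = 20.0):
    """Simulate one cycle of a single unit in 1 ms steps.

    Assumption: the unit is a leaky integrator that relaxes toward the
    top-down prediction in the minus phase and toward the driver input
    (the outcome) in the plus phase.
    """
    activation = 0.0
    act_minus = 0.0
    for t in range(CYCLE_MS):
        drive = prediction if phase_at(t) == "minus" else outcome
        activation += (1.0 / tau_ms) * (drive - activation)
        if t == MINUS_MS - 1:
            act_minus = activation  # snapshot at the end of the minus phase
    act_plus = activation           # state at the end of the plus phase
    return act_minus, act_plus, act_plus - act_minus


if __name__ == "__main__":
    a_minus, a_plus, err = run_cycle(prediction=0.3, outcome=0.8)
    print(f"minus={a_minus:.3f} plus={a_plus:.3f} error={err:.3f}")
```

The error signal is read off as the difference between the end-of-phase activations, so it is close to outcome minus prediction whenever both phases have time to settle.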
5/
How does a physical synapse compute this?

Through a competitive, double-kinase pathway (CaMKII vs DAPK1) that integrates post-synaptic calcium.

If calcium influx rises rapidly (a positive temporal derivative), CaMKII dominates, driving LTP.
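One generic way two competing integrators can extract a temporal derivative is to give them different time constants and take their difference. The sketch below does that in Python. Treating the CaMKII-like trace as the fast one and the DAPK1-like trace as the slow one, along with the time constants and the name `kinase_drive`, is an assumption for illustration; the thread does not specify these details.

```python
def kinase_drive(calcium_trace, tau_fast=10.0, tau_slow=50.0):
    """Accumulated difference between a fast and a slow calcium integrator.

    Assumption: `fast` stands in for a CaMKII-like signal and `slow` for a
    DAPK1-like signal. Both start at the first calcium value, i.e. at
    steady state for the initial level.
    """
    fast = slow = calcium_trace[0]
    total = 0.0
    for ca in calcium_trace:
        fast += (ca - fast) / tau_fast   # tracks calcium quickly
        slow += (ca - slow) / tau_slow   # lags behind
        total += fast - slow             # positive while calcium is rising
    return total


if __name__ == "__main__":
    flat = [0.5] * 200
    rising = [0.25] * 100 + [0.5] * 100
    print("flat:", kinase_drive(flat))
    print("rising:", kinase_drive(rising) > 0)
```

With a constant calcium level the two traces stay equal and the drive is zero; after a step up, the fast trace leads the slow one and the drive turns positive.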
6/
Recent in vitro tests support this over classical Hebbian learning.

A flat 50-50 Hz stimulation profile yields zero net plasticity, but a 25-to-50 Hz transition triggers robust LTP.

The synapse computes the derivative of activity, not just raw co-activity.
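To see why these two protocols separate the hypotheses, here is a toy Python comparison. The linear derivative rule, the Hebbian baseline (mean rate squared, assuming pre- and postsynaptic cells fire at the same rate), and the learning rates are all illustrative assumptions. Note that this toy rule would also predict depression for a falling rate, which the thread does not report.

```python
def hebbian_dw(early_hz: float, late_hz: float, lr: float = 1e-4) -> float:
    """Classical Hebbian baseline: weight change tracks raw co-activity.

    Assumption: pre and post fire at the same rate, so co-activity is
    taken as the square of the mean rate over the protocol.
    """
    mean_rate = 0.5 * (early_hz + late_hz)
    return lr * mean_rate * mean_rate


def derivative_dw(early_hz: float, late_hz: float, lr: float = 1e-2) -> float:
    """Temporal-derivative rule: weight change tracks the change in rate."""
    return lr * (late_hz - early_hz)


if __name__ == "__main__":
    for name, (early, late) in {"50-50 Hz": (50, 50), "25-50 Hz": (25, 50)}.items():
        print(name, "hebbian:", hebbian_dw(early, late),
              "derivative:", derivative_dw(early, late))
```

The Hebbian baseline predicts potentiation for the flat 50-50 Hz profile, while the derivative rule predicts none there and potentiation only for the 25-to-50 Hz transition, which is the pattern the thread describes.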
7/
The bottlenecks?

We still haven't fully mapped the exact driving targets of deep layer 5 cortical output neurons.

More importantly, while this runs in WebGPU-based spiking networks, we haven't seen it scale to massive, modern deep learning benchmarks yet.
8/
For neuromorphic hardware, this is a goldmine.

It offers a mathematically rigorous, local learning rule that completely eliminates the memory-heavy global backward pass.

We can build ultra-low-power, on-chip continuous learning systems using physical silicon.
9/
I think this work bridges the gap between biological plausibility and deep learning performance. It suggests gradient descent isn't just an artificial trick; it's likely how the brain actually optimizes its representations.
10/
Read my full breakdown of O'Reilly's paper:
arxiviq.substack.com/p/this-is-how-…

Original paper here:
arxiv.org/abs/2606.08720

How do you think biological learning rules will impact future AI hardware? Let's discuss below.
11/
Visual summary of the corticothalamic temporal loop mechanism: [image]
