1/ Backprop is the engine of deep learning, but neuroscientists have insisted for decades that the brain can't do it. There are no dedicated "error" neurons or backward wiring.
What if the brain doesn't compute error in space, but in time? 🧵
2/ In "This is how the Neocortex Learns," Randall C. O'Reilly presents a unified theory showing how the mammalian brain approximates backpropagation.
The theory relies on temporal differences across a 200 ms cycle, which removes the need for explicit error-representing cells.
3/ The math relies on an implicit error state:
Error ≈ Activation(plus) - Activation(minus)
Instead of separate error neurons, the same cortical cells represent predictions (minus phase) and outcomes (plus phase) at different moments, driven by bidirectional pathways.
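A minimal sketch of how that implicit error could drive a local weight change, assuming a simple delta-rule form (sender activity times the receiver's plus-minus difference). The function name, learning rate, and toy numbers are hypothetical and are not taken from the paper.

```python
def two_phase_update(w, x_minus, y_minus, y_plus, lr=0.1):
    """Return updated incoming weights for one receiving unit.

    w        : weights from each sending unit
    x_minus  : sender activations in the minus (prediction) phase
    y_minus  : receiver activation in the minus phase
    y_plus   : receiver activation in the plus (outcome) phase
    lr       : learning rate (illustrative value)
    """
    # Implicit error: the same cell's activity at two different moments.
    error = y_plus - y_minus
    # Local rule: each weight only needs its own sender and receiver activity.
    return [w_i + lr * error * x_i for w_i, x_i in zip(w, x_minus)]


w = [0.2, -0.1]
w_new = two_phase_update(w, x_minus=[1.0, 0.5], y_minus=0.3, y_plus=0.8)
w_same = two_phase_update(w, x_minus=[1.0, 0.5], y_minus=0.8, y_plus=0.8)
```

When the prediction already matches the outcome (`w_same`), the error term is zero and the weights do not move.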
4/ This temporal phasing is coordinated by the corticothalamic loop over a 200 ms theta cycle.
Phase 1 (100 ms): Top-down layer 6 predictions settle.
Phase 2 (100 ms): Strong, focal layer 5b driver inputs override predictions with the actual sensory outcome.
5/ How does a physical synapse compute this?
Through a competitive, double-kinase pathway (CaMKII vs DAPK1) that integrates post-synaptic calcium.
6/ Recent in vitro tests support this over classical Hebbian learning.
A flat 50-50 Hz stimulation profile yields zero net plasticity. But a 25-50 Hz transition triggers robust LTP.
The synapse computes the derivative of activity, not just raw co-activity.
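One way to picture a "derivative of activity" readout is a fast running average minus a slow one. This is a toy illustration only: it is not the CaMKII/DAPK1 kinetic model, the two time constants are invented, and mapping either integrator to a specific kinase is not claimed here.

```python
def net_plasticity(rates_hz, tau_fast=2.0, tau_slow=10.0):
    """Toy readout: fast-minus-slow running average at the end of a stimulus train.

    rates_hz : firing rate at each time step
    Both integrators start at the first rate, i.e. already at steady state.
    """
    fast = slow = rates_hz[0]
    for r in rates_hz:
        fast += (r - fast) / tau_fast   # tracks recent activity quickly
        slow += (r - slow) / tau_slow   # lags behind, acting as a baseline
    return fast - slow


flat = net_plasticity([50.0] * 20)                     # constant 50 Hz
step_up = net_plasticity([25.0] * 10 + [50.0] * 10)    # 25 Hz then 50 Hz
step_down = net_plasticity([50.0] * 10 + [25.0] * 10)  # 50 Hz then 25 Hz
```

In this toy, a flat profile gives exactly zero, a rate increase gives a positive value, and a rate decrease gives a negative one. A rule based on raw co-activity would instead respond to the flat 50 Hz train.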
7/ The bottlenecks?
We still haven't fully mapped the exact driving targets of deep layer 5 cortical output neurons.
More importantly, while this runs in WebGPU-based spiking networks, we haven't seen it scale to massive, modern deep learning benchmarks yet.
8/ For neuromorphic hardware, this is a goldmine.
It offers a mathematically rigorous, local learning rule that completely eliminates the memory-heavy global backward pass.
We can build ultra-low-power, on-chip continuous learning systems using physical silicon.
9/ I think this work bridges the gap between biological plausibility and deep learning performance. It suggests gradient descent isn't just an artificial trick: it's likely how the brain actually optimizes its representations.
1/ Why does the Muon optimizer train LLMs 2x faster than Adam?
It isn't because Muon finds "better" directions of steep descent.
It's because Adam constantly runs head-first into massive second-order curvature penalties, paying a steep "curvature tax."
Let's dive in. 🧵
2/ In "Why Muon Outperforms Adam: A Curvature Perspective," Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Dirk Bergemann, and Zhuoran Yang deconstruct the LLM optimization landscape to find out why spectral normalization works so well.
3/ Every optimization step is a balance.
The loss decrease per step is gradient gain minus curvature penalty:
ΔL ≈ ⟨G, Z⟩ - 0.5 * ⟨Z, H[Z]⟩
Tracking this on a 124M-parameter LLM shows the first-order gains of Adam and Muon are identical.
The entire performance gap is Adam's massive curvature penalty.
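To make the gain/penalty split concrete, here is a small sketch on a toy quadratic loss, where the second-order expansion is exact. The matrix `A`, the update `Z`, and the convention that the step is `w - Z` are assumptions for illustration. Vectors stand in for weight matrices, since the inner products work the same way.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(A, b, w):
    # Toy quadratic loss 0.5 * w^T A w - b^T w; its Hessian is exactly A.
    return 0.5 * dot(w, matvec(A, w)) - dot(b, w)


def step_breakdown(A, b, w, Z):
    """Split the effect of the step w -> w - Z into first- and second-order parts."""
    G = [g - b_i for g, b_i in zip(matvec(A, w), b)]  # gradient A w - b
    gain = dot(G, Z)                                  # <G, Z>
    penalty = 0.5 * dot(Z, matvec(A, Z))              # 0.5 * <Z, H[Z]>
    return gain, penalty


A = [[10.0, 0.0], [0.0, 1.0]]   # one sharp direction, one flat direction
b = [0.0, 0.0]
w = [1.0, 1.0]
Z = [0.1, 0.1]

gain, penalty = step_breakdown(A, b, w, Z)
w_next = [w_i - z_i for w_i, z_i in zip(w, Z)]
actual_drop = loss(A, b, w) - loss(A, b, w_next)
```

Here `actual_drop` equals `gain - penalty`. Two updates with the same `gain` can still differ in how much loss they remove if one of them puts more of its step into high-curvature directions.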
1/ When multi-agent LLM systems debate, they are not just sharing ideas. They are implicitly running a dynamic routing algorithm.
The debate itself acts as a Mixture of Experts (MoE) gatekeeper, shifting influence based on agent confidence. 🧵
2/ In "Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?", Franka Bause, Jonas Niederle, Martin Pawelczyk, and Rebekka Burkholz model agent debates using social opinion dynamics.
Here is how conversational routing actually works.
3/ The authors map LLM debates to the Friedkin-Johnsen (FJ) model of opinion dynamics.
An agent's belief update is a balance of:
• Stubbornness (attachment to its original answer)
• Peer influence (openness to other agents)
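The two terms above correspond to the standard Friedkin-Johnsen update, sketched below with scalar opinions. The three-agent setup, the weights, and the stubbornness values are made up for illustration. How the paper maps LLM answers and confidence onto these quantities is not shown here.

```python
def fj_step(x, x0, W, stubbornness):
    """One Friedkin-Johnsen update.

    x            : current opinions, one float per agent
    x0           : initial opinions
    W            : row-stochastic weights, W[i][j] = trust agent i places in agent j
    stubbornness : per-agent value in [0, 1]; 1.0 means the agent never leaves x0
    """
    new_x = []
    for i, row in enumerate(W):
        peer_avg = sum(w_ij * x_j for w_ij, x_j in zip(row, x))
        s = stubbornness[i]
        new_x.append(s * x0[i] + (1.0 - s) * peer_avg)
    return new_x


x0 = [0.0, 1.0, 1.0]
W = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
stubbornness = [1.0, 0.2, 0.2]   # agent 0 never budges; agents 1 and 2 are open

x = list(x0)
for _ in range(50):
    x = fj_step(x, x0, W, stubbornness)
```

In this toy run the single stubborn agent stays at 0 and drags the two open agents from 1 down to about 1/3, even though it is outnumbered.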
1/ Can we have a natural language conversation with a frozen, non-neural biological system?
Not by rewriting its DNA, but by treating its raw physics as a reinforcement learning agent. Here is how we translate human prompts into biological actions. 🧵
2/ The paper "Language Game: Talking to Non-Human Systems" by Yanbo Zhang and Michael Levin (@drmichaellevin) bypasses bottom-up micromanagement.
Instead of editing gene networks, they wrap frozen dynamical systems in trained linear interfaces to communicate.
3/ The core architecture is a composite RL policy:
pi(s) = D[f(E(s))]
• An encoder (E) maps environment states to physical concentrations.
• A frozen biological ODE (f) computes physical gradients.
• A decoder (D) translates gradients into actions.
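The composition above can be sketched as follows. Everything specific here is an assumption: the two-species ODE is a made-up stand-in for the frozen system, the Euler settings are arbitrary, and the decoder reads out final concentrations because the thread does not specify exactly what is read from f.

```python
def encode(state, W_e):
    # Trainable linear interface E: environment state -> initial concentrations.
    return [sum(w * s for w, s in zip(row, state)) for row in W_e]


def frozen_dynamics(conc, steps=50, dt=0.05):
    # Stand-in for the frozen system f: a fixed two-species ODE with mutual
    # repression, integrated with Euler steps. It has no trainable parameters.
    a, b = conc
    for _ in range(steps):
        da = -a + 1.0 / (1.0 + b * b)
        db = -b + 1.0 / (1.0 + a * a)
        a, b = a + dt * da, b + dt * db
    return [a, b]


def decode(conc, W_d):
    # Trainable linear interface D: concentrations -> action values.
    return [sum(w * c for w, c in zip(row, conc)) for row in W_d]


def policy(state, W_e, W_d):
    # pi(s) = D[f(E(s))]; only W_e and W_d would be updated during training.
    return decode(frozen_dynamics(encode(state, W_e)), W_d)


W_e = [[0.5, 0.0, -0.5],
       [0.0, 1.0, 0.0]]          # 3 state features -> 2 concentrations
W_d = [[1.0, -1.0]]              # 2 concentrations -> 1 action value

action = policy([1.0, 0.0, -1.0], W_e, W_d)
action_scaled = policy([1.0, 0.0, -1.0], W_e, [[2.0, -2.0]])
```

The design point is that learning happens only in the thin linear wrappers, while the dynamics in the middle are used exactly as they are.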
1/ Looped transformers offer extreme parameter efficiency, but their quadratic self-attention kills long-context scalability.
What if you swapped attention for subquadratic mixers?
It turns out looping doesn't just save parameters—it actively multiplies linear-time expressivity. 🧵
2/ Introducing "LT2: Linear-Time Looped Transformers" by Chunyuan Deng, Yizhe Zhang, @eugene_ng, Hanjie Chen, et al.
They replace heavy softmax attention with subquadratic primitives, breaking the KV-cache bottleneck while keeping the reasoning benefits of deep weight recurrence.
3/ The math here is beautiful. In linear attention, a single pass updates the state with a rank-1 correction.
By unrolling the block for T loops with orthogonal projection keys, the transition operator expands into a rank-T update.
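A quick numeric check of the rank claim, assuming the plain additive linear-attention update S ← S + v kᵀ. This is a toy illustration rather than the LT2 architecture: the keys, values, and helper names are invented, and the paper's transition operator may be defined differently.

```python
def outer(v, k):
    return [[v_i * k_j for k_j in k] for v_i in v]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def rank(M, tol=1e-9):
    # Matrix rank via Gaussian elimination with partial pivoting.
    M = [row[:] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(M[i][c]))
        if abs(M[pivot][c]) < tol:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            factor = M[i][c] / M[r][c]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r


d = 4
keys = [[1.0, 0.0, 0.0, 0.0],     # orthonormal keys, one per loop
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0]]
values = [[1.0, 2.0, 0.0, 1.0],   # linearly independent values
          [0.0, 1.0, 1.0, 0.0],
          [1.0, 0.0, 0.0, 2.0]]

S = [[0.0] * d for _ in range(d)]
ranks = []
for v, k in zip(values, keys):
    S = add(S, outer(v, k))       # each loop adds one rank-1 term
    ranks.append(rank(S))

# Contrast: reusing the same key every loop never leaves rank 1.
S_same = [[0.0] * d for _ in range(d)]
for v in values:
    S_same = add(S_same, outer(v, keys[0]))
rank_same_key = rank(S_same)
```

With orthogonal keys the accumulated update gains one rank per loop (`ranks` is `[1, 2, 3]`), while repeating a single key collapses every loop into the same rank-1 direction.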
1/ Forcing LLMs to reason in English tokens is a massive structural bottleneck. Next-gen models won't "think" in text at all. They will reason natively in continuous latent space. 🧵
2/ Yu et al. just dropped "The Latent Space," a massive survey formalizing the shift from discrete token decoding to machine-native continuous computation. It maps the architectures making this possible.
3/ The discretization bottleneck is real. Mapping high-dimensional visual priors or complex reasoning paths into a discrete vocabulary space causes severe semantic loss and high sequential decoding latency.