Grigory Sapunov
PhD in AI | GDE in AI/ML | CTO Intento | Author "Deep Learning with JAX" 📝 ML insights: https://t.co/ySSOXJKL7H 🤖 Daily AI paper reviews: https://t.co/yQNYyqTbBR
Jun 16 10 tweets 4 min read
1/ We have been training RNNs wrong for decades.

Backpropagation through time (BPTT) forces sequential updates, creating unstable O(T) gradient paths.

What if we could train highly expressive, non-linear RNNs with flat, parallelized O(1) gradients?

It is now possible. 🧵

2/ In "Pretraining Recurrent Networks without Recurrence", Akarsh Kumar and Phillip Isola bypass BPTT entirely.

They decouple representation learning (what to remember) from transition dynamics (how to update).

The result: O(1) inference with parallel pretraining.
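
A toy illustration of the O(T) gradient path from tweet 1/, assuming a scalar linear RNN h_t = w * h_{t-1} + x_t. The toy model and the function name are illustrative assumptions, not the paper's setup. Backpropagating from step T to step 0 multiplies in one Jacobian factor per step, so the gradient scales like w**T and either vanishes or explodes:

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Return d h_T / d h_0 for the toy RNN h_t = w * h_{t-1} + x_t."""
    grad = 1.0
    for _ in range(steps):
        # Each time step contributes one factor dh_t/dh_{t-1} = w,
        # so the factors must be chained sequentially.
        grad *= w
    return grad


if __name__ == "__main__":
    for w in (0.9, 1.0, 1.1):
        print(w, gradient_through_time(w, 100))
```

With w slightly below 1 the gradient shrinks toward zero over 100 steps, and with w slightly above 1 it grows by orders of magnitude, which is the instability the thread refers to.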
Jun 15 11 tweets 3 min read
1/
Backprop is the engine of deep learning, but neuroscientists have insisted for decades that the brain can't do it. There are no dedicated "error" neurons or backward wiring.

What if the brain doesn't compute error in space, but in time? 🧵

2/
In "This is how the Neocortex Learns," Randall C. O'Reilly presents a unified theory showing how the mammalian brain approximates backpropagation.

It uses temporal differences across a 200 ms cycle to bypass the need for explicit error-representing cells.
Jun 15 11 tweets 3 min read
1/ Standard transformers have a fundamental topological flaw: they cannot track dynamic states over time without running out of layers.

Once a state representation reaches the top layer of the feedforward stack, the model's ability to update its belief collapses. 🧵

2/ This is the core thesis of "The Topological Trouble With Transformers" by Michael C. Mozer, Shoaib Ahmed Siddiqui, and Rosanne Liu.

They expose why purely feedforward networks are topologically incapable of long-term cognitive coherence.
Jun 12 12 tweets 4 min read
1/
Why does the Muon optimizer train LLMs 2x faster than Adam?

It isn't because Muon finds "better" steepest-descent directions.

It's because Adam constantly runs head-first into massive second-order curvature penalties, paying a steep "curvature tax."

Let's dive in. 🧵

2/
In "Why Muon Outperforms Adam: A Curvature Perspective," Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Dirk Bergemann, and Zhuoran Yang deconstruct the LLM optimization landscape to find out why spectral normalization works so well.
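
For context, Muon is usually described as replacing the raw momentum matrix with an approximately orthogonalized version (all singular values pushed toward 1) before taking a step. Below is a minimal pure-Python sketch of that idea, assuming the textbook cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X rather than Muon's tuned polynomial. The helper names are hypothetical and nothing here is taken from the paper:

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def orthogonalize(m, iters=15):
    """Approximate the orthogonal polar factor of a full-rank matrix m."""
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]
    for _ in range(iters):
        # X <- 1.5 X - 0.5 X X^T X pushes each singular value toward 1.
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


if __name__ == "__main__":
    update = orthogonalize([[3.0, 1.0], [0.0, 2.0]])
    print(matmul(update, transpose(update)))  # close to the identity matrix
```

The point of the sketch is only that the update keeps the directions of the input matrix while equalizing their scales.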
May 30 11 tweets 4 min read
1/
When multi-agent LLM systems debate, they are not just sharing ideas. They are implicitly running a dynamic routing algorithm.

The debate itself acts as a Mixture of Experts (MoE) gatekeeper, shifting influence based on agent confidence. 🧵

2/
In "Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?", Franka Bause, Jonas Niederle, Martin Pawelczyk, and Rebekka Burkholz model agent debates using social opinion dynamics.

Here is how conversational routing actually works.
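
One way to picture "debate as MoE gating" is a softmax gate over agent confidences that weights each agent's answer. This is a minimal sketch under that assumption. The function names, the temperature parameter, and the scalar "opinion" values are hypothetical and are not the paper's model:

```python
import math


def gate_weights(confidences, temperature=1.0):
    """Softmax over agent confidences: higher confidence -> more influence."""
    peak = max(confidences)  # subtract the max for numerical stability
    exps = [math.exp((c - peak) / temperature) for c in confidences]
    total = sum(exps)
    return [e / total for e in exps]


def mix_opinions(opinions, confidences, temperature=1.0):
    """Combine scalar opinions using the confidence-based gate."""
    weights = gate_weights(confidences, temperature)
    return sum(w * o for w, o in zip(weights, opinions))


if __name__ == "__main__":
    opinions = [0.1, 0.8, 0.4]      # each agent's current answer
    confidences = [0.2, 0.9, 0.5]   # each agent's stated confidence
    print(gate_weights(confidences))
    print(mix_opinions(opinions, confidences))
```

Lowering the temperature concentrates influence on the most confident agent, which is the "influencer" effect the thread title alludes to.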
May 29 11 tweets 4 min read
1/
Can we have a natural language conversation with a frozen, non-neural biological system?

Not by rewriting its DNA, but by treating its raw physics as a reinforcement learning agent. Here is how we translate human prompts into biological actions. 🧵

2/
The paper "Language Game: Talking to Non-Human Systems" by Yanbo Zhang and Michael Levin (@drmichaellevin) bypasses bottom-up micromanagement.

Instead of editing gene networks, they wrap frozen dynamical systems in trained linear interfaces to communicate.
May 23 11 tweets 3 min read
1/
Looped transformers offer extreme parameter efficiency, but their quadratic self-attention kills long-context scalability.

What if you swapped attention for subquadratic mixers?

It turns out looping doesn't just save parameters; it actively multiplies linear-time expressivity. 🧵

2/
Introducing "LT2: Linear-Time Looped Transformers" by Chunyuan Deng, Yizhe Zhang, @eugene_ng, Hanjie Chen, et al.

They replace heavy softmax attention with subquadratic primitives, breaking the KV-cache bottleneck while keeping the reasoning benefits of deep weight recurrence.
Apr 14 11 tweets 3 min read
1/ Forcing LLMs to reason in English tokens is a massive structural bottleneck. Next-gen models won't "think" in text at all. They will reason natively in continuous latent space. 🧵

2/ Yu et al. just dropped "The Latent Space", a massive survey formalizing the shift from discrete token decoding to machine-native continuous computation. It maps the architectures making this possible.
Mar 21 11 tweets 3 min read
1/ Video models understand motion but hallucinate geometry. Image models nail geometry but are blind to motion. We have accepted this tradeoff for years. Meta FAIR just proved it is purely an architectural bug, not a theoretical limit. 🧵

2/ V-JEPA 2.1 by Mur-Labadia, Muckley, and the FAIR team fixes the global-local representation bottleneck. It unifies image and video representation learning into a single encoder. This is a massive step for embodied AI world models.
Mar 5 11 tweets 3 min read
1/ LLMs spontaneously form perfect geometric manifolds: circles for months, spirals for timelines. We usually assume this requires deep, complex learning dynamics. A new paper proves it is actually just basic data statistics forcing the math. 🧵

2/ The paper "Symmetry in language statistics shapes the geometry of model representations" by Karkada et al. solves a major interpretability puzzle. It links the shape of the neural code directly to translation symmetry in the training corpus.
Feb 20 11 tweets 4 min read
1/
Standard scaling laws might be inefficient.

New research demonstrates matching GPT-2/Pythia baselines with 37% fewer parameters or 24% fewer training tokens.

The secret? Stop predicting just the next token. Predict the "Next Concept" first. 🧵

2/
Paper: Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models
Authors: Liu et al. (LUMIA Lab)

The premise: Standard Transformers waste compute managing long-range dependencies at the syntax level. ConceptLM adds a latent planning layer.
Feb 17 10 tweets 3 min read
1/
Transformers don't count like computers. We assume they have hidden "registers" to track variables. We were wrong.

New research by @AnthropicAI reverse-engineered Claude 3.5 Haiku and found it works with 6D helical manifolds.

It's geometry, not math. 🧵

2/
Paper: "When Models Manipulate Manifolds"
Context: How does a model receiving *token IDs* track *character lengths* for line-wrapping?

The tokenizer abstracts characters away. To solve this, the model must reconstruct length, accumulate it, and compare against a limit.
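
Written as explicit code, the three steps above (reconstruct length, accumulate, compare) look like the sketch below. It is only the "computer" version of the task, with a hypothetical toy vocabulary and function name; it says nothing about how the model implements the computation internally:

```python
def fits_on_line(vocab, line_token_ids, next_token_id, limit):
    """Return True if appending the next token keeps the line within limit."""
    # 1. Reconstruct: look up each token's character length from its ID.
    # 2. Accumulate: sum the lengths of the tokens already on the line.
    used = sum(len(vocab[t]) for t in line_token_ids)
    # 3. Compare: check the running count plus the next token against the limit.
    return used + len(vocab[next_token_id]) <= limit


if __name__ == "__main__":
    # Hypothetical toy vocabulary mapping token IDs to their text.
    vocab = {0: "The", 1: " quick", 2: " brown", 3: " fox"}
    print(fits_on_line(vocab, [0, 1, 2], 3, limit=20))
    print(fits_on_line(vocab, [0, 1, 2], 3, limit=18))
```

The model only ever sees the integer IDs, so the lookup in step 1 is exactly the piece it has to learn implicitly.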