Postdoc at CBS-NTT Program on Physics of Intelligence, Harvard University.
Jun 28 • 16 tweets • 5 min read
🚨New paper! We know models learn distinct in-context learning strategies, but *why*? Why generalize instead of memorizing to lower loss? And why is generalization transient?
Our work explains this & *predicts a Transformer's behavior throughout training* without access to its weights! 🧵
1/
We first define Bayesian predictors for ICL settings that involve learning a finite mixture of tasks:
🔴 Memorizing (M): discrete prior on seen tasks
🔵 Generalizing (G): continuous prior matching the true task distribution
These match known strategies from prior work! (Both predictors are sketched right below.)
2/
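Roughly, for a context 𝒟 and query x, the two predictors differ only in their prior over tasks τ. The notation below is schematic (ours, not necessarily the paper's):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Memorizing predictor (M): discrete prior over the K tasks seen during pretraining.
\[
  p_{\mathrm{M}}(y \mid x, \mathcal{D}) \;\propto\;
  \sum_{k=1}^{K} p(y \mid x, \tau_k)\, p(\mathcal{D} \mid \tau_k)
\]
% Generalizing predictor (G): continuous prior matching the true task distribution p(tau).
\[
  p_{\mathrm{G}}(y \mid x, \mathcal{D}) \;\propto\;
  \int p(y \mid x, \tau)\, p(\mathcal{D} \mid \tau)\, p(\tau)\, d\tau
\]
\end{document}
```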
Jun 6 • 11 tweets • 4 min read
🚨 New paper alert!
The linear representation hypothesis (LRH) posits that concepts are encoded as a **sparse sum of orthogonal directions**, motivating interpretability tools like SAEs. But what if some concepts don’t fit that mold? Would SAEs still capture them? 🤔
1/11
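For context, a vanilla ReLU + L1 sparse autoencoder (the standard recipe, not necessarily the exact variant studied here) bakes in exactly this assumption: decoder columns play the role of concept directions, and each activation is reconstructed as a sparse sum of them.

```python
# Minimal ReLU + L1 SAE sketch (standard recipe; illustrative, not the paper's code).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)   # maps activations to feature coefficients
        self.dec = nn.Linear(d_dict, d_model)   # columns act as candidate concept directions

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.enc(h))             # sparse, non-negative feature activations
        h_hat = self.dec(z)                     # reconstruction as a sum of directions
        return h_hat, z

sae = SparseAutoencoder(d_model=512, d_dict=4096)
h = torch.randn(8, 512)                         # a batch of model activations (dummy data)
h_hat, z = sae(h)
loss = ((h - h_hat) ** 2).mean() + 1e-3 * z.abs().mean()  # reconstruction + L1 sparsity
```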
We propose to stress-test SAEs by formalizing the LRH and a specific concept structure that lies outside this interpretation: hierarchical concepts that are not linearly accessible!
2/11
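A generic toy illustration of what "not linearly accessible" means (just an illustration, not the paper's hierarchical construction): a concept that is the XOR of two represented bits defeats any linear probe, while a small nonlinear probe reads it off easily.

```python
# Toy example: a concept with no linear read-out direction (XOR of two latent bits).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(2000, 2))         # two binary "parent" concepts
reps = bits + 0.1 * rng.normal(size=bits.shape)   # noisy 2-D representations
label = bits[:, 0] ^ bits[:, 1]                   # XOR concept: nonlinear in the representation

linear_probe = LogisticRegression().fit(reps, label)
mlp_probe = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(reps, label)
print("linear probe acc:", linear_probe.score(reps, label))    # ~0.5 (chance)
print("nonlinear probe acc:", mlp_probe.score(reps, label))    # ~1.0
```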
Nov 10, 2024 • 11 tweets • 5 min read
Paper alert—accepted as a NeurIPS *Spotlight*!🧵👇
We build on our past work relating emergence to task compositionality and analyze the *learning dynamics* of such tasks: we find there exist latent interventions that can elicit these capabilities long before input prompting works! 🤯
We first define “concept space”, a coordinate space whose axes denote specific concepts (e.g., color, size). We train diffusion models on a set of concept combinations & map generations of seen / unseen combinations back to the concept space, yielding *concept learning dynamics*!
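A hypothetical sketch of that bookkeeping (names and shapes are illustrative, not the paper's code): each concept axis gets a probe, generations from a checkpoint are mapped to coordinates, and repeating this across checkpoints traces out the concept learning dynamics.

```python
# Illustrative "concept space" mapping: probes read each concept off generated samples.
import numpy as np

CONCEPT_AXES = ["color", "size"]
seen_combos = [(0, 0), (0, 1), (1, 0)]    # concept combinations present in training data
unseen_combos = [(1, 1)]                  # held-out combination probing compositional generalization

def to_concept_space(generations, probes):
    """Map a batch of generations to concept-space coordinates (one probe per axis)."""
    return np.stack([probes[axis](generations) for axis in CONCEPT_AXES], axis=-1)

# Dummy stand-ins: in practice these would be classifiers trained to read each
# concept off images sampled from the diffusion model at a given checkpoint.
probes = {
    "color": lambda gens: gens.mean(axis=(1, 2, 3)),          # placeholder "color" score
    "size":  lambda gens: (gens > 0.5).mean(axis=(1, 2, 3)),  # placeholder "size" score
}
generations = np.random.rand(4, 32, 32, 3)        # pretend samples from one checkpoint
coords = to_concept_space(generations, probes)    # shape (4, 2): points in concept space
print(coords.shape)
```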