**GPT-style language models memorize 3.6 bits per param**
we compute capacity by measuring total bits memorized, using some theory from Shannon (1953)
shockingly, the memorization-datasize curves look like this:
      ___________
     /
    /
(🧵)
this all started from a quest to come up with a proper measurement of model memorization
it's hard to compute *per-example* memorization, because models "share" info between datapoints
so we start with random uniform strings, where sharing isn't possible. and we get this:
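for uniform strings, "bits memorized" is directly measurable: a string of n tokens costs n·log2(V) bits under a uniform code, and the model's compressed cost is its negative log-likelihood. the gap between the two is memorization. a minimal sketch of the idea (assuming an HF-style model with a `.logits` output -- not our exact estimator):

```python
import math
import torch
import torch.nn.functional as F

def memorized_bits(model, token_ids, vocab_size):
    """bits of a uniform-random string 'saved' by coding it with the model.

    baseline: every token costs log2(V) bits under the uniform code.
    model:    every token costs its negative log2-likelihood.
    the (clipped) gap is a proxy for bits memorized.
    """
    with torch.no_grad():
        logits = model(token_ids.unsqueeze(0)).logits  # (1, n, V), HF-style API
    logp = F.log_softmax(logits[0, :-1], dim=-1)       # predict token t+1 from token t
    nll_nats = -logp[torch.arange(len(token_ids) - 1), token_ids[1:]].sum()
    nll_bits = nll_nats.item() / math.log(2)
    baseline_bits = (len(token_ids) - 1) * math.log2(vocab_size)
    return max(0.0, baseline_bits - nll_bits)
```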
we then compute the capacity of different models
(GPT models with varying numbers of layers and hidden dimensions)
averaged over hundreds of models in fp32, we get the following curve, indicating a linear trend of around 3.6 bits-per-parameter, regardless of the exact details:
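the slope itself is just a least-squares line through the origin over (param count, measured capacity) pairs. a sketch with simulated measurements (not our actual data) to show the fit:

```python
import numpy as np

rng = np.random.default_rng(0)
params = np.array([1e5, 5e5, 1e6, 5e6, 1e7])        # hypothetical model sizes
capacity = 3.6 * params * rng.normal(1.0, 0.02, 5)  # simulated noisy measurements

# least squares through the origin: slope = sum(x*y) / sum(x*x)
bits_per_param = (params * capacity).sum() / (params ** 2).sum()
print(f"{bits_per_param:.2f} bits per parameter")   # ~3.6
```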
we train all of our models until they "saturate", which usually happens around 1M steps with a very large batch size
models memorize the same amount, regardless of training datasize
meaning they have fixed capacity and instead "spread it thinner" when trained on more examples
this gives a pretty good explanation of how models learn
in particular, it explains grokking
grokking occurs *exactly* when capacity saturates. this is the point where models can't perfectly fit every training example, so they have to share info between examples in a smart way
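a back-of-the-envelope check (the sequence length and vocab here are assumed for illustration, not numbers from the paper): you can predict where saturation should hit.

```python
import math

bits_per_param = 3.6        # the measured slope
n_params = 1_000_000        # hypothetical model size
capacity_bits = bits_per_param * n_params

# assumed data: 64-token uniform-random strings over a 2048-token vocab
bits_per_example = 64 * math.log2(2048)        # 704 bits each
print(f"saturates around {capacity_bits / bits_per_example:,.0f} examples")  # ~5,114
```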
we also compute capacity in bf16 and it drops a bit, to 3.5ish.
but that's a *relative* increase in bitwise usage: 3.6 of 32 bits is ~11%, while 3.5 of 16 bits is ~22%
(my first thought was that transformers are doing a bad job of using params efficiently, but now i'm not sure. it's not *that* bad)
when we train on text data, the curves look different
models memorize examples to the extent that they can fit them in their parameters
beyond this point, the models discard per-example mem. in favor of shared info (*generalization*)
see how the lines start to slope downward:
running these experiments in a clean setting with perfectly deduplicated texts tells us a lot about privacy:
- once capacity is sufficiently saturated, the **test examples** are slightly more extractable than the training examples -- maybe extraction is a bit of a myth?
- the most extracted examples are the ones with really rare tokens, typically data from other languages that slipped into the training set
- membership inference is much easier than extraction
and finally we can compute membership inference success rate across all our models, ending up with this scaling law 👇
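(for reference, the simplest membership-inference baseline is a loss-threshold attack, standard in the privacy literature and not necessarily our exact attack: predict "member" when the model's loss on an example is low. a self-contained sketch:)

```python
import numpy as np

def membership_auc(train_losses, test_losses):
    """AUC of the loss-threshold attack: lower loss => predict 'member'."""
    labels = np.concatenate([np.ones(len(train_losses)),    # members
                             np.zeros(len(test_losses))])   # non-members
    scores = -np.concatenate([train_losses, test_losses])   # low loss = high score
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), len(labels) - labels.sum()
    # Mann-Whitney U statistic, normalized to an AUC in [0, 1]
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```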
main takeaway: models trained on massive datasets (e.g. every LLM that comes out) can't memorize all of their training data
there's simply not enough capacity
this was a really fun project with lots of collaborators across various institutions. it took a long time but was definitely worth it, and i learned a lot!
also thanks to everyone who gave us feedback along the way :-)
Eventually BERT gave rise to RoBERTa. Then, DeBERTa. Later, ModernBERT.
And now, NeoBERT. The new state-of-the-art small-sized encoder:
the key insight, i think, is using an optimal depth-to-width ratio for the transformer architecture. and training on good data. a lot of good data.
even though NeoBERT has slightly more parameters, it's still faster AND more effective than ModernBERT for long sequences:
like many important advancements in deep learning, NeoBERT arose from running lots of tiny experiments, learning from them, and stacking the results together into something that works really well:
NEW RESEARCH: Approximating Language Model Training Data from Weights
ever wonder how much information is available in an open-weights model?
DeepSeek R1 weights are 1.2 TB...
what can we learn from all those bits?
our method reverses LLM finetuning to recover data: 🧵
to do this, you need TWO sets of model weights: the initial model and a finetune
this is realistic. open-weights models often come with two checkpoints
instead of one-shot generating data from weights, we select data from the web with gradients that point along the model diff
our algorithm is a bit complicated, mostly because computing per-example gradients is hard to do at scale
so we make some efficiency improvements (rough sketch after this list):
- computing grads w vmap
- only using last-layer grads (which are still big, in the case of LMs)
- projecting them to a smaller dim
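putting those three together, the scoring loop looks roughly like this (a sketch: `select_data` and its signature are mine, and it assumes an HF-style model plus `torch.func` for per-example grads):

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call, grad, vmap

def select_data(model, w_name, delta, candidates, proj, k=1000):
    """score candidate sequences by how well their last-layer gradient
    aligns with the finetune weight diff `delta`, after projection."""
    params = {w_name: dict(model.named_parameters())[w_name]}  # last layer only

    def example_loss(p, input_ids):
        logits = functional_call(model, p, (input_ids.unsqueeze(0),)).logits
        return F.cross_entropy(logits[0, :-1], input_ids[1:])

    # per-example grads without a python loop: vmap over the batch dim
    g = vmap(grad(example_loss), in_dims=(None, 0))(params, candidates)[w_name]
    g = g.flatten(1) @ proj                 # (B, d_proj): random projection
    d = delta.flatten() @ proj              # project the model diff the same way
    scores = F.cosine_similarity(g, d.unsqueeze(0))  # gradient/diff alignment
    return scores.topk(k).indices

# `proj` would be a fixed random gaussian matrix of shape (W.numel(), d_proj)
```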
a lot of past research (relative representations, The Platonic Representation Hypothesis, comparison metrics like CCA, SVCCA, ...) has asserted that different models learn the same thing once they reach a certain scale
this has been shown using various metrics of comparison
we take things a step further. if models E1 and E2 are learning 'similar' representations, what if we were able to actually align them?
and can we do this with just random samples from E1 and E2, by matching their structure?
we take inspiration from 2017 GAN papers that aligned pictures of horses and zebras...
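the flavor of that idea, as a toy sketch (random data stands in for real encoder outputs, and the actual method has more moving parts): train a translator T: E1 → E2 so translated E1 embeddings are indistinguishable from real E2 embeddings, with no paired samples needed.

```python
import torch
import torch.nn as nn

d1, d2 = 384, 768  # hypothetical embedding dims for E1 and E2

def sample_unpaired(d, n=256):
    # stand-in for drawing unpaired embeddings from a real encoder
    return torch.randn(n, d)

T = nn.Sequential(nn.Linear(d1, 512), nn.ReLU(), nn.Linear(512, d2))  # translator
D = nn.Sequential(nn.Linear(d2, 256), nn.ReLU(), nn.Linear(256, 1))   # discriminator
opt_T = torch.optim.Adam(T.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    x1, x2 = sample_unpaired(d1), sample_unpaired(d2)
    # discriminator: real E2 embeddings vs translated E1 embeddings
    d_loss = bce(D(x2), torch.ones(256, 1)) + bce(D(T(x1).detach()), torch.zeros(256, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # translator: make T(x1) look like it came from E2 (structure matching)
    t_loss = bce(D(T(x1)), torch.ones(256, 1))
    opt_T.zero_grad(); t_loss.backward(); opt_T.step()
```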
We spent a year developing cde-small-v1, the best BERT-sized text embedding model in the world.
today, we're releasing the model on HuggingFace, along with the paper on ArXiv.
I think our release marks a paradigm shift for text retrieval. let me tell you why👇
Typical text embedding models have two main problems:
1. training them is complicated and requires many tricks: giant batches, distillation, hard negatives...
2. the embeddings don't "know" what corpus they will be used in; consequently, all text spans are encoded the same way
To fix (1) we develop a new training technique: contextual batching. all batches share a lot of context – one batch might be about horse races in Kentucky, the next batch about differential equations, etc.
this lets us get better performance without big batches or hard negative mining. there's also some cool theory behind it
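mechanically, a minimal version of contextual batching might look like this (a sketch, assuming you already have cheap embeddings to cluster on -- our actual pipeline has more to it):

```python
import numpy as np
from sklearn.cluster import KMeans

def contextual_batches(embeddings, batch_size, n_clusters, seed=0):
    """group examples so every batch is drawn from one topical cluster,
    making in-batch negatives share context (horse races, diff eqs, ...)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
    batches = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        for i in range(0, len(idx) - batch_size + 1, batch_size):
            batches.append(idx[i:i + batch_size])  # one cluster per batch
    rng.shuffle(batches)  # still randomize batch *order* across clusters
    return batches
```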
there's a lot of talk today about "what happens" inside a language model, since these models spend the exact same amount of compute on each token, regardless of difficulty.
we touch on this question in our new theory paper, Do Language Models Plan for Future Tokens?
I think our most crucial finding is that although humans think far ahead while speaking (especially while working through complex reasoning problems), transformer language models... don't seem to do that.
they just predict the next token.
it's not that they *can't* plan ahead. as we note, transformers are actually encouraged to make information useful for future states, since the loss at a given position is computed using hidden states at previous positions
and in a fake problem where they have to cache information like this, transformers are able to do so
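you can see that incentive directly in a toy check (hypothetical shapes, single transformer layer): compute the loss at only the *last* position and look at the gradient on earlier positions.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
head = nn.Linear(32, 100)                                  # toy vocab of 100
mask = nn.Transformer.generate_square_subsequent_mask(8)   # causal mask

h = torch.randn(1, 8, 32, requires_grad=True)  # states at positions 0..7
logits = head(layer(h, src_mask=mask))
loss_last = nn.functional.cross_entropy(logits[0, -1:], torch.tensor([3]))
loss_last.backward()

# nonzero: the loss at position 7 reaches back through attention and
# shapes what positions 0..6 compute -- the incentive to pre-cache
print(h.grad[0, :7].abs().sum() > 0)  # tensor(True)
```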