a lot of past research (relative representations, The Platonic Representation Hypothesis, comparison metrics like CCA, SVCCA, ...) has asserted that once they reach a certain scale, different models learn the same thing
this has been shown using various metrics of comparison
we take things a step further. if models E1 and E2 are learning 'similar' representations, what if we were able to actually align them?
and can we do this with just random samples from E1 and E2, by matching their structure?
we take inspiration from 2017 GAN papers that aligned pictures of horses and zebras...
so yes, we're using a GAN. adversarial loss (to align representations) and cycle consistency loss (to make sure we align the *right* representations)
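a minimal sketch of the cycle-consistency term, assuming the two translators are plain 2-D rotations (the real translators are learned networks; `rotation`, `cycle_loss`, and the toy vectors are hypothetical names and data made up for illustration). the adversarial term is left out because it needs a discriminator:

```python
import math

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def rotation(theta):
    """2-D rotation matrix, standing in for a learned translator (assumption)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def cycle_loss(F, G, xs):
    """Mean squared error between x and G(F(x)) over a batch."""
    total = 0.0
    for x in xs:
        back = matvec(G, matvec(F, x))
        total += sum((a - b) ** 2 for a, b in zip(x, back))
    return total / len(xs)

# toy "E1 embeddings" (hypothetical data, all unit vectors)
xs = [[1.0, 0.0], [0.0, 1.0], [0.6, -0.8]]
F = rotation(0.7)        # E1 -> E2
G_good = rotation(-0.7)  # E2 -> E1, exact inverse of F
G_bad = rotation(0.2)    # E2 -> E1, not an inverse of F

print(cycle_loss(F, G_good, xs))  # close to 0
print(cycle_loss(F, G_bad, xs))   # clearly positive
```

the point of the cycle term: many maps can make translated E1 vectors look like E2 vectors, but only a map with a matching inverse brings each vector back to where it started.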
and it works. here's embeddings from GTR (a T5-based model) and GTE (a BERT-based model), after training our GAN for 50 epochs:
theoretically, the implications of this seem big. we call it The Strong Platonic Representation Hypothesis:
models of a certain scale learn representations that are so similar that we can learn to translate between them, using *no* paired data (just our version of CycleGAN)
and practically, this is bad for vector databases. this means that even if you fine-tune your own model, and keep the model secret, someone with access to the embeddings alone can recover the text they were computed from
embedding inversion without model access 😬
this was joint work with my friends @rishi_d_jha, collin zhang, and vitaly shmatikov at Cornell Tech
our paper "Harnessing the Universal Geometry of Embeddings" is on arXiv today: arxiv.org/abs/2505.12540
check out Rishi's thread for more info, and follow him for additional research updates!
curious about the training data of OpenAI's new gpt-oss models? i was too.
so i generated 10M examples from gpt-oss-20b, ran some analysis, and the results were... pretty bizarre
time for a deep dive 🧵
here's a map of the embedded generations
the model loves math and code. i prompt with nothing and yet it always reasons. it just talks about math and code, and mostly in English
math – probability, ML, PDEs, topology, diffeq
code – agentic software, competitive programming, data science
first thing to notice is that practically none of the generations resemble natural webtext. but surprisingly none of them look like normal chatbot interactions either
this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.
Eventually BERT gave rise to RoBERTa. Then, DeBERTa. Later, ModernBERT.
And now, NeoBERT. The new state-of-the-art small-sized encoder:
the key insight, i think, is using an optimal depth-to-width ratio for the transformer architecture. and training on good data. a lot of good data.
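to make "depth-to-width ratio" concrete, here's a rough sketch using the standard back-of-envelope count for a vanilla transformer block (4·width² for attention plus 8·width² for a 4x MLP, ignoring embeddings, biases, and norms). the two configs are hypothetical, not the real NeoBERT or ModernBERT shapes, and NeoBERT's actual block may differ from this vanilla layout:

```python
def approx_block_params(depth, width, mlp_ratio=4):
    """Rough non-embedding parameter count for a vanilla transformer stack.

    Per layer: 4*width^2 for attention (Q, K, V, output projections)
    plus 2*mlp_ratio*width^2 for the MLP. Biases and norms are ignored.
    """
    per_layer = 4 * width**2 + 2 * mlp_ratio * width**2
    return depth * per_layer

# two hypothetical configs, NOT the real NeoBERT / ModernBERT shapes
deep_narrow = approx_block_params(depth=24, width=512)
shallow_wide = approx_block_params(depth=6, width=1024)

print(deep_narrow, shallow_wide)
```

both shapes land on the same parameter budget under this approximation, which is why the ratio is a real design choice: the same budget can be spent on depth or on width.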
even though NeoBERT has slightly more parameters, it's still faster AND more effective than ModernBERT for long sequences:
like many important advancements in deep learning, NeoBERT arose from running lots of tiny experiments, learning from them, and stacking the results together into something that works really well:
NEW RESEARCH: Approximating Language Model Training Data from Weights
ever wonder how much information is available in an open-weights model?
DeepSeek R1 weights are 1.2 TB...
what can we learn from all those bits?
our method reverses LLM finetuning to recover data: 🧵
to do this, you need TWO sets of model weights: the initial model and a finetune
this is realistic. open-weights models often come with two checkpoints
instead of one-shot generating data from weights, we select data from the web with gradients that point along the model diff
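a toy sketch of that selection idea, assuming a linear model with squared-error loss (nothing like a real LM) and scoring each candidate by the cosine between its descent step and the weight diff. `select_by_alignment` and the candidate data are hypothetical, and the exact scoring rule in the paper may differ:

```python
import math

def grad_squared_error(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def cosine(a, b):
    """Cosine similarity, defined as 0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

def select_by_alignment(w_init, w_ft, candidates, k):
    """Return indices of the k candidates whose descent step best matches the diff."""
    diff = [f - i for f, i in zip(w_ft, w_init)]
    scored = []
    for idx, (x, y) in enumerate(candidates):
        g = grad_squared_error(w_init, x, y)
        step = [-gi for gi in g]  # gradient descent moves weights along -grad
        scored.append((cosine(step, diff), idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

w_init = [0.0, 0.0]
w_ft = [1.0, 0.0]  # the "finetune" moved weights along +x
candidates = [
    ([1.0, 0.0], 1.0),   # training on this pulls w toward +x
    ([0.0, 1.0], 1.0),   # pulls w toward +y
    ([1.0, 0.0], -1.0),  # pulls w toward -x
]
print(select_by_alignment(w_init, w_ft, candidates, k=1))  # [0]
```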
our algorithm is a bit complicated, mostly because computing per-example gradients is hard to do at scale
so we make some efficiency improvements:
- computing grads w vmap
- only using last-layer grads (which are still big, in the case of LMs)
- projecting them to a smaller dim
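the last two tricks can be sketched in plain Python. assumptions: the last layer is a softmax over `W @ hidden` trained with cross-entropy, so the per-example gradient w.r.t. `W` is `(probs - onehot) outer hidden`, and the projection is a fixed Gaussian random matrix. the list comprehension over the batch stands in for `vmap`; all function names and numbers here are hypothetical:

```python
import random

def last_layer_grad(hidden, probs, target):
    """Per-example cross-entropy gradient w.r.t. the output weight matrix,
    flattened row by row: (probs - onehot(target)) outer hidden."""
    out = []
    for j, p in enumerate(probs):
        delta = p - (1.0 if j == target else 0.0)
        out.extend(delta * h for h in hidden)
    return out

def make_projection(in_dim, out_dim, seed=0):
    """Fixed Gaussian random projection matrix, scaled by 1/sqrt(out_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / out_dim ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(P, g):
    """Apply projection matrix P (list of rows) to flat gradient g."""
    return [sum(p * v for p, v in zip(row, g)) for row in P]

# hypothetical batch: (hidden state, predicted probs, target index)
batch = [
    ([0.5, -1.0, 0.0, 2.0], [0.7, 0.2, 0.1], 0),
    ([1.0, 1.0, -0.5, 0.0], [0.1, 0.1, 0.8], 1),
]
P = make_projection(in_dim=12, out_dim=5)  # 3 classes * 4 hidden dims = 12

# one projected gradient per example; in a real setup this loop is the vmap
small = [project(P, last_layer_grad(h, p, t)) for h, p, t in batch]
print(len(small), len(small[0]))  # 2 5
```

the same matrix `P` has to be reused for every example (and for the model diff), otherwise the projected vectors aren't comparable.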
**GPT-style language models memorize 3.6 bits per param**
we compute capacity by measuring total bits memorized, using some theory from Shannon (1953)
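one simple way to count "bits memorized" for a sequence, assuming the tokens were drawn uniformly at random: compare the cost of coding it with a uniform code against the cost of coding it with the model. this is a sketch of the general idea, not the paper's exact estimator, and `memorized_bits` plus its input format are hypothetical:

```python
import math

def memorized_bits(token_logprobs, vocab_size):
    """Bits saved by coding a sequence with the model instead of a uniform code.

    token_logprobs: natural-log probabilities the model assigns to each
    token of one sequence (hypothetical input format).
    """
    n = len(token_logprobs)
    uniform_bits = n * math.log2(vocab_size)
    model_bits = -sum(lp / math.log(2) for lp in token_logprobs)
    return max(0.0, uniform_bits - model_bits)

vocab = 256  # 8 bits per token under a uniform code
unseen = [math.log(1 / vocab)] * 8    # model is no better than uniform
memorized = [math.log(0.99)] * 8      # model is nearly certain of each token

print(memorized_bits(unseen, vocab))     # about 0 bits
print(memorized_bits(memorized, vocab))  # just under 64 bits
```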
shockingly, the memorization-datasize curves look like this:
      ___________
     /
    /
(🧵)
this all started from a quest to come up with a proper measurement of model memorization
it's hard to compute *per-example* memorization, because models "share" info between datapoints
so we start with random uniform strings, where sharing isn't possible. and we get this:
we then compute the capacity of different models
(GPT models with varying numbers of layers and hidden dimensions)
averaged over hundreds of models in fp32, we get the following curve, indicating a linear trend of around 3.6 bits-per-parameter, regardless of the exact details:
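what 3.6 bits-per-parameter means in practice, as a quick arithmetic sketch. the `min` below is an idealized version of the rise-then-plateau curve (memorize everything until capacity runs out), not a fit to the measured data, and the parameter count and dataset sizes are hypothetical:

```python
def capacity_bits(n_params, bits_per_param=3.6):
    """Estimated capacity, using the ~3.6 bits-per-parameter figure."""
    return n_params * bits_per_param

def memorized_bits_idealized(dataset_bits, n_params, bits_per_param=3.6):
    """Idealized curve: memorize everything until capacity, then plateau."""
    return min(dataset_bits, capacity_bits(n_params, bits_per_param))

n_params = 1_000_000  # hypothetical 1M-parameter model
cap = capacity_bits(n_params)

# dataset sizes in bits (hypothetical): below, near, and above capacity
for dataset_bits in (1e6, 3e6, 1e7):
    print(dataset_bits, memorized_bits_idealized(dataset_bits, n_params))

print(cap / 8 / 1e6, "MB of capacity")  # roughly 0.45 MB for 1M params
```

scaled up, the same arithmetic gives about 450 MB of capacity for a 1B-parameter model.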