jack morris
research @cornell @meta // language models, information theory, science of AI
Aug 13
OpenAI hasn’t open-sourced a base model since GPT-2 in 2019. they recently released GPT-OSS, which is reasoning-only...

or is it?

turns out that underneath the surface, there is still a strong base model. so we extracted it.

introducing gpt-oss-20b-base 🧵
if you're not familiar with base models: here are some samples comparing our new model to the original!

we basically reversed the alignment part of LLM training, so we have something that produces natural-looking text again.

the outputs can be pretty random 🤷‍♂️
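if you want to poke at it yourself, here's a minimal sampling sketch (the HF model id, dtype, and generation settings are my assumptions, not official usage docs):

```python
# minimal sketch: unconditional-ish sampling from the extracted base model.
# model id / settings below are assumptions, not exact release details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jxm/gpt-oss-20b-base"  # assumed HF id for the extracted base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# base models have no chat template: just feed raw text and let it continue
prompt = "The city of Ithaca sits on the shore of"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```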
Aug 8
curious about the training data of OpenAI's new gpt-oss models? i was too.

so i generated 10M examples from gpt-oss-20b, ran some analysis, and the results were... pretty bizarre

time for a deep dive 🧵

here's a map of the embedded generations

the model loves math and code. i prompt with nothing and yet it always reasons. it just talks about math and code, and mostly in English

math – probability, ML, PDEs, topology, diffeq
code – agentic software, competitive programming, data science
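for the curious, the pipeline is roughly this (model ids, sample count, and the choice of embedder/UMAP are illustrative assumptions, not the exact setup behind the real map):

```python
# rough sketch: sample from an (almost) empty prompt, embed the generations,
# and project them to 2D to get a map.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
import umap  # pip install umap-learn

gen_id = "openai/gpt-oss-20b"  # assumed HF id
tok = AutoTokenizer.from_pretrained(gen_id)
lm = AutoModelForCausalLM.from_pretrained(gen_id, torch_dtype=torch.bfloat16, device_map="auto")

# fall back to EOS if the tokenizer defines no BOS token
start_id = tok.bos_token_id if tok.bos_token_id is not None else tok.eos_token_id

samples = []
for _ in range(1000):  # the real run used ~10M generations
    ids = torch.tensor([[start_id]], device=lm.device)
    out = lm.generate(ids, max_new_tokens=256, do_sample=True, temperature=1.0)
    samples.append(tok.decode(out[0], skip_special_tokens=True))

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any small embedder works
X = embedder.encode(samples)
coords = umap.UMAP(n_components=2).fit_transform(X)  # 2D map of the generations
```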
Jun 24
In the beginning, there was BERT.

Eventually BERT gave rise to RoBERTa. Then, DeBERTa. Later, ModernBERT.

And now, NeoBERT. The new state-of-the-art small-sized encoder.

the key insight, i think, is using an optimal depth-to-width ratio for the transformer architecture. and training on good data. a lot of good data.

even though NeoBERT has slightly more parameters, it's still faster AND more effective than ModernBERT for long sequences.
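if you want to try it as a drop-in encoder, here's a quick sketch with 🤗 transformers (the model id is my assumption; check the model card for the official one):

```python
# sketch: encode a few texts with NeoBERT and mean-pool into sentence vectors.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "chandar-lab/NeoBERT"  # assumed HF id
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

texts = ["NeoBERT handles long sequences.", "So does ModernBERT, but slower."]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# mean-pool the last hidden states into one vector per text
mask = batch["attention_mask"].unsqueeze(-1)
emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
```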
Jun 20
NEW RESEARCH: Approximating Language Model Training Data from Weights

ever wonder how much information is available in an open-weights model?

DeepSeek R1 weights are 1.2 TB...

what can we learn from all those bits?

our method reverses LLM finetuning to recover data: 🧵
to do this, you need TWO sets of model weights: the initial model and a finetune

this is realistic. open-weights models often come with two checkpoints

instead of one-shot generating data from weights, we select data from the web with gradients that point along the model diff
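in code, the selection step looks roughly like this. it's a simplified per-example version of the idea, the finetuned checkpoint name is a hypothetical placeholder, and gpt2 stands in for the real open-weights pair:

```python
# sketch: score candidate web texts by how well their gradient on the base model
# aligns with the (finetuned - base) parameter diff, then keep the top-scoring ones.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
# hypothetical finetuned checkpoint of the SAME architecture (placeholder name)
ft = AutoModelForCausalLM.from_pretrained("my-org/gpt2-finetuned")

# flatten the parameter diff between the two checkpoints into one long vector
diff = torch.cat([(p_ft - p_b).flatten()
                  for p_b, p_ft in zip(base.parameters(), ft.parameters())])

def alignment_score(text: str) -> float:
    """Cosine between this example's gradient on the base model and the model diff."""
    base.zero_grad()
    batch = tok(text, return_tensors="pt")
    loss = base(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    grad = torch.cat([p.grad.flatten() for p in base.parameters()])
    # finetuning moves weights along the NEGATIVE gradient, so flip the sign
    return torch.nn.functional.cosine_similarity(-grad, diff, dim=0).item()

candidates = ["some web document ...", "another web document ..."]
ranked = sorted(candidates, key=alignment_score, reverse=True)  # keep the top texts
```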
Jun 3
new paper from our work at Meta!

**GPT-style language models memorize 3.6 bits per param**

we compute capacity by measuring total bits memorized, using some theory from Shannon (1953)

shockingly, the memorization-datasize curves look like this:
    ___________
   /
  /

(🧵)
this all started from a quest to come up with a proper measurement of model memorization

it's hard to compute *per-example* memorization, because models "share" info between datapoints

so we start with random uniform strings, where sharing isn't possible.
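for intuition, here's a toy sketch of the per-string measurement on random uniform data: memorized bits = bits needed under a uniform prior minus bits the model assigns. the tiny stand-in model and vocab are illustrative, not the paper's setup:

```python
# toy sketch: measure memorization of one random uniform string.
import math
import torch
import torch.nn.functional as F

V, L = 64, 32                     # vocab size and sequence length
x = torch.randint(0, V, (1, L))   # one random uniform string

class TinyLM(torch.nn.Module):
    """Minimal stand-in language model: embed tokens, project to vocab logits."""
    def __init__(self, V, d=32):
        super().__init__()
        self.emb = torch.nn.Embedding(V, d)
        self.out = torch.nn.Linear(d, V)
    def forward(self, x):
        return self.out(self.emb(x))

def memorized_bits(model, x):
    logits = model(x)                                # (1, L, V) next-token logits
    logp = F.log_softmax(logits, dim=-1)
    # model's code length for x[1:] given x[:-1], in bits
    nll_bits = -logp[0, :-1].gather(1, x[0, 1:, None]).sum().item() / math.log(2)
    prior_bits = (L - 1) * math.log2(V)              # uniform-prior code length
    return prior_bits - nll_bits                     # > 0 means the model memorized bits

model = TinyLM(V)
print(memorized_bits(model, x))  # ~0 untrained; grows as the model is trained on x

# summing memorized_bits over the training set and dividing by parameter count
# gives the bits-per-parameter capacity estimate (~3.6 for GPT-style models).
```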
May 21
excited to finally share on arxiv what we've known for a while now:

All Embedding Models Learn The Same Thing

embeddings from different models are SO similar that we can map between them based on structure alone. without *any* paired data

feels like magic, but it's real: 🧵

a lot of past research (relative representations, The Platonic Representation Hypothesis, comparison metrics like CCA, SVCCA, ...) has asserted that once they reach a certain scale, different models learn the same thing

this has been shown using various metrics of comparison
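for reference, one standard metric in this family is linear CKA. here's a sketch; note this is a generic paired-text comparison, not our unsupervised matching method, which needs no paired data:

```python
# linear CKA between two models' embeddings of the SAME texts;
# values near 1 mean the two embedding spaces share very similar geometry.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, d1), Y: (n, d2) — embeddings of the same n texts from two models."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# e.g. X = model_a.encode(texts), Y = model_b.encode(texts);
# linear_cka(X, Y) close to 1 indicates the models embed the texts alike.
```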
Jan 3
no AI here, just the coolest paper i've seen in a while

turns out the way paints mix (blue + red = purple) is much more complicated than how light mixes (blue + red = pink)

they have to use a little bit of nonlinear modeling to capture this, and "add" paints in this nonlinear latent color space
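here's a toy illustration of why additive and paint-style mixing differ (NOT the paper's model — just a crude geometric-mean-of-reflectances stand-in for subtractive mixing):

```python
# toy comparison: additive light mixing averages RGB, while a crude subtractive
# mix takes the geometric mean of reflectances, which skews darker and purpler.
import numpy as np

blue = np.array([0.10, 0.20, 0.90])   # rough reflectance of a blue paint
red  = np.array([0.90, 0.10, 0.20])   # rough reflectance of a red paint

light_mix = (blue + red) / 2           # additive mixing of light -> pinkish magenta
paint_mix = np.sqrt(blue * red)        # geometric mean of reflectances -> dark purple

print("light:", light_mix, "paint:", paint_mix)
```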
Oct 4, 2024
We spent a year developing cde-small-v1, the best BERT-sized text embedding model in the world.

today, we're releasing the model on HuggingFace, along with the paper on ArXiv.

I think our release marks a paradigm shift for text retrieval. let me tell you why 👇

Typical text embedding models have two main problems:
1. training them is complicated and requires many tricks: giant batches, distillation, hard negatives...
2. the embeddings don't "know" what corpus they will be used in; consequently, all text spans are encoded the same way
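a toy sketch of problem (2): you can make embeddings "corpus-aware" by conditioning on a sample of the target corpus. the centering trick below is a crude stand-in, not cde-small-v1's actual two-stage architecture, and the texts are hypothetical:

```python
# toy "contextual" embedding: summarize a corpus sample, then adjust every
# embedding relative to that summary so vectors reflect the corpus they live in.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any off-the-shelf embedder

corpus_sample = ["patient presents with fever", "MRI shows no abnormality"]  # hypothetical
queries = ["scan results", "temperature complaints"]                          # hypothetical

ctx = encoder.encode(corpus_sample).mean(axis=0)     # crude summary of the target corpus

def contextual_embed(texts):
    E = encoder.encode(texts)
    E = E - ctx                                      # remove the shared corpus direction
    return E / np.linalg.norm(E, axis=1, keepdims=True)

q_emb = contextual_embed(queries)                    # embeddings now "know" the corpus
```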
Apr 4, 2024
New Research:

a lot of talk today about "what happens" inside a language model, since it spends the exact same amount of compute on each token, regardless of difficulty.

we touch on this question in our new theory paper, Do Language Models Plan for Future Tokens?

I think our most crucial finding is that although humans think far ahead while speaking (especially while doing complex reasoning problems), it turns out that transformer language models... don't seem to do that.

they just predict the next token.
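a rough way to probe this yourself (not the paper's exact methodology; gpt2, the layer choice, and sample.txt are placeholders): train a linear probe on hidden states at position t to predict the token at t+1 vs. t+2:

```python
# probing sketch: if hidden states carry little extra information about t+2
# beyond t+1, the t+2 probe should do much worse than the next-token probe.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sklearn.linear_model import LogisticRegression

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)

text = open("sample.txt").read()        # any long text file (placeholder path)
ids = tok(text, return_tensors="pt", truncation=True, max_length=512)["input_ids"]
with torch.no_grad():
    hs = model(ids).hidden_states[6][0]  # (seq_len, d) hidden states from a middle layer

X = hs[:-2].numpy()
next_tok = ids[0, 1:-1].numpy()          # token at t+1
future_tok = ids[0, 2:].numpy()          # token at t+2

for name, y in [("t+1", next_tok), ("t+2", future_tok)]:
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
    print(f"probe accuracy for token at {name}: {acc:.2f}")
```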