jack morris
Aug 8 • 14 tweets • 5 min read
curious about the training data of OpenAI's new gpt-oss models? i was too.

so i generated 10M examples from gpt-oss-20b, ran some analysis, and the results were... pretty bizarre

time for a deep dive 🧵
here's a map of the embedded generations

the model loves math and code. i prompt with nothing and yet it always reasons. it just talks about math and code, and mostly in English

math – probability, ML, PDEs, topology, diffeq
code – agentic software, competitive programming, data science
first thing to notice is that practically none of the generations resemble natural webtext. but surprisingly none of them look like normal chatbot interactions either

this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.
and it truly is a tortured model. here the model hallucinates a programming problem about dominos and attempts to solve it, spending over 30,000 tokens in the process

completely unprompted, the model generated and tried to solve this domino problem over 5,000 separate times
ran a classifier over outputs to get a sense of which programming languages gpt-oss knows

they seem to have trained on nearly everything you've ever heard of. especially a lot of Perl

(btw, from my analysis Java and Kotlin should be way higher. classifier may have gone wrong)
what you can't see from the map is that many of the chains start in English but slowly descend into Neuralese

the reasoning chains happily alternate between Arabic, Russian, Thai, Korean, Chinese, and Ukrainian. then they usually make their way back to English (but not always)
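one rough way to flag this kind of codeswitching in a generation is to label each letter by its Unicode script and count how often the script changes. this is only a sketch, not the analysis used in the thread: `script_of`, `script_counts`, and `count_switches` are hypothetical helper names, and the script label is just the first word of the character's Unicode name, so Russian and Ukrainian both show up as CYRILLIC and Chinese shows up as CJK.

```python
import unicodedata
from collections import Counter
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Rough script label for one character (first word of its Unicode name).

    Returns None for digits, punctuation, whitespace, and combining marks.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def script_counts(text: str) -> Counter:
    """Count letters per script, e.g. LATIN, CYRILLIC, ARABIC, THAI, HANGUL, CJK."""
    return Counter(s for s in map(script_of, text) if s is not None)


def count_switches(text: str) -> int:
    """Number of times consecutive letters belong to different scripts."""
    switches = 0
    prev = None
    for ch in text:
        script = script_of(ch)
        if script is None:
            continue  # ignore non-letters so "P(x)" does not count as a switch
        if prev is not None and script != prev:
            switches += 1
        prev = script
    return switches


sample = "so we need P(x) значит 所以 그래서 ok"
print(script_counts(sample))
print(count_switches(sample))  # 4
```

a high switch count per thousand letters would be a cheap signal for picking out the mixed-language chains, though it says nothing about whether the text is still meaningful.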
the OCR conjecture:

some examples include artifacts such as OCRV ROOT, which indicate the training data may have been OCR'd

reading between the lines: OpenAI is scanning books

(for some reason the model loves mentioning how many deaf people live in Malaysia)
what are some explanations for constant codeswitching?

1. OpenAI has figured out RL. the models no longer speak English
2. data corruption issues via OCR or synthetic training
3. somehow i forced the model to output too many tokens and the generations gradually shift out of distribution
there are a small number of creative outputs interspersed throughout

here's one example where the model starts writing a sketch for a Norwegian screenplay 🤷‍♂️
i also learned a lot from this one.

the model is *really* good at using unicode

...but might be bad at physics. what in the world is a 'superhalo function'?
if you want to try the data, here you go, it's on huggingface:

huggingface.co/datasets/jxm/g…

let me know what you find!
FUTURE WORK – deduplication

even though i varied the random seed and used temperature, a lot of the outputs are highly redundant

it would be prudent to deduplicate. i bet there are only 100k or fewer mostly-unique examples here
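a minimal sketch of what that deduplication could look like: drop exact duplicates by hashing normalized text, then drop near-duplicates whose word-shingle Jaccard similarity to an already-kept example passes a threshold. everything here is an assumption for illustration (the helper names, the shingle size, the 0.8 threshold), and the pairwise comparison is quadratic, so for 10M examples you would want something like MinHash with locality-sensitive hashing instead.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, n=5):
    """Set of overlapping n-word windows from the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(texts, threshold=0.8, n=5):
    """Keep the first occurrence of each exact or near-duplicate text."""
    seen_hashes = set()
    kept = []  # list of (original text, shingle set)
    for text in texts:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        sh = shingles(text, n)
        if any(jaccard(sh, kept_sh) >= threshold for _, kept_sh in kept):
            continue  # near-duplicate of something already kept
        kept.append((text, sh))
    return [text for text, _ in kept]
```

the number of examples that survive `dedup` would give a direct check on the "100k or fewer" guess, at least for whatever threshold you pick.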
FUTURE WORK – describing differences

@ZhongRuiqi has some incredible work on methods for describing the difference between two text distributions *in natural language*

we could compare outputs of 20b to the 120b model, or Llama, or GPT-5...
FUTURE WORK – direct extraction

we're working on directly extracting training data from models using RL and other methods. we'll be presenting our first work on this at COLM, and expect more in this space

we may be able to directly extract data from the 120b model... one day 😎
