jack morris

@jxmnop

Aug 13 • 14 tweets • 5 min read

OpenAI hasn’t open-sourced a base model since GPT-2 in 2019. they recently released GPT-OSS, which is reasoning-only...

or is it?

turns out that underneath the surface, there is still a strong base model. so we extracted it.

introducing gpt-oss-20b-base 🧵

if you're not familiar with base models: here are some samples comparing our new model to the original!

we basically reversed the alignment part of LLM training, so we have something that produces natural-looking text again.

the outputs can be pretty random 🤷‍♂️

ALIGNMENT

turning gpt-oss back into a base model appears to have trivially reversed its alignment

it will tell us how to build a bomb. it will list all the curse words it knows. it will plan a robbery for me.

MEMORIZATION

after basemodelization, we can trivially test GPT-OSS for memorization by prompting it with strings from copyrighted materials and checking the outputs

in my short tests i found 3/6 excerpts from books to be memorized 😳

gpt-oss *definitely* knows harry potter...
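here's a minimal sketch of that kind of check in Python. it is not the harness used in this thread. `generate` is a hypothetical callable that wraps whatever inference setup you have. the rule "memorized = at least 90% of the next 50 words match position by position" is an assumed and deliberately crude criterion.

```python
from typing import Callable


def memorization_check(
    excerpt: str,
    generate: Callable[[str, int], str],  # hypothetical: (prompt, max_words) -> continuation
    prefix_words: int = 50,
    check_words: int = 50,
    threshold: float = 0.9,  # assumed cutoff, not from the thread
) -> bool:
    """Prompt with the first `prefix_words` words of an excerpt and report whether
    the model's continuation reproduces the next `check_words` words."""
    words = excerpt.split()
    prefix = " ".join(words[:prefix_words])
    reference = words[prefix_words:prefix_words + check_words]
    if not reference:
        raise ValueError("excerpt is too short for the chosen prefix length")
    generated = generate(prefix, len(reference)).split()
    # Position-wise word match: a crude proxy for verbatim recall.
    matches = sum(1 for g, r in zip(generated, reference) if g == r)
    return matches / len(reference) >= threshold
```

a real test would probably compare tokens instead of whitespace-split words. it would also decode greedily, so that sampling noise doesn't hide a memorized passage.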

some backstory:

so last thursday and friday night i was trying the wrong approach: jailbreaking

i wanted to discover a prompt that would trick the model into becoming a base model again

this might be possible but seems really hard; the surface-level alignment is pretty strong

the other day i was chatting with @johnschulman2 and received an excellent suggestion:

why not frame this 'alignment reversal' as optimization?

we can use a subset of web text to search for the smallest possible model update that makes gpt-oss behave as a base model

Principle 1. Low-rankedness
there’s a commonly shared idea that pretraining is how all the information is stored in model weights, and alignment/RL simply focuses the output distribution on a very narrow subset of outputs that are good for conversation (and reasoning)

so if this is true, then the gpt-oss model is only a small update away from its original pretrained model weights

in other words: there exists some sufficiently low-rank update in the direction of pretraining that can “reverse” the post-training process
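one way to write this down uses the usual LoRA parameterization. the notation is ours and not the thread's. W_0 is a frozen gpt-oss weight matrix, B and A are the trainable factors, and r is the rank. θ_0 is the full set of gpt-oss weights, Δθ collects every BA update, and D_web is a pretraining-like text corpus. the thread doesn't give the rank or say which layers get adapted.

```latex
% Low-rank update to each adapted (frozen) weight matrix W_0
W = W_0 + \Delta W, \qquad \Delta W = B A, \qquad
B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k)

% Fit only the A, B factors with the ordinary next-token loss on web text
\min_{\{A, B\}} \; \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{web}}}
  \Big[ -\sum_{t} \log p_{\theta_0 + \Delta\theta}\big(x_t \mid x_{<t}\big) \Big]
```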

Principle 2. Data Agnosticism

additionally we need to remember that we’re trying to restore the capability of the original model, NOT continue pretraining it. we don’t want the model to learn anything new. we just want to re-enable free-text generation

so it doesn’t matter what data we use as long as it resembles typical pretraining data. i chose FineWeb because it’s open, relatively high-quality, and i already had it downloaded. we only use 20,000 documents or so

so practically we apply a very tiny low-rank LoRA to just a few linear layers and train with data of the form “ ….” as in typical pretraining.
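here's a pure-Python sketch of a LoRA-wrapped linear layer, for illustration only. the rank, the `alpha` scaling, and the init scale are assumptions, and this is not the Hugging Face code the thread used. B starts at zero, so before any training the layer reproduces the post-trained model exactly. training then only moves the small A and B factors.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


class LoRALinear:
    """Frozen weight W0 (d_out x d_in) plus a trainable rank-r update B @ A.

    Output: W0 x + (alpha / r) * B (A x).
    """

    def __init__(self, w0: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        d_out, d_in = len(w0), len(w0[0])
        rng = random.Random(seed)
        self.w0 = w0  # frozen, never updated
        # A: r x d_in, small random init (assumed scale)
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B: d_out x r, zero init so the update starts at exactly zero
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def __call__(self, x: Vector) -> Vector:
        base = matvec(self.w0, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_params(self) -> int:
        """Number of entries in A and B (the only parameters that get trained)."""
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])
```

for a 4096 x 4096 matrix with r = 16, this trains (4096 + 4096) * 16 = 131,072 parameters instead of 16,777,216, which is under 1%.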

https://x.com/jxmnop/status/1954931353939501461

by the way, the open tools for finetuning arbitrary MoEs are horrible

i ended up using HF but it can only train in bf16 and crashes often. so i wrote a harness that checkpoints training frequently and skips batches that activate too many experts and OOM

this worked. but i felt bad about it.
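a minimal sketch of that kind of harness follows. it is not the author's code. `train_step` and `save_checkpoint` are hypothetical callables you'd supply. in PyTorch you would probably pass `torch.cuda.OutOfMemoryError` in `oom_errors`, and a real loop would also need to clear partial gradients and cached memory after a failed batch.

```python
from typing import Callable, Iterable, List, Tuple, Type


def run_training(
    batches: Iterable[object],
    train_step: Callable[[object], float],  # hypothetical: runs one step, returns the loss
    save_checkpoint: Callable[[int], None],  # hypothetical: persists state at a step count
    checkpoint_every: int = 100,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> Tuple[List[float], List[int]]:
    """Run train_step on each batch, skip batches that run out of memory, and
    checkpoint every `checkpoint_every` successful steps.

    Returns (losses of successful steps, indices of skipped batches).
    """
    losses: List[float] = []
    skipped: List[int] = []
    for i, batch in enumerate(batches):
        try:
            losses.append(train_step(batch))
        except oom_errors:
            # e.g. a batch whose tokens pile onto too many experts
            skipped.append(i)
            continue
        if len(losses) % checkpoint_every == 0:
            save_checkpoint(len(losses))
    return losses, skipped
```

checkpointing often is the cheap insurance here: if the process crashes outright rather than raising, you lose at most `checkpoint_every` steps.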

now go download the model on HF

prompt it! align it! finetune it!

if you notice any bugs, let me know

huggingface.co/jxm/gpt-oss-20…

> disclaimer
i have no idea about the real training of gpt-oss. i don't work for OpenAI, and i have no insider info whatsoever

> assumptions
they probably did standard pre-, mid-, and post-training with open and synthetic data

PS. here are two weird things that surprised me

1. the alignment still holds when you trick the base model into acting like an assistant by writing "Human: ... Assistant: ..."

2. somehow the model still can *be* an assistant if you go back to using the chat template. it still reasons fine

i guess LoRA really is low-rank...

thanks @johnschulman2 for the great idea and thanks @srush_nlp for the GPUs :-)

some fun future work
- generate from this model to check more thoroughly for memorization
- try the 120B version
- try instruction-tuning
- compare to other base models via 'model diffing'
- compare to GPT-{2, 3}

@johnschulman2 @srush_nlp what do u think @sama?

More from @jxmnop

jack morris

@jxmnop

Aug 8

curious about the training data of OpenAI's new gpt-oss models? i was too.

so i generated 10M examples from gpt-oss-20b, ran some analysis, and the results were... pretty bizarre

time for a deep dive 🧵

here's a map of the embedded generations

the model loves math and code. i prompt with nothing and yet it always reasons. it just talks about math and code, and mostly in English

math – probability, ML, PDEs, topology, diffeq
code – agentic software, competitive programming, data science

first thing to notice is that practically none of the generations resemble natural webtext. but surprisingly none of them look like normal chatbot interactions either

this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.

jack morris

@jxmnop

Jun 24

In the beginning, there was BERT.

Eventually BERT gave rise to RoBERTa. Then, DeBERTa. Later, ModernBERT.

And now, NeoBERT. The new state-of-the-art small-sized encoder:

the key insight, i think, is using an optimal depth-to-width ratio for the transformer architecture. and training on good data. a lot of good data.

even though NeoBERT has slightly more parameters, it's still faster AND more effective than ModernBERT for long sequences:

like many important advancements in deep learning, NeoBERT arose from running lots of tiny experiments, learning from them, and stacking the results together into something that works really well:

jack morris

@jxmnop

Jun 20

NEW RESEARCH: Approximating Language Model Training Data from Weights

ever wonder how much information is available in an open-weights model?

DeepSeek R1 weights are 1.2 TB...

what can we learn from all those bits?

our method reverses LLM finetuning to recover data: 🧵

to do this, you need TWO sets of model weights: the initial model and a finetune

this is realistic. open-weights models often come with two checkpoints

instead of one-shot generating data from weights, we select data from the web with gradients that point along the model diff
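a simplified single-step sketch of "select data whose gradients point along the model diff" follows. it is not the paper's algorithm, which the thread says is more complicated. assumptions: gradients are taken at the initial weights, and since a descent step moves weights along the *negative* gradient, each example's −∇ is scored by cosine similarity with θ_ft − θ_init.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_by_model_diff(
    theta_init: Sequence[float],
    theta_ft: Sequence[float],
    example_grads: List[Sequence[float]],  # per-example loss gradients at theta_init
    k: int,
) -> List[int]:
    """Indices of the k candidates whose descent direction (-grad) best
    aligns with the finetuning diff theta_ft - theta_init."""
    diff = [b - a for a, b in zip(theta_init, theta_ft)]
    scores = [cosine([-g for g in grad], diff) for grad in example_grads]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```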

our algorithm is a bit complicated, mostly because computing per-example gradients is hard to do at scale

so we make some efficiency improvements:
- computing grads with vmap
- only using last-layer grads (which are still big, in the case of LMs)
- projecting them to a smaller dim
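the last two tricks can be sketched in a few lines of pure Python. this is an illustration, not the paper's code. it assumes the last layer is a linear softmax head with cross-entropy loss, so the per-example gradient has the closed form (softmax(logits) − onehot(y)) hᵀ. that gradient is then compressed with a Gaussian random projection (entries drawn from N(0, 1/m)). the thread's vmap step, available as `jax.vmap` or `torch.func.vmap`, would batch the per-example computation that is written as a plain function here.

```python
import math
import random
from typing import List, Sequence


def last_layer_grad(hidden: Sequence[float], logits: Sequence[float], target: int) -> List[float]:
    """Cross-entropy gradient w.r.t. the output matrix W (vocab x d) of a linear
    softmax head, flattened row-major: dL/dW = (softmax(logits) - onehot(target)) h^T."""
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    err = [e / total - (1.0 if i == target else 0.0) for i, e in enumerate(exps)]
    return [e * h for e in err for h in hidden]


def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> List[List[float]]:
    """Gaussian random projection with entries ~ N(0, 1/out_dim), so squared
    norms are preserved in expectation."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, std) for _ in range(in_dim)] for _ in range(out_dim)]


def project(proj: List[List[float]], g: Sequence[float]) -> List[float]:
    """Map a flattened gradient down to len(proj) dimensions."""
    return [sum(p * x for p, x in zip(row, g)) for row in proj]
```

reusing one fixed projection (same seed) for every example keeps the compressed gradients comparable with each other, so cosine scores can be computed in the small space.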

jack morris

@jxmnop

Jun 3

new paper from our work at Meta!

**GPT-style language models memorize 3.6 bits per param**

we compute capacity by measuring total bits memorized, using some theory from Shannon (1953)

shockingly, the memorization-datasize curves look like this:
   ___________
  /
 /

(🧵)

this all started from a quest to come up with a proper measurement of model memorization

it's hard to compute *per-example* memorization, because models "share" info between datapoints

so we start with random uniform strings, where sharing isn't possible. and we get this:

we then compute the capacity of different models
(GPT models with varying numbers of layers and hidden dimensions)

averaged over hundreds of models in fp32, we get the following curve, indicating a linear trend of around 3.6 bits-per-parameter, regardless of the exact details:
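a back-of-the-envelope version of that arithmetic is below. taking min(dataset bits, capacity) is a toy idealization of the linear-then-flat curve and is not the paper's estimator. the dataset-entropy formula applies only to uniformly random token strings, where each token carries log2(vocab) bits.

```python
import math

BITS_PER_PARAM = 3.6  # the thread's reported average for GPT-style models in fp32


def capacity_bits(num_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated total memorization capacity in bits."""
    return num_params * bits_per_param


def dataset_bits(num_strings: int, length: int, vocab_size: int) -> float:
    """Entropy of a dataset of uniformly random token strings."""
    return num_strings * length * math.log2(vocab_size)


def expected_memorized_bits(num_params: int, num_strings: int, length: int, vocab_size: int) -> float:
    """Toy model of the curve: grows with the data until it hits capacity, then flat."""
    return min(dataset_bits(num_strings, length, vocab_size), capacity_bits(num_params))
```

by this estimate a 1M-parameter model tops out around 3.6M bits, roughly 450 KB.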

jack morris

@jxmnop

May 21

https://twitter.com/jxmnop/status/1893736235262251289

excited to finally share on arxiv what we've known for a while now:

All Embedding Models Learn The Same Thing

embeddings from different models are SO similar that we can map between them based on structure alone. without *any* paired data

feels like magic, but it's real:🧵

a lot of past research (relative representations, The Platonic Representation Hypothesis, comparison metrics like CCA, SVCCA, ...) has asserted that once they reach a certain scale, different models learn the same thing

this has been shown using various metrics of comparison

we take things a step further. if models E1 and E2 are learning 'similar' representations, what if we were able to actually align them?

and can we do this with just random samples from E1 and E2, by matching their structure?

we take inspiration from 2017 GAN papers that aligned pictures of horses and zebras...

jack morris

@jxmnop

Jan 3

no AI here, just the coolest paper i've seen in a while

turns out the way paints mix (blue + red = purple) is much more complicated than how light mixes (blue + red = pink)

they have to use a little bit of nonlinear modeling to capture this, and "add" paints in this nonlinear latent color space

here's the link

it's software too: scrtwpns.com/mixbox.pdf
