a lot of past research (relative representations, The Platonic Representation Hypothesis, comparison metrics like CCA, SVCCA, ...) has asserted that once they reach a certain scale, different models learn the same thing
this has been shown using various metrics of comparison
we take things a step further. if models E1 and E2 are learning 'similar' representations, what if we were able to actually align them?
and can we do this with just random samples from E1 and E2, by matching their structure?
we take inspiration from 2017 GAN papers that aligned pictures of horses and zebras...
so yes, we're using a GAN. adversarial loss (to align representations) and cycle consistency loss (to make sure we align the *right* representations)
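for the curious, here's roughly what those two losses look like. this is a minimal PyTorch sketch of the idea, not the exact architecture from our paper: the translator/discriminator shapes and the cycle weight of 10 are placeholder assumptions, and the discriminator training step is omitted.

```python
import torch
import torch.nn as nn

# hypothetical embedding dims for the two spaces E1 and E2
d1, d2 = 768, 1024

# translators between the two embedding spaces (names F, G follow CycleGAN convention)
F = nn.Sequential(nn.Linear(d1, 1024), nn.ReLU(), nn.Linear(1024, d2))  # E1 -> E2
G = nn.Sequential(nn.Linear(d2, 1024), nn.ReLU(), nn.Linear(1024, d1))  # E2 -> E1

# one discriminator per target space, trained (separately) to tell real embeddings from translated ones
D2 = nn.Sequential(nn.Linear(d2, 256), nn.ReLU(), nn.Linear(256, 1))
D1 = nn.Sequential(nn.Linear(d1, 256), nn.ReLU(), nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()

def generator_loss(x1, x2):
    """x1: batch of unpaired E1 embeddings, x2: batch of unpaired E2 embeddings."""
    fake2, fake1 = F(x1), G(x2)

    # adversarial loss: translated embeddings should look "real" to the discriminators
    adv = bce(D2(fake2), torch.ones(len(x1), 1)) + \
          bce(D1(fake1), torch.ones(len(x2), 1))

    # cycle-consistency loss: E1 -> E2 -> E1 should land back where it started
    cyc = (G(fake2) - x1).abs().mean() + (F(fake1) - x2).abs().mean()

    # cycle weight of 10 is a CycleGAN-style default, not necessarily the paper's value
    return adv + 10.0 * cyc
```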
and it works. here are embeddings from GTR (a T5-based model) and GTE (a BERT-based model), after training our GAN for 50 epochs:
theoretically, the implications of this seem big. we call it The Strong Platonic Representation Hypothesis:
models of a certain scale learn representations that are so similar that we can learn to translate between them, using *no* paired data (just our version of CycleGAN)
and practically, this is bad for vector databases. it means that even if you fine-tune your own model and keep it secret, someone with access to the embeddings alone can decode the underlying text
embedding inversion without model access 😬
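to make the attack concrete, here's a hypothetical sketch of the pipeline (the names `translator` and `inverter` are placeholders for illustration, not code from the paper): the attacker maps leaked embeddings into a public embedding space they understand, then runs off-the-shelf embedding inversion there.

```python
import torch

def attack(leaked_embeddings: torch.Tensor, translator, inverter) -> list[str]:
    """Hypothetical attack sketch: the attacker never touches the secret model.

    `translator` is a vec2vec-style mapping (secret space -> known public space)
    trained on unpaired samples; `inverter` stands in for any existing embedding
    inversion model for the public space. Both are assumptions, not our code.
    """
    # step 1: translate leaked embeddings out of the secret model's space
    translated = translator(leaked_embeddings)
    # step 2: run standard embedding inversion in the public space
    return inverter(translated)
```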
this was joint work with my friends @rishi_d_jha, collin zhang, and vitaly shmatikov at Cornell Tech
our paper "Harnessing the Universal Geometry of Embeddings" is on ArXiv today: arxiv.org/abs/2505.12540
check out Rishi's thread for more info, and follow him for additional research updates!
We spent a year developing cde-small-v1, the best BERT-sized text embedding model in the world.
today, we're releasing the model on HuggingFace, along with the paper on ArXiv.
I think our release marks a paradigm shift for text retrieval. let me tell you why👇
Typical text embedding models have two main problems:
1. training them is complicated and requires many tricks: giant batches, distillation, hard negatives...
2. the embeddings don't "know" what corpus they will be used in; consequently, all text spans are encoded the same way
To fix (1), we develop a new training technique: contextual batching. every batch shares a lot of internal context: one batch might be about horse races in Kentucky, the next about differential equations, etc.
this lets us get better performance without big batches or hard negative mining. there's also some cool theory behind it
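roughly, building contextual batches looks something like this. a minimal sketch of the idea, not the exact recipe from the paper: clustering on cheap pre-computed embeddings with k-means is my assumption here.

```python
import numpy as np
from sklearn.cluster import KMeans

def contextual_batches(embeddings: np.ndarray, texts: list[str],
                       batch_size: int = 32, n_clusters: int = 1000):
    """Group training examples so every batch is drawn from one topical cluster.

    `embeddings` are cheap pre-computed vectors (e.g. from an off-the-shelf
    encoder) used only to cluster the corpus -- an assumption, not the paper's recipe.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        np.random.shuffle(idx)
        # yield topic-coherent batches: every example in a batch comes from cluster c
        for i in range(0, len(idx) - batch_size + 1, batch_size):
            yield [texts[j] for j in idx[i:i + batch_size]]
```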
there's a lot of talk today about "what happens" inside language models, since they spend the exact same amount of compute on each token, regardless of difficulty.
we touch on this question in our new theory paper, Do Language Models Plan for Future Tokens?
I think our most crucial finding is that although humans think far ahead while speaking (especially while doing complex reasoning problems), it turns out that transformer language models... don't seem to do that.
they just predict the next token.
it's not that they *can't* plan ahead. as we note, transformers are actually encouraged to make information useful for future states, since the loss at a given position is computed using hidden states at previous positions
and in a fake problem where they have to cache information like this, transformers are able to do so
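here's a toy way to see the "encouraged to plan" point. this is my own minimal sketch, not an experiment from the paper: in a causal transformer, a loss computed only at the *last* position still sends gradient back into earlier positions, which is exactly the pressure that could teach a model to cache information for the future.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# tiny causal transformer layer -- a toy stand-in, not the paper's setup
d_model, seq_len = 32, 8
layer = nn.TransformerEncoderLayer(d_model, nhead=4, dropout=0.0, batch_first=True)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

# input embeddings as leaf tensors so we can inspect their gradients
x = torch.randn(1, seq_len, d_model, requires_grad=True)
h = layer(x, src_mask=causal_mask)

# pretend the training loss only touches the LAST position's hidden state
loss = h[0, -1].pow(2).sum()
loss.backward()

# gradient w.r.t. EARLIER positions is nonzero: the loss at a later position
# reaches back through causal attention, so earlier positions get pushed to
# carry information that future positions will find useful
print(x.grad[0, 0].abs().sum() > 0)   # tensor(True)
print(x.grad[0, 3].abs().sum() > 0)   # tensor(True)
```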