Nils Reimers

Jan 28 • 7 tweets • 4 min read

GPT-3 Embeddings by @OpenAI was announced this week.

📈 I was excited and tested them on 20 datasets
😢 Sadly, they are worse than open models that are 1000x smaller
💰 Running @OpenAI models can be 1 million times more expensive

tinyurl.com/gpt3-emb

I tested the text similarity models on 14 datasets from different domains (emails, papers, online communities) on various tasks (clustering, retrieval, paraphrase mining).

The 175B model is actually worse than a tiny MiniLM 22M parameter model that can run in your browser.

Next, I tested the text-search models. Here the results look good for a dense model.

However, compared to the state-of-the-art sparse model SpladeV2, which is 2600x smaller, you get only a 0.1 improvement.

💰 Encoding costs? $1,000,000 for GPT-3 vs. $3 for SpladeV2

When evaluated on 6 (query/question, paragraph) tasks, the OpenAI 2.7B & 6.7B parameter models perform on par with an open 110M parameter model (MPNet). Again, encoding costs are about 1000x higher.

The @OpenAI embedding models produce extremely high dimensional vector spaces of up to 12288 dimensions.

The issue: With more dimensions, your machine requires a lot more memory ($$$) to host such a vector space, and operations like search are a lot slower.
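
To put rough numbers on the memory point, here is a minimal back-of-the-envelope sketch. The corpus size of 1 million vectors, float32 storage, and the 384-dimensional comparison model are assumptions for illustration, not figures from the thread. Only the 12288 comes from above. It counts raw vector storage and ignores any index overhead.

```python
def index_size_gib(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size in GiB of a dense vector store (float32 by default), without index overhead."""
    return num_vectors * dims * bytes_per_value / 2**30


NUM_DOCS = 1_000_000   # hypothetical corpus size
LARGE_DIMS = 12288     # largest dimensionality mentioned in the thread
SMALL_DIMS = 384       # assumed dimensionality of a small open model

large_gib = index_size_gib(NUM_DOCS, LARGE_DIMS)
small_gib = index_size_gib(NUM_DOCS, SMALL_DIMS)

print(f"{LARGE_DIMS} dims: {large_gib:.1f} GiB")
print(f"{SMALL_DIMS} dims: {small_gib:.1f} GiB")
print(f"ratio: {large_gib / small_gib:.0f}x")  # 12288 / 384 = 32x
```

Storage grows linearly with the number of dimensions, and so does the cost of each brute-force similarity comparison, which is why search slows down as well.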

My advice:
💰 Save the $1,000,000 you would need to spend to encode your corpus with GPT-3
📄 Spend $1000 to annotate task-specific data
🆓 Fine-tune an open model
🎉 Use the $999,000 in savings to treat your team

You can find the full analysis, further details, more results & explanations, and links to the alternative open models in the blog post:

tinyurl.com/gpt3-emb

More from @Nils_Reimers

Nils Reimers

@Nils_Reimers

Sep 8, 2021

🚨Model Alert🚨
🏋️‍♂️ State-of-the-art sentence & paragraph embedding models
🍻State-of-the-art semantic search models
🔢State-of-the-art on MS MARCO for dense retrieval
📂1.2B training pairs corpus
👩‍🎓215M Q&A-training pairs
🌐Everything Available: SBERT.net
🧵

🚨 All-Purpose Sentence & Paragraph Embeddings Models
As part of the JAX community week from @huggingface we collected a corpus of 1.2 billion training pairs => Great embeddings for sentences & paragraphs

Models: sbert.net/docs/pretraine…
Training Data: huggingface.co/datasets/sente…

🚨Semantic Search Models
Performing well on out-of-domain data is challenging for neural retrieval models. By training on 215 million (question, answer)-pairs, we get models that generalize well across domains.

Models: sbert.net/docs/pretraine…
Data: huggingface.co/sentence-trans…

Nils Reimers

@Nils_Reimers

Jun 22, 2021

📺How to train state-of-the-art sentence embeddings? 📺

Just uploaded my 3-part video series on the theory of how to train state-of-the-art sentence embedding models:

📺 Part 1 - Applications & Definition
- Why do we need dense representation?
- Definition of dense representation
- What does "semantically similar" mean?
- Applications: Clustering, Search, Zero- & Few-Shot-Classification...

📺 Part 2 - Applications & Definition
- Basic Training Setup
- Loss-Functions: Contrastive Loss, Triplet Loss, Batch Hard Triplet Loss, Multiple Negatives Ranking Loss
- Training with hard negatives for semantic search
- Mining of hard negatives
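
As a rough illustration of one of the loss functions listed above, here is a minimal pure-Python sketch of the idea behind Multiple Negatives Ranking Loss: each anchor should score highest against its own positive, while the other positives in the batch act as negatives. The toy 2-dimensional vectors and the `scale` value of 20 are assumptions for illustration. This is not the implementation covered in the videos or used in any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch serves as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)


# Toy embeddings (assumed values): in `aligned`, each positive is close to its anchor.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)  # the aligned batch gives the lower loss
```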

Nils Reimers

@Nils_Reimers

Jan 27, 2021

New Project: EasyNMT (github.com/UKPLab/EasyNMT)

Easy-to-use, state-of-the-art Neural Machine Translation using @huggingface and @fairseq.

- Translation for 150+ languages
- Sentence & document translation
- Automatic Language Detection

Colab example: colab.research.google.com/drive/1X47vgSi…

Currently 4 state-of-the-art models are supported:
- OPUS-MT models from @HelsinkiNLP (individual models for 150+ languages)
- mBART50 many-to-many translation for 50 langs from @facebookai
- m2m_100 many-to-many translation for 100 langs from @facebookai (418M and 1.2B version)

Document translation: Transformer-based models have a length limit of 512 / 1024 word pieces.

EasyNMT is able to translate documents of any length by splitting them into smaller chunks, translating these, and then reconstructing the full document.
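
The split, translate, and reconstruct idea can be sketched roughly as follows. This is not EasyNMT's actual code: the helper names, the regex sentence splitter, and the use of a word count as a stand-in for the word-piece limit are all assumptions. `fake_translate` is a hypothetical placeholder for a real translation model.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Greedily group sentences into chunks of at most max_words words.
    A single sentence longer than max_words becomes its own chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def translate_document(text: str, translate_chunk: Callable[[str], str],
                       max_words: int = 100) -> str:
    """Translate each chunk separately, then rejoin the results in order."""
    return " ".join(translate_chunk(c) for c in split_into_chunks(text, max_words))


def fake_translate(chunk: str) -> str:
    """Hypothetical stand-in for a model call; just upper-cases the text."""
    return chunk.upper()


doc = "First sentence here. Second sentence follows! Is this the third? Yes it is."
chunks = split_into_chunks(doc, max_words=6)
result = translate_document(doc, fake_translate, max_words=6)
print(chunks)
print(result)
```

Splitting on sentence boundaries keeps each chunk self-contained, so the translation model never sees a sentence cut in half.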
