Nils Reimers
Jan 28, 2022 • 7 tweets • 4 min read
GPT-3 Embeddings by @OpenAI were announced this week.

📈 I was excited and tested them on 20 datasets
😢 Sadly, they are worse than open models that are 1,000x smaller
💰 Running the @OpenAI models can be 1 million times more expensive

tinyurl.com/gpt3-emb
I tested the text-similarity models on 14 datasets from different domains (emails, papers, online communities) and various tasks (clustering, retrieval, paraphrase mining).

The 175B model is actually worse than a tiny 22M-parameter MiniLM model that can run in your browser.
Next, I tested the text-search models. Here the results look good for a dense model.

However, compared to SpladeV2, the state-of-the-art sparse model that is 2,600x smaller, you only get a 0.1-point improvement.

💰 Encoding costs? $1,000,000 for GPT-3 vs. $3 for SpladeV2
When evaluated on 6 (query/question, paragraph) tasks, the OpenAI 2.7B & 6.7B parameter models perform on par with an open 110M-parameter model (MPNet). Again, encoding costs are about 1,000x higher.
The @OpenAI embedding models produce extremely high-dimensional vector spaces of up to 12,288 dimensions.

The issue: with more dimensions, your machine needs a lot more memory ($$$) to host such a vector space, and operations like search are much slower.
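To make the memory point concrete, here is a rough back-of-the-envelope sketch in Python (the 10M-document corpus size is an illustrative assumption, not a figure from the thread):

```python
# Approximate index size for float32 embeddings (4 bytes per dimension).
def index_size_gb(num_docs: int, dim: int, bytes_per_dim: int = 4) -> float:
    return num_docs * dim * bytes_per_dim / 1024**3

docs = 10_000_000  # assumed corpus size, for illustration only
print(f"12,288-dim embeddings (GPT-3 Davinci): {index_size_gb(docs, 12288):.0f} GB")
print(f"384-dim embeddings (MiniLM):           {index_size_gb(docs, 384):.1f} GB")
```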
My advice:
💰 Save the $1,000,000 you would need to spend to encode your corpus with GPT-3
📄 Spend $1,000 on annotating task-specific data
🆓 Fine-tune an open model (see the sketch below)
🎉 Use the $999,000 savings to treat your team
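A minimal sketch of what "fine-tune an open model" can look like with the sentence-transformers library; the base model and the training pairs are placeholders for your own annotated data:

```python
# Fine-tune a small open embedding model on annotated (query, relevant passage) pairs.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~22M-parameter open model

train_examples = [  # placeholder domain data
    InputExample(texts=["how do I reset my password", "Steps to reset a forgotten password ..."]),
    InputExample(texts=["refund policy", "Refunds are issued within 14 days of purchase ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # uses in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("my-domain-embedding-model")
```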
You can find the full analysis, further details, more results & explanations, and links to the alternative open models in the blog post:

tinyurl.com/gpt3-emb


More from @Nils_Reimers

Aug 1, 2025
End2End Vision-RAG with Cohere

Our data is multi-modal 🖼️, but most RAG pipelines are still text-only.

This causes massive problems with complex visual information.

With Cmd-A-Vision from @cohere you now get a state-of-the-art vision model for Vision-RAG.
Traditional Text-RAG tries to convert images to Markdown. However, this loses a lot of the rich information represented in the image 😡
Vision-RAG skips these issues by operating in the vision domain end-to-end. No more issues with faulty PDF2Markdown / Image2Markdown.
Jul 3, 2024
Semantic Search on 100M Docs - With 100MB of Memory

GPU-poor and memory-poor, without 500GB of memory to embed & index 100M docs?

Still want to participate in TREC-RAG 2024?

Introducing DiskVectorIndex
Vector Search for the Memory-Poor
100M embeddings with 1024 dimensions in float32 require 381GB. Adding an HNSW vector index, you quickly need 500GB of memory.

How do you make this available to the memory-poor?
Vector Compression
Step 1 is compressing your vectors with Product Quantization (PQ), reducing the size from 4096 bytes to just 128 bytes per vector.

The Cohere Embed V3 models were trained to work extremely well with vector compression, including int8, binary & PQ.
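A minimal sketch of this kind of PQ compression with the faiss library; the random vectors stand in for real embeddings, and the parameters mirror the numbers above (1024 dims, 128 bytes per code):

```python
# Product Quantization with faiss: compress 1024-dim float32 vectors (4096 bytes)
# into 128-byte PQ codes, then run approximate nearest-neighbor search on the codes.
import numpy as np
import faiss

dim = 1024   # embedding dimensionality
m = 128      # sub-quantizers -> 128 bytes per vector (8 bits each)
index = faiss.IndexPQ(dim, m, 8)

vecs = np.random.rand(20_000, dim).astype("float32")  # placeholder embeddings
index.train(vecs)   # learn the PQ codebooks
index.add(vecs)     # store compressed codes instead of the raw float32 vectors

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)  # top-5 approximate neighbors
print(ids)
```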
Mar 13, 2024
🇺🇳 250M Wikipedia Embeddings in 300+ Languages 🇺🇳

What could you build if your RAG had access to Wikipedia in all 300+ languages?

Available for anyone to use, built with our state-of-the-art multilingual embedding model:
huggingface.co/datasets/Coher…
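A sketch of how such a dataset can be streamed from the Hugging Face Hub with the datasets library; the dataset id and language config below are placeholders for the one behind the truncated link above:

```python
# Stream precomputed Wikipedia embeddings from the Hub (dataset id is a placeholder).
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-embeddings", "en",  # placeholder id and config
                    split="train", streaming=True)

for doc in docs:
    # Field names depend on the dataset; records typically hold the passage text
    # plus its precomputed embedding vector.
    print(doc.keys())
    break
```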
@pinecone showed that RAG makes LLMs better. The more data LLMs can retrieve from, the better (higher faithfulness = more factually correct).

But access to large retrieval datasets has so far been challenging 😡

We are here to change this 🚀

pinecone.io/blog/rag-study/
This week, @cohere published the 35B Command-R model, a super-efficient LLM with 128k context length, optimized for production RAG workloads in 10 languages.

It achieves superb results on Knowledge-Intensive Language Tasks (KILT).

More:
txt.cohere.com/command-r/
Dec 16, 2022
🔎 Semantic Search Hackathon 🔍

Today I am kicking off a 7-day virtual hackathon focused on semantic search.

A special focus will be our multilingual embedding model, which makes multilingual search 10x easier while giving you much better results.

Details:
lablab.ai/event/semantic…
Most search functions are rather useless 😰

Search Wikipedia for "the capital of the United States" and the capital punishment article is ranked first. The article for Washington D.C. is not even among the top-20 results 👎

Semantic search can make your search so much better.
Semantic search works by embedding text into a vector space. A search query is mapped into the same space, and the closest points are the most relevant docs for the query.

It gives you a search function that actually works.
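A minimal sketch of this idea with the open sentence-transformers library; the corpus, query, and model name are illustrative:

```python
# Embed a corpus and a query into the same vector space and rank docs by similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open embedding model

corpus = [
    "Washington, D.C. is the capital city of the United States.",
    "Capital punishment is a legal penalty in some U.S. states.",
    "Paris is the capital of France.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("the capital of the United States", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

for hit in hits:
    print(round(hit["score"], 3), corpus[hit["corpus_id"]])
```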
Dec 12, 2022
🇺🇳 Semantic search finally works across languages! 🇺🇳

Semantic search gives great search results, but so far it has worked just for English 😰

Glad to share our new Cohere multilingual embedding model for 100+ languages. And the results are amazing 📈

Details:
txt.cohere.ai/multilingual/
The usage is extremely simple (see the sketch below):
- Sign up for a free dev API key: dashboard.cohere.ai/welcome/regist…
- Install the SDK: pip install cohere
- Call co.embed with the new model
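A minimal sketch of that last step with the Cohere Python SDK; the API key is a placeholder and the model name is assumed to be the multilingual model announced here, so it may differ from the current identifier:

```python
# Embed multilingual texts with the Cohere SDK (key and model name are placeholders).
import cohere

co = cohere.Client("YOUR_API_KEY")

texts = ["Hello world", "Hallo Welt", "Bonjour le monde"]
response = co.embed(
    texts=texts,
    model="multilingual-22-12",  # assumed name of the multilingual embedding model
)

for text, embedding in zip(texts, response.embeddings):
    print(text, len(embedding))
```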
Lexical search for multilingual data is painful 🤬
- Different languages require different tokenizers, stop words, stemmers
- Each language ends up in its own index
- You need language identification for queries & docs
- Common platforms like Elasticsearch only support a few languages 😰
Mar 2, 2022
🧑‍🏫 How do you adapt text embedding models to a domain?

😟 Text embedding models perform poorly on unseen domains
❓ How do you encode words you have never seen?

🎉 Adaptive pre-training and Generative Pseudo Labeling can help

A 🧵 with methods & results
😟 Text embedding models often perform poorly on unseen domains.

The issue is that they don't know what certain words mean or how to represent them in the vector space.

If you have never seen the word BERT, how would you know that it is connected to deep learning & NLP?
🏋️‍♀️ Option 1: Adaptive Pre-Training
- You pre-train on your target domain
- You fine-tune on labeled data, e.g. from huggingface.co/datasets/sente…

The issue: fine-tuning on labeled data can be expensive, especially for large datasets.
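A minimal sketch of the adaptive pre-training step (continued masked language modeling on unlabeled in-domain text) with Hugging Face transformers; the base model, corpus file, and hyperparameters are placeholders:

```python
# Domain-adaptive pre-training sketch: continue MLM training of an encoder on
# unlabeled in-domain text before fine-tuning it as an embedding model.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # placeholder base encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One in-domain sentence/paragraph per line (placeholder path)
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="adapted-encoder", num_train_epochs=1,
                         per_device_train_batch_size=32)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
model.save_pretrained("adapted-encoder")      # then fine-tune, e.g. with sentence-transformers
tokenizer.save_pretrained("adapted-encoder")
```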
