GPT-3 Embeddings by @OpenAI were announced this week.
I was excited and tested them on 20 datasets.
Sadly, they are worse than open models that are 1,000x smaller.
Running @OpenAI models can be 1 million times more expensive.
I tested the text similarity models on 14 datasets from different domains (emails, papers, online communities) across various tasks (clustering, retrieval, paraphrase mining).
The 175B model is actually worse than a tiny 22M-parameter MiniLM model that can run in your browser.
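For reference, here is a minimal sketch of what running such an open model looks like with sentence-transformers (all-MiniLM-L6-v2 is an assumed example checkpoint of the MiniLM family; the sentences are just illustrative):

```python
# Minimal sketch: sentence similarity with an open ~22M-parameter MiniLM model.
# "all-MiniLM-L6-v2" is an assumed example checkpoint from sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(
    ["How do I reset my password?", "Steps to recover account access"],
    convert_to_tensor=True,
)
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity of the pair
```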
Next, I tested the text-search models. Here, the results look good for a dense model.
However, compared to the state-of-the-art sparse model SpladeV2, which is 2,600x smaller, you only get a 0.1 improvement.
Encoding costs? $1,000,000 for GPT-3 vs. $3 for SpladeV2.
When evaluated on 6 (query/question, paragraph) tasks, the OpenAI 2.7B & 6.7B parameter models perform on par with an open 110M parameter model (MPNet). Again, encoding costs are about 1,000x higher.
The @OpenAI embedding models produce extremely high-dimensional vector spaces with up to 12,288 dimensions.
The issue: with more dimensions, your machine requires a lot more memory ($$$) to host such a vector space, and operations like search are a lot slower.
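A quick back-of-the-envelope comparison, assuming float32 storage and a 384-dimensional open model such as MiniLM:

```python
# Bytes per embedding at float32 (4 bytes per dimension)
for name, dim in [("OpenAI, 12,288 dims", 12288), ("MiniLM, 384 dims (assumed)", 384)]:
    print(f"{name}: {dim * 4:,} bytes per vector")
# OpenAI, 12,288 dims: 49,152 bytes per vector
# MiniLM, 384 dims (assumed): 1,536 bytes per vector
```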
My advice:
- Save the $1,000,000 you would need to spend to encode your corpus with GPT-3
- Spend $1,000 to annotate task-specific data
- Fine-tune an open model
- Use the $999,000 savings to treat your team
You can find the full analysis, further details, more results & explanations, and links to the alternative open models in the blog post:
Vector Search for Memory-Poor
100M embeddings with 1,024 dimensions in float32 require 381 GB. Adding an HNSW vector index, you quickly need 500 GB of memory.
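As a sanity check, the raw-vector number can be reproduced with simple arithmetic:

```python
# Raw float32 storage for 100M embeddings with 1,024 dimensions each
n_vectors, dim, bytes_per_float = 100_000_000, 1024, 4
print(f"{n_vectors * dim * bytes_per_float / 1024**3:.0f} GiB")  # ~381 GiB, before HNSW overhead
```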
How do you make it available to the Memory-Poor?
Vector Compression
Step 1 is compressing your vectors with Product Quantization (PQ), reducing the size from 4096 bytes to just 128 bytes.
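A minimal sketch of that compression step with faiss (an assumed library choice; the dimension and sub-quantizer count are picked to match the 4,096 → 128 byte example above):

```python
import numpy as np
import faiss  # assumption: faiss-cpu is installed

d, m, nbits = 1024, 128, 8   # 1,024-dim float32 vectors -> 128 one-byte PQ codes
xb = np.random.rand(100_000, d).astype("float32")  # stand-in for real embeddings

index = faiss.IndexPQ(d, m, nbits)
index.train(xb)              # learn the PQ codebooks on (a sample of) the corpus
index.add(xb)                # stores 128-byte codes instead of 4,096-byte vectors

xq = np.random.rand(5, d).astype("float32")
distances, ids = index.search(xq, 10)  # approximate nearest-neighbor search on the codes
```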
Cohere Embed V3 models were trained to work extremely well with vector compression, including int8, binary & PQ.
This week, @cohere published the 35B Command-R model, a super-efficient LLM with 128k context length, optimized for production RAG workloads in 10 languages.
It achieves superb results for Knowledge Intensive Language Tasks (KILT).
Search on Wikipedia for "the capital of the United States" and the capital punishment article is ranked first. The article for Washington, D.C. is not even among the top-20 results.
Semantic search can make your search so much better.
Semantic search works by embedding text into a vector space. A search query is mapped into the same space, and the closest points are the most relevant documents for the query.
It gives you a search function that actually works.
The usage is extremely simple:
- Sign up for a free dev API key: dashboard.cohere.ai/welcome/regist…
- Install the SDK: pip install cohere
- Call co.embed with the new model
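A minimal sketch of the embed-and-search flow with the Python SDK (the model name, input_type values, and toy documents are assumptions, not taken from the steps above):

```python
import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")  # placeholder key

docs = ["Washington, D.C. is the capital of the United States.",
        "Capital punishment is a legal penalty in some countries."]
doc_emb = np.array(co.embed(texts=docs,
                            model="embed-multilingual-v3.0",   # assumed model name
                            input_type="search_document").embeddings)
query_emb = np.array(co.embed(texts=["the capital of the United States"],
                              model="embed-multilingual-v3.0",
                              input_type="search_query").embeddings)

# Cosine similarity: the closest document is the most relevant one for the query
scores = (doc_emb @ query_emb.T).ravel() / (
    np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(query_emb))
print(docs[int(scores.argmax())])
```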
Lexical search for multilingual data is painful:
- Different languages require different tokenizers, stop words, and stemmers
- Each language ends up in its own index
- You need language identification for queries & docs
- Common platforms like Elasticsearch only support a few languages
How do you adapt text embedding models to a domain?
Text embedding models perform poorly on unseen domains.
How do you encode words you have never seen?
Adaptive pre-training and Generative Pseudo Labeling can help.
A thread with methods & results.
Text embedding models often perform poorly on unseen domains.
The issue is that they don't know what certain words mean and how to represent them in the vector space.
If you have never seen the word BERT, how would you know that it is connected to deep learning & NLP?
Option 1: Adaptive Pre-Training
- You pre-train on your target domain
- You fine-tune on labeled data, e.g. from huggingface.co/datasets/sente…
The issue: Fine-tuning on labeled data can be expensive, especially for large datasets.
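To make Option 1 concrete, here is a minimal sketch of the adaptive pre-training step using masked language modeling with Hugging Face transformers; the base checkpoint, corpus file, and hyperparameters are placeholder assumptions, and MLM is only one possible objective (TSDAE or SimCSE are common alternatives):

```python
# Adaptive pre-training sketch: continue MLM training on unlabeled in-domain text,
# then fine-tune the adapted checkpoint on labeled data (step 2 above).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical file: one in-domain document per line, no labels needed
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=256),
                    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="domain-adapted-model",
                         per_device_train_batch_size=32,
                         num_train_epochs=1,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()
```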