🥇 On sentence tasks, the XXL models set a new state-of-the-art. Sadly, they are quite slow
🥈On semantic search tasks (given query, find relevant passages), better and faster models exist
Sentence-T5 and GTR have been trained on:
📁 Large datasets (2B pairs)
🖥️ A lot of compute (2048 batch size)
❓But how does T5 compare if training is comparable?
I tested T5-base & T5 v1.1-base vs. encoder-only models in a comparable setting (same training data & compute):
T5 encoders need quite a lot of training steps before they produce good text embeddings.
🐁 Small datasets: T5 is quite bad
🐕 Medium datasets: T5 catches up, but is still not as good as encoder-only models
🐘 Large datasets: Still unknown if T5 or mpnet-base is better
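A minimal sketch of the comparable setup, assuming sentence-transformers (which loads only the T5 encoder stack for T5 checkpoints) with mean pooling; the exact pooling and hyperparameters of my runs may differ:

```python
from sentence_transformers import SentenceTransformer, models

# T5 encoder + mean pooling as a sentence embedding model
word_emb = models.Transformer("t5-base", max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_emb, pooling])

embeddings = model.encode(["A sentence to embed", "Another sentence"])
print(embeddings.shape)  # (2, 768) for t5-base
```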
GPT-3 Embeddings by @OpenAI were announced this week.
📈 I was excited and tested them on 20 datasets
😢 Sadly, they are worse than open models that are 1,000x smaller
💰 Running the @OpenAI models can be 1 million times more expensive
I tested the text similarity models on 14 datasets from different domains (emails, papers, online communities) and on various tasks (clustering, retrieval, paraphrase mining).
The 175B model is actually worse than a tiny 22M-parameter MiniLM model that can run in your browser.
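For scale, this is roughly how such a tiny model is used; the checkpoint below is an illustrative ~22M-parameter MiniLM, not necessarily the exact one from the benchmark:

```python
from sentence_transformers import SentenceTransformer, util

# ~22M parameter MiniLM sentence embedding model (illustrative choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

emb = model.encode(["How do I reset my password?",
                    "I forgot my login credentials."])
print(util.cos_sim(emb[0], emb[1]))  # cosine similarity of the two sentences
```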
Next, I tested the text-search models. Here, the results look good for a dense model.
However, compared to SpladeV2, a state-of-the-art sparse model that is 2,600x smaller, you get just a 0.1 point improvement.
💰 Encoding costs? $1,000,000 for GPT-3 vs. $3 for SpladeV2
🚨Model Alert🚨
🏋️♂️ State-of-the-art sentence & paragraph embedding models
🍻State-of-the-art semantic search models
🔢State-of-the-art on MS MARCO for dense retrieval
📂1.2B training pairs corpus
👩🎓215M Q&A-training pairs
🌐Everything Available: SBERT.net
🧵
🚨 All-Purpose Sentence & Paragraph Embedding Models
As part of the JAX community week from @huggingface, we collected a corpus of 1.2 billion training pairs => great embeddings for sentences & paragraphs
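Usage sketch (all-mpnet-base-v2 is one of the released all-* checkpoints; the others work the same way):

```python
from sentence_transformers import SentenceTransformer

# One of the all-* models trained on the 1.2B pair corpus
model = SentenceTransformer("all-mpnet-base-v2")

embeddings = model.encode([
    "Sentence embeddings map text to a dense vector.",
    "Paragraphs work too, up to the model's sequence length limit.",
])
print(embeddings.shape)
```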
🚨Semantic Search Models
Performing well on out-of-domain data is challenging for neural retrieval models. By training on 215 million (question, answer)-pairs, we get models that generalize well across domains.
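Sketch of query/passage retrieval with one of the released multi-qa checkpoints (the "-dot" variants are meant to be scored with dot product):

```python
from sentence_transformers import SentenceTransformer, util

# Q&A-trained model; "-dot" variants are scored with dot product
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

query_emb = model.encode("How many people live in London?")
passage_embs = model.encode([
    "Around 9 million people live in London.",
    "Berlin is the capital of Germany.",
])

scores = util.dot_score(query_emb, passage_embs)
print(scores)  # higher score = more relevant passage
```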
📺How to train state-of-the-art sentence embeddings? 📺
Just uploaded my 3-part video series on the theory of how to train state-of-the-art sentence embedding models:
📺 Part 1 - Applications & Definition
- Why do we need dense representations?
- Definition of dense representation
- What does "semantically similar" mean?
- Applications: Clustering, Search, Zero- & Few-Shot Classification... (tiny clustering sketch after this list)
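A tiny illustration of the clustering application mentioned above (model and sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "The new phone has a great camera.",
    "Battery life on this smartphone is excellent.",
    "The pasta at that restaurant was delicious.",
    "I loved the pizza they served.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)  # phone sentences vs. food sentences fall into different clusters
```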
📺 Part 2 - Training & Loss Functions
- Basic Training Setup
- Loss functions: Contrastive Loss, Triplet Loss, Batch Hard Triplet Loss, Multiple Negatives Ranking Loss (training sketch after this list)
- Training with hard negatives for semantic search
- Mining of hard negatives
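A hedged sketch of such a setup with sentence-transformers, using Multiple Negatives Ranking Loss on (query, positive, hard negative) triplets; base checkpoint, data, and hyperparameters are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Toy (query, positive passage, hard negative passage) triplets
train_examples = [
    InputExample(texts=["How many people live in Berlin?",
                        "Berlin has a population of around 3.7 million.",
                        "Berlin is the capital and largest city of Germany."]),
    InputExample(texts=["What is the boiling point of water?",
                        "Water boils at 100 degrees Celsius at sea level.",
                        "Water is a molecule made of hydrogen and oxygen."]),
]

model = SentenceTransformer("distilbert-base-uncased")  # any base checkpoint
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# In-batch negatives plus the provided hard negative per example
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```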
Currently, 4 state-of-the-art models are supported (usage sketch after the list):
- OPUS-MT models from @HelsinkiNLP (individual models for 150+ languages)
- mBART50 many-to-many translation for 50 langs from @facebookai
- m2m_100 many-to-many translation for 100 langs from @facebookai (418M and 1.2B version)
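Loading any of the backends follows the same pattern; the identifier strings are assumptions based on the EasyNMT README, with 'opus-mt' shown here:

```python
from easynmt import EasyNMT

# Other backend identifiers: 'mbart50_m2m', 'm2m_100_418M', 'm2m_100_1.2B'
# (check the EasyNMT README for the exact names)
model = EasyNMT("opus-mt")
print(model.translate("Hallo Welt, wie geht es dir?", target_lang="en"))
```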
Document translation: Transformer-based models have a length limit of 512 / 1024 word pieces.
EasyNMT can translate documents of any length by splitting them into smaller chunks, translating these, and then reconstructing the full document.
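Document-level translation is the same call; the chunking and reassembly happen internally (sketch, assuming the 'opus-mt' backend):

```python
from easynmt import EasyNMT

model = EasyNMT("opus-mt")

# A document far longer than the 512 word-piece limit of the underlying model
long_document = "Dies ist ein sehr langes Dokument über maschinelle Übersetzung. " * 500

translation = model.translate(long_document, target_lang="en")
print(translation[:200])
```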