🥇 On sentence tasks, the XXL models set a new state-of-the-art. Sadly, they are quite slow
🥈On semantic search tasks (given query, find relevant passages), better and faster models exist
Sentence-T5 and GTR have been trained on:
📁 Large datasets (2B pairs)
🖥️ A lot of compute (2048 batch size)
❓But how does T5 compare if training is comparable?
I tested T5-base & T5 v1.1-base vs. encoder-only models in a comparable setting (same training data & compute):
T5 encoders need quite a lot of training steps before they produce good text embeddings.
🐁 Small datasets: T5 is quite bad
🐕 Medium datasets: T5 catches up, but is still not as good as encoder-only models
🐘 Large datasets: Still unknown if T5 or mpnet-base is better
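A minimal sketch of the comparable setup, assuming sentence-transformers (which loads only the T5 encoder stack for T5 checkpoints) with mean pooling; the exact pooling and hyperparameters of my runs may differ:

```python
from sentence_transformers import SentenceTransformer, models

# T5 encoder + mean pooling as a sentence embedding model
word_emb = models.Transformer("t5-base", max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_emb, pooling])

embeddings = model.encode(["A sentence to embed", "Another sentence"])
print(embeddings.shape)  # (2, 768) for t5-base
```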
GPT-3 Embeddings by @OpenAI were announced this week.
📈 I was excited and tested them on 20 datasets
😢 Sadly, they are worse than open models that are 1,000x smaller
💰 Running the @OpenAI models can be 1 million times more expensive
I tested the text similarity models on 14 datasets from different domains (emails, papers, online communities) and on various tasks (clustering, retrieval, paraphrase mining).
The 175B model is actually worse than a tiny 22M-parameter MiniLM model that can run in your browser.
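For scale, this is roughly how such a tiny model is used; the checkpoint below is an illustrative ~22M-parameter MiniLM, not necessarily the exact one from the benchmark:

```python
from sentence_transformers import SentenceTransformer, util

# ~22M parameter MiniLM sentence embedding model (illustrative choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

emb = model.encode(["How do I reset my password?",
                    "I forgot my login credentials."])
print(util.cos_sim(emb[0], emb[1]))  # cosine similarity of the two sentences
```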
Next, I tested the text-search models. Here, the results look good for a dense model.
However, compared to SpladeV2, a state-of-the-art sparse model that is 2,600x smaller, you get just a 0.1 point improvement.
💰 Encoding costs? $1,000,000 for GPT-3 vs. $3 for SpladeV2
🚨Model Alert🚨
🏋️♂️ State-of-the-art sentence & paragraph embedding models
🍻State-of-the-art semantic search models
🔢State-of-the-art on MS MARCO for dense retrieval
📂1.2B training pairs corpus
👩🎓215M Q&A-training pairs
🌐Everything Available: SBERT.net
🧵
🚨 All-Purpose Sentence & Paragraph Embedding Models
As part of the JAX community week from @huggingface, we collected a corpus of 1.2 billion training pairs => great embeddings for sentences & paragraphs
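Usage sketch (all-mpnet-base-v2 is one of the released all-* checkpoints; the others work the same way):

```python
from sentence_transformers import SentenceTransformer

# One of the all-* models trained on the 1.2B pair corpus
model = SentenceTransformer("all-mpnet-base-v2")

embeddings = model.encode([
    "Sentence embeddings map text to a dense vector.",
    "Paragraphs work too, up to the model's sequence length limit.",
])
print(embeddings.shape)
```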
🚨Semantic Search Models
Performing well on out-of-domain data is challenging for neural retrieval models. By training on 215 million (question, answer)-pairs, we get models that generalize well across domains.
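Sketch of query/passage retrieval with one of the released multi-qa checkpoints (the "-dot" variants are meant to be scored with dot product):

```python
from sentence_transformers import SentenceTransformer, util

# Q&A-trained model; "-dot" variants are scored with dot product
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

query_emb = model.encode("How many people live in London?")
passage_embs = model.encode([
    "Around 9 million people live in London.",
    "Berlin is the capital of Germany.",
])

scores = util.dot_score(query_emb, passage_embs)
print(scores)  # higher score = more relevant passage
```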
📺How to train state-of-the-art sentence embeddings? 📺
Just uploaded my 3-part video series on the theory of how to train state-of-the-art sentence embedding models:
📺 Part 1 - Applications & Definition
- Why do we need dense representations?
- Definition of dense representation
- What does "semantically similar" mean?
- Applications: Clustering, Search, Zero- & Few-Shot Classification... (tiny clustering sketch after this list)
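A tiny illustration of the clustering application mentioned above (model and sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "The new phone has a great camera.",
    "Battery life on this smartphone is excellent.",
    "The pasta at that restaurant was delicious.",
    "I loved the pizza they served.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)  # phone sentences vs. food sentences fall into different clusters
```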
📺 Part 2 - Training & Loss Functions
- Basic Training Setup
- Loss functions: Contrastive Loss, Triplet Loss, Batch Hard Triplet Loss, Multiple Negatives Ranking Loss (training sketch after this list)
- Training with hard negatives for semantic search
- Mining of hard negatives
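A hedged sketch of such a setup with sentence-transformers, using Multiple Negatives Ranking Loss on (query, positive, hard negative) triplets; base checkpoint, data, and hyperparameters are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Toy (query, positive passage, hard negative passage) triplets
train_examples = [
    InputExample(texts=["How many people live in Berlin?",
                        "Berlin has a population of around 3.7 million.",
                        "Berlin is the capital and largest city of Germany."]),
    InputExample(texts=["What is the boiling point of water?",
                        "Water boils at 100 degrees Celsius at sea level.",
                        "Water is a molecule made of hydrogen and oxygen."]),
]

model = SentenceTransformer("distilbert-base-uncased")  # any base checkpoint
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# In-batch negatives plus the provided hard negative per example
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```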
Currently, 4 state-of-the-art models are supported (usage sketch after the list):
- OPUS-MT models from @HelsinkiNLP (individual models for 150+ languages)
- mBART50 many-to-many translation for 50 langs from @facebookai
- m2m_100 many-to-many translation for 100 langs from @facebookai (418M and 1.2B version)
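Loading any of the backends follows the same pattern; the identifier strings are assumptions based on the EasyNMT README, with 'opus-mt' shown here:

```python
from easynmt import EasyNMT

# Other backend identifiers: 'mbart50_m2m', 'm2m_100_418M', 'm2m_100_1.2B'
# (check the EasyNMT README for the exact names)
model = EasyNMT("opus-mt")
print(model.translate("Hallo Welt, wie geht es dir?", target_lang="en"))
```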
Document translation: Transformer-based models have a length limit of 512 / 1024 word pieces.
EasyNMT can translate documents of any length by splitting them into smaller chunks, translating these, and then reconstructing the full document.
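Document-level translation is the same call; the chunking and reassembly happen internally (sketch, assuming the 'opus-mt' backend):

```python
from easynmt import EasyNMT

model = EasyNMT("opus-mt")

# A document far longer than the 512 word-piece limit of the underlying model
long_document = "Dies ist ein sehr langes Dokument über maschinelle Übersetzung. " * 500

translation = model.translate(long_document, target_lang="en")
print(translation[:200])
```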