🚨Model Alert🚨
🏋️‍♂️ State-of-the-art sentence & paragraph embedding models
🍻State-of-the-art semantic search models
🔢State-of-the-art on MS MARCO for dense retrieval
📂1.2B training pairs corpus
👩‍🎓215M Q&A training pairs
🌐Everything Available: SBERT.net
🧵
🚨 All-Purpose Sentence & Paragraph Embedding Models
As part of the JAX community week organized by @huggingface, we collected a corpus of 1.2 billion training pairs => great embeddings for sentences & paragraphs

Models: sbert.net/docs/pretraine…
Training Data: huggingface.co/datasets/sente…
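
A minimal usage sketch with the sentence-transformers library (the model name below is one of the released checkpoints; any model from the link above works the same way):

```python
from sentence_transformers import SentenceTransformer, util

# One of the all-* checkpoints from this release
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "SBERT produces dense sentence embeddings.",
    "Sentence embeddings can be compared with cosine similarity.",
]
embeddings = model.encode(sentences)

# Cosine similarity between the two sentences
print(util.cos_sim(embeddings[0], embeddings[1]))
```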
🚨Semantic Search Models
Performing well on out-of-domain data is challenging for neural retrieval models. By training on 215 million (question, answer) pairs, we get models that generalize well across domains.

Models: sbert.net/docs/pretraine…
Data: huggingface.co/sentence-trans…
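
A sketch of semantic search with one of these models (the model name is an assumption; the multi-qa-* checkpoints were trained for dot-product similarity):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")

query_emb = model.encode("How many people live in London?")
passage_embs = model.encode([
    "Around 9 million people live in London.",
    "London is known for its financial district.",
])

# Dot-product scores; higher = more relevant passage
print(util.dot_score(query_emb, passage_embs))
```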
🚨State-of-the-art MS MARCO Models
We mined 160 million hard negatives for the MS MARCO dataset and scored them with a CrossEncoder.

We then trained Bi-Encoders on these using MarginMSE Loss (arxiv.org/abs/2010.02666)

Code: sbert.net/examples/train…
Data: huggingface.co/datasets/sente…
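
A minimal MarginMSE training sketch, assuming you already have (query, positive, negative) triplets with CrossEncoder scores (the toy data and base checkpoint below are hypothetical):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, InputExample, losses

# Base checkpoint is an assumption; any transformer encoder works
word_emb = models.Transformer("distilbert-base-uncased", max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# label = CrossEncoder(query, pos) - CrossEncoder(query, neg)
train_examples = [
    InputExample(
        texts=["what is a python", "Python is a programming language.",
               "Pythons are large constricting snakes."],
        label=8.7 - 1.2,
    ),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)

# The bi-encoder's score margin is regressed onto the CrossEncoder's margin
train_loss = losses.MarginMSELoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```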
📂1.2 Billion Training Pairs
Training on large datasets is essential to generalize well across domains and tasks.

Previous models were trained on rather small datasets of a few hundred thousand training pairs and had issues on specialized topics.

We collected 1.2B training pairs from many different sources: Reddit, scientific publications, WikiAnswers, StackExchange, Yahoo Answers, Quora, CodeSearch...

They are shared in a unified format here: huggingface.co/datasets/sente…
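
One common way to train on such (anchor, positive) pairs is MultipleNegativesRankingLoss, which uses the other pairs in a batch as negatives. A sketch, assuming a hypothetical pairs.jsonl.gz in the unified format (one JSON array per line):

```python
import gzip
import json

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, InputExample, losses

# Base checkpoint is an assumption
word_emb = models.Transformer("distilroberta-base", max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# Assumed file layout: each line is a JSON array like ["text a", "text b"]
train_examples = []
with gzip.open("pairs.jsonl.gz", "rt", encoding="utf8") as fIn:
    for line in fIn:
        texts = json.loads(line)
        train_examples.append(InputExample(texts=[texts[0], texts[1]]))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# In-batch negatives: all other pairs in the batch serve as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```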
👫In July, @PatrickPlaten & @psuraj28 organized the JAX community event.

A large group of people joined to train the best embedding models in existence:
discuss.huggingface.co/t/train-the-be…

We collected 1.2B training pairs and trained the models for about a week on a TPU v3-8.
📺How did we train these models?
As part of the event, I gave two talks on how to train these models:

Code: github.com/nreimers/se-py…
Video: Detailed explanation of the training code

Video: Theory of training embedding models
🇺🇳 What about multilingual models?
Improved multilingual models are in progress.

Multilingual Knowledge Distillation (arxiv.org/abs/2004.09813) requires good parallel data; a training sketch follows after the list below.

- Parallel data for sentences: huggingface.co/datasets/sente…
- Parallel data for paragraphs: 🚧 In progress
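
A distillation sketch: a multilingual student learns to mimic the English teacher's embeddings on parallel sentences. Model names and the tab-separated parallel file ("english_sentence<TAB>translation" per line) are assumptions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# English teacher; produces 768-dim embeddings
teacher = SentenceTransformer("paraphrase-distilroberta-base-v1")

# Multilingual student with a matching embedding dimension (768)
word_emb = models.Transformer("xlm-roberta-base", max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
student = SentenceTransformer(modules=[word_emb, pooling])

train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("parallel-sentences-en-de.tsv.gz")  # hypothetical file

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)

# The student learns to map both languages onto the teacher's embeddings
train_loss = losses.MSELoss(model=student)
student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```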
