Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

Nils Reimers

@Nils_Reimers

Sep 8, 2021 • 9 tweets • 7 min read • Read on X

Scrolly

🚨Model Alert🚨
🏋️‍♂️ State-of-the-art sentence & paragraph embedding models
🍻State-of-the-art semantic search models
🔢State-of-the-art on MS MARCO for dense retrieval
📂1.2B training pairs corpus
👩‍🎓215M Q&A-training pairs
🌐Everything Available: SBERT.net
🧵

@huggingface

🚨 All-Purpose Sentence & Paragraph Embeddings Models
As part of the JAX community week from @huggingface we collected a corpus of 1.2 billion training pairs => Great embeddings for sentences & paragraphs

Models: sbert.net/docs/pretraine…
Training Data: huggingface.co/datasets/sente…

🚨Semantic Search Models
Performing well on out-of-domain data is challenging for neural retrieval models. By training on 215 million (question, answer)-pairs, we get models that generalize well across domains.

Models: sbert.net/docs/pretraine…
Data: huggingface.co/sentence-trans…

🚨State-of-the-art MS MARCO Models
We mined 160 million hard negatives for the MS MARCO dataset and scored them with a CrossEncoder.

We then trained Bi-Encoders on these using MarginMSE Loss (arxiv.org/abs/2010.02666)

Code: sbert.net/examples/train…
Data: huggingface.co/datasets/sente…

📂1.2 Billion Training Pairs
Training on large datasets is essential to generalize well across domains and tasks.

Previous models were trained on rather small datasets of a few 100k train pairs and had issues on specialized topics.

We collected 1.2B training pairs from ...

... different sources: Reddit, Scientific Publications, WikiAnswers, StackExchange, Yahoo Answers, Quora, CodeSearch...

They are shared in a unified format here: huggingface.co/datasets/sente…

@PatrickPlaten

👫In July, @PatrickPlaten & @psuraj28 organized the JAX community event.

A large group of people joined to train the best embedding models in existence:
discuss.huggingface.co/t/train-the-be…

We collected 1.2B training pairs and trained the models for about a week on a TPU-v3-8

📺How did we train these models?
As part of the event I gave two talks how to train these models:

Code: github.com/nreimers/se-py…
Video: Detailed explanation about the training code

Video: Theory on training embedding models:

🇺🇳 What about multilingual models?
Improved multilingual models are in progress.

Multilingual Knowledge Distillation (arxiv.org/abs/2004.09813) requires good parallel data.

- Parallel data for sentences: huggingface.co/datasets/sente…
- Parallel data for paragraphs: 🚧 In progress

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @Nils_Reimers

Nils Reimers

@Nils_Reimers

Aug 1, 2025

𝐄𝐧𝐝𝟐𝐄𝐧𝐝 𝐕𝐢𝐬𝐢𝐨𝐧-𝐑𝐀𝐆 𝐰𝐢𝐭𝐡 𝐂𝐨𝐡𝐞𝐫𝐞

Our data is multi-modal 🖼️, but most RAG pipelines are still text-only.

This causes massive problems with complex visual information.

With Cmd-A-Vision from @cohere you now get a sota vision model for Vision-RAG

Traditional Text-RAG tries to converts images to markdown. However, it looses a lot of the rich information represented in the image 😡

Vision-RAG skips these issues, by operating on the vision domain end2end. No more issues with faulty PDF2Mardown / Image2Markdown.

Read 6 tweets

Nils Reimers

@Nils_Reimers

Jul 3, 2024

𝐒𝐞𝐦𝐚𝐧𝐭𝐢𝐜 𝐒𝐞𝐚𝐫𝐜𝐡 𝐨𝐧 𝟏𝟎𝟎𝐌 𝐝𝐨𝐜𝐬 - 𝐖𝐢𝐭𝐡 𝟏𝟎𝟎𝐌𝐁 𝐨𝐟 𝐌𝐞𝐦𝐨𝐫𝐲

GPU-poor and Memory-poor, and not having 500GB of memory to embed & index 100M docs?

Still want to participate at TREC-RAG 2024?

Introducing 𝐃𝐢𝐬𝐤𝐕𝐞𝐜𝐭𝐨𝐫𝐈𝐧𝐝𝐞𝐱

𝐕𝐞𝐜𝐭𝐨𝐫 𝐒𝐞𝐚𝐫𝐜𝐡 𝐟𝐨𝐫 𝐌𝐞𝐦𝐨𝐫𝐲-𝐏𝐨𝐨𝐫
100M embeddings with 1024 dimensions in float32 requires 381GB. Adding an HNSW vector index, you quickly need 500 GB of memory.

How do you make it available to the Memory-Poor?

𝐕𝐞𝐜𝐭𝐨𝐫 𝐂𝐨𝐦𝐩𝐫𝐞𝐬𝐬𝐢𝐨𝐧
Step 1 is compressing your vectors with Product Quantization (PQ), reducing the size from 4096 bytes to just 128 bytes.

Cohere Embed V3 models was trained to work extremely well with vector compression, including int8, binary & PQ.

Read 8 tweets

Nils Reimers

@Nils_Reimers

Mar 13, 2024

🇺🇳𝟐𝟓𝟎𝐌 𝐖𝐢𝐤𝐢𝐩𝐞𝐝𝐢𝐚 𝐄𝐦𝐛𝐞𝐝𝐝𝐢𝐧𝐠𝐬 𝐢𝐧 𝟑𝟎𝟎+ 𝐋𝐚𝐧𝐠𝐮𝐚𝐠𝐞𝐬 🇺🇳

What could you build if your RAG has access to Wikipedia in all 300+ languages?

Available for anyone to use, using our state-of-the-art multilingual embedding model:
huggingface.co/datasets/Coher…

@pinecone showed that RAG makes LLMs better. The more data LLMs can retrieve from, the better (higher faithfulness = more factually correct).

But access to large retrieval datasets so far is challenging😡

We are here to change this 🚀

pinecone.io/blog/rag-study/

This week, @cohere published the 35B Command-R model, a super efficient LLM with 128k context length optimized for production RAG work loads on 10 languages.

It achieves superb results for Knowledge Intensive Language Tasks (KILT).

More:
txt.cohere.com/command-r/

Read 5 tweets

Nils Reimers

@Nils_Reimers

Dec 16, 2022

🔎Semantic Search Hackathon🔍

Today I will kick-off a 7-day virtual hackathon focused on semantic search.

A special focus will be our multilingual embedding model, which makes multilingual search 10x easier while giving you way better results.

Details:
lablab.ai/event/semantic…

Most search functions are rather useless😰

Search on Wikipedia for "the capital of the United States" and the capital punishment article is ranked first. The article for Washington D.C. is not even among the top-20 results 👎

Semantic search can make your search so much better

Semantic search works by embedding text into a vector space. A search query is mapped to the same space and close points are the most relevant docs for the query.

It gives you a search function that actually works.

Read 8 tweets

Nils Reimers

@Nils_Reimers

Dec 12, 2022

🇺🇳Semantic Search finally works across languages! 🇺🇳

Semantic Search gives great search results, but worked so far just for English😰

Glad to share our new cohere multilingual embedding model for 100+ languages. And the results are amazing 📈

Details:
txt.cohere.ai/multilingual/

The usage is extremely simple:
- Sign-up for a free dev API key: dashboard.cohere.ai/welcome/regist…
- Install the SDK: pip install cohere
- Call co.embed with the new model

Lexical search for multilingual data is painful 🤬
- Different langs require different tokenizers, stop words, stemmers
- Each languages ends up in its own index
- You need lang. identification for queries & docs
- Common platforms like Elasticsearch only support few languages 😰

Read 9 tweets

Nils Reimers

@Nils_Reimers

Mar 2, 2022

🧑‍🏫How to adapt text embedding models to a domain?

😟Text embedding models perform poorly on unseen domains
❓How to encode words you have never seen?

🎉Adaptive pre-training and Generative Pseudo Labeling can help

A 🧵 with methods & results

😟Text embeddings models perform often poorly on unseen domains.

The issue is that they don't know what certain words mean and how to represent them in the vector space.

If you have never seen the word BERT, how to know that it is connected to deep learning & NLP?

🏋️‍♀️Option 1: Adaptive Pre-Training
- You pre-train on your target domain
- You fine-tune on labeled data e.g. from huggingface.co/datasets/sente…

The issue: Fine-tuning on labeled data can be expensive, especially for large datasets.

Read 6 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

Nils Reimers

Try unrolling a thread yourself!

More from @Nils_Reimers

Nils Reimers

Nils Reimers

Nils Reimers

Nils Reimers

Nils Reimers

Nils Reimers

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!