Nils Reimers
Director of Machine Learning @Cohere | ex-huggingface | Creator of SBERT (https://t.co/MKKOMfuQ4C)
Aug 1, 2025 • 6 tweets • 2 min read
End2End Vision-RAG with Cohere

Our data is multi-modal 🖼️, but most RAG pipelines are still text-only.

This causes massive problems with complex visual information.

With Cmd-A-Vision from @cohere you now get a state-of-the-art vision model for Vision-RAG.

Traditional text-RAG tries to convert images to markdown. However, it loses a lot of the rich information represented in the image 😡
Jul 3, 2024 • 8 tweets • 3 min read
๐’๐ž๐ฆ๐š๐ง๐ญ๐ข๐œ ๐’๐ž๐š๐ซ๐œ๐ก ๐จ๐ง ๐Ÿ๐ŸŽ๐ŸŽ๐Œ ๐๐จ๐œ๐ฌ - ๐–๐ข๐ญ๐ก ๐Ÿ๐ŸŽ๐ŸŽ๐Œ๐ ๐จ๐Ÿ ๐Œ๐ž๐ฆ๐จ๐ซ๐ฒ

GPU-poor and memory-poor, and don't have 500GB of memory to embed & index 100M docs?

Still want to participate in TREC-RAG 2024?

Introducing DiskVectorIndex

Vector Search for the Memory-Poor
100M embeddings with 1024 dimensions in float32 require 381GB. Adding an HNSW vector index, you quickly need 500GB of memory.

How do you make it available to the Memory-Poor?
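The arithmetic behind these numbers is easy to check, a quick sketch:

```python
# Back-of-the-envelope check of the figures above (1 GiB = 2**30 bytes).
n_docs = 100_000_000      # 100M documents
dims = 1024               # embedding dimensions
bytes_per_value = 4       # float32

raw_gib = n_docs * dims * bytes_per_value / 2**30
print(f"{raw_gib:.0f} GiB")  # -> 381 GiB for the raw embeddings alone
```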
Mar 13, 2024 • 5 tweets • 3 min read
🇺🇳 250M Wikipedia Embeddings in 300+ Languages 🇺🇳

What could you build if your RAG has access to Wikipedia in all 300+ languages?

Available for anyone to use, using our state-of-the-art multilingual embedding model:
huggingface.co/datasets/Coher…
@pinecone showed that RAG makes LLMs better. The more data LLMs can retrieve from, the better (higher faithfulness = more factually correct).

But access to large retrieval datasets has so far been challenging 😡

We are here to change this 🚀

pinecone.io/blog/rag-study/
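At this scale (250M rows), streaming the dataset is the practical way to consume it. A minimal sketch, assuming the Hugging Face `datasets` library; the dataset id below is illustrative, not the exact one behind the truncated link above:

```python
# Hedged sketch: stream embeddings row by row instead of downloading 250M
# rows up front. Requires `pip install datasets`. "Cohere/wikipedia-embeddings"
# is an illustrative dataset id.
def stream_wikipedia_embeddings(dataset_id="Cohere/wikipedia-embeddings", n=10):
    from datasets import load_dataset  # imported lazily
    rows = load_dataset(dataset_id, split="train", streaming=True)
    return [row for _, row in zip(range(n), rows)]
```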
Dec 16, 2022 • 8 tweets • 4 min read
🔎Semantic Search Hackathon🔍

Today I will kick-off a 7-day virtual hackathon focused on semantic search.

A special focus will be our multilingual embedding model, which makes multilingual search 10x easier while giving you way better results.

Details:
lablab.ai/event/semantic…

Most search functions are rather useless 😰

Search on Wikipedia for "the capital of the United States" and the capital punishment article is ranked first. The article for Washington D.C. is not even among the top-20 results 👎

Semantic search can make your search so much better
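The Wikipedia example can be made concrete with a tiny cosine-similarity sketch (the 3-d vectors are hand-picked stand-ins for real embeddings, which have hundreds of dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hand-picked toy "embeddings" for illustration only.
docs = {
    "Washington, D.C.":   [0.9, 0.1, 0.2],  # about the capital city
    "Capital punishment": [0.1, 0.9, 0.1],  # shares the word "capital" only
}
query = [0.85, 0.15, 0.25]  # "the capital of the United States"

# Rank by semantic similarity instead of term overlap.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # -> Washington, D.C.
```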
Dec 12, 2022 • 9 tweets • 4 min read
🇺🇳Semantic Search finally works across languages! 🇺🇳

Semantic search gives great search results, but so far worked just for English 😰

Glad to share our new Cohere multilingual embedding model for 100+ languages. And the results are amazing 📈

Details:
txt.cohere.ai/multilingual/

The usage is extremely simple:
- Sign up for a free dev API key: dashboard.cohere.ai/welcome/regist…
- Install the SDK: pip install cohere
- Call co.embed with the new model
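The three steps can be sketched as follows. This is a hedged sketch: it assumes the Cohere Python SDK's v1 `Client`, an API key in the `COHERE_API_KEY` environment variable, and `multilingual-22-12` as the model name from this announcement.

```python
import os

def embed_texts(texts, model="multilingual-22-12"):
    # Lazy import so the sketch loads without the SDK installed (`pip install cohere`).
    import cohere
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    return co.embed(texts=texts, model=model).embeddings  # one vector per input text
```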
Mar 2, 2022 • 6 tweets • 4 min read
🧑‍🏫How to adapt text embedding models to a domain?

😟Text embedding models perform poorly on unseen domains
❓How to encode words you have never seen?

🎉Adaptive pre-training and Generative Pseudo Labeling can help

A 🧵 with methods & results

😟Text embedding models often perform poorly on unseen domains.

The issue is that they don't know what certain words mean and how to represent them in the vector space.

If you have never seen the word BERT, how would you know that it is connected to deep learning & NLP?
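A toy illustration of the Generative Pseudo Labeling (GPL) recipe: generate synthetic queries for in-domain passages, mine negatives, and label score margins with a cross-encoder. Every model here is swapped for a trivial stub; all names are illustrative.

```python
import random

def toy_query_generator(passage):
    # Stand-in for a seq2seq query generator (GPL uses a T5-style model).
    return [f"what is {passage.split()[0].lower()}"]

def toy_cross_encoder(query, passage):
    # Stand-in for a real cross-encoder: simple word-overlap score in [0, 1].
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def gpl_examples(passages, seed=0):
    """Build (query, positive, negative, margin) tuples: the GPL training
    signal for fine-tuning a bi-encoder with a MarginMSE-style loss."""
    rng = random.Random(seed)
    out = []
    for pos in passages:
        for query in toy_query_generator(pos):
            neg = rng.choice([p for p in passages if p != pos])
            margin = toy_cross_encoder(query, pos) - toy_cross_encoder(query, neg)
            out.append((query, pos, neg, margin))
    return out
```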
Feb 11, 2022 • 6 tweets • 4 min read
🎉Sentence-Transformers v2.2.0 released:

💬 T5 added for computing embeddings
💾 Sentence-T5 and GTR models in PyTorch on @huggingface model hub
🔒 Loading private models from @huggingface hub

github.com/UKPLab/sentenc…

Sentence-T5 (arxiv.org/abs/2108.08877) and GTR (arxiv.org/abs/2112.07899) are two recent dense embedding models by @GoogleAI trained on 2B community Q&A pairs. I converted these models to PyTorch:
huggingface.co/sentence-trans…
huggingface.co/sentence-trans…
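A hedged sketch of loading one of the converted checkpoints; `sentence-transformers/sentence-t5-base` is one of the released sizes, and weights download on first use.

```python
def encode_sentences(sentences, model_name="sentence-transformers/sentence-t5-base"):
    # Lazy import; requires `pip install sentence-transformers` (v2.2.0+ for T5 support).
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return model.encode(sentences)  # one embedding vector per sentence
```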
Jan 28, 2022 • 7 tweets • 4 min read
GPT-3 Embeddings by @OpenAI were announced this week.

📈 I was excited and tested them on 20 datasets
😢 Sadly they are worse than open models that are 1000x smaller
💰 Running @OpenAI models can be 1 million times more expensive

tinyurl.com/gpt3-emb

I tested the text similarity models on 14 datasets from different domains (emails, papers, online communities) on various tasks (clustering, retrieval, paraphrase mining).

The 175B model is actually worse than a tiny MiniLM 22M parameter model that can run in your browser.
Sep 8, 2021 • 9 tweets • 7 min read
🚨Model Alert🚨
🏋️‍♂️ State-of-the-art sentence & paragraph embedding models
State-of-the-art semantic search models
🔢 State-of-the-art on MS MARCO for dense retrieval
📂 1.2B training pairs corpus
👩‍🎓 215M Q&A training pairs
🌍 Everything available: SBERT.net
🧵

🚨 All-Purpose Sentence & Paragraph Embedding Models
As part of the JAX community week from @huggingface we collected a corpus of 1.2 billion training pairs => Great embeddings for sentences & paragraphs

Models: sbert.net/docs/pretraine…
Training Data: huggingface.co/datasets/sente…
Jun 22, 2021 • 5 tweets • 2 min read
📺How to train state-of-the-art sentence embeddings? 📺

Just uploaded my 3-part video series on the theory of how to train state-of-the-art sentence embedding models:
📺 Part 1 - Applications & Definition
- Why do we need dense representations?
- Definition of dense representations
- What does "semantically similar" mean?
- Applications: Clustering, Search, Zero- & Few-Shot Classification...

Jan 27, 2021 • 4 tweets • 3 min read
New Project: EasyNMT (github.com/UKPLab/EasyNMT)

Easy-to-use, state-of-the-art Neural Machine Translation using @huggingface and @fairseq.

- Translation for 150+ languages
- Sentence & document translation
- Automatic Language Detection

Colab example: colab.research.google.com/drive/1X47vgSi…

Currently 4 state-of-the-art models are supported:
- OPUS-MT models from @HelsinkiNLP (individual models for 150+ languages)
- mBART50 many-to-many translation for 50 langs from @facebookai
- m2m_100 many-to-many translation for 100 langs from @facebookai (418M and 1.2B version)
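Usage can be sketched as follows (a hedged sketch: assumes `pip install easynmt`, and the model names mirror the list above):

```python
def translate_texts(texts, target_lang="en", model_name="opus-mt"):
    # Lazy import; requires `pip install easynmt`. Model weights download on
    # first use; the source language is detected automatically.
    from easynmt import EasyNMT
    model = EasyNMT(model_name)  # also: "mbart50_m2m", "m2m_100_418M", "m2m_100_1.2B"
    return model.translate(texts, target_lang=target_lang)
```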