📺 How to train state-of-the-art sentence embeddings? 📺
Just uploaded my 3-part video series on the theory of how to train state-of-the-art sentence embedding models:
📺 Part 1 - Applications & Definition
- Why do we need dense representations?
- Definition of dense representation
- What does "semantically similar" mean?
- Applications: Clustering, Search, Zero- & Few-Shot-Classification...
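To make the Part 1 idea concrete, here is a minimal sketch of a dense representation in action: sentences are mapped to vectors and compared with cosine similarity. It assumes the sentence-transformers package and uses 'all-MiniLM-L6-v2' purely as an example model name (not something prescribed in the videos):

```python
# Minimal sketch: encode sentences into dense vectors and rank them
# by cosine similarity against a query. Model name is an example choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = [
    "A man is eating food.",
    "The new movie is awesome.",
    "A cheetah chases its prey across a field.",
]
query = "Someone is having a meal."

# Encode query and corpus into dense vectors
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Semantically similar sentences get high cosine similarity scores
scores = util.cos_sim(query_emb, corpus_emb)[0]
for sent, score in zip(corpus, scores):
    print(f"{float(score):.3f}  {sent}")
```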
📺 Part 2 - Training
- Basic Training Setup
- Loss-Functions: Contrastive Loss, Triplet Loss, Batch Hard Triplet Loss, Multiple Negatives Ranking Loss
- Training with hard negatives for semantic search
- Mining of hard negatives
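As a rough illustration of the basic training setup from Part 2, here is a sketch using sentence-transformers with Multiple Negatives Ranking Loss. The base model name and the toy (anchor, positive) pairs are placeholders, not the data used in the videos:

```python
# Sketch of the basic training setup: (anchor, positive) pairs trained with
# Multiple Negatives Ranking Loss, where the other positives in the batch
# act as negatives. Base model and data below are placeholders.
from sentence_transformers import SentenceTransformer, models, InputExample, losses
from torch.utils.data import DataLoader

word_emb = models.Transformer('distilbert-base-uncased', max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "Steps to recover your account password"]),
    InputExample(texts=["Best pizza in New York",
                        "Top-rated pizzerias in NYC"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Larger batches give more in-batch negatives and usually better embeddings
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
```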
📺 Part 3 - Advanced Training
- Multilingual Text Embeddings
- Data Augmentation with Cross-Encoders
- Unsupervised Text Embedding Learning
- Pre-Training Methods for dense representations
- Neural Search
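One of the Part 3 techniques, data augmentation with cross-encoders, can be sketched like this: a slower but more accurate cross-encoder scores unlabeled sentence pairs, and the resulting "silver" labels become training data for the sentence embedding (bi-encoder) model. The cross-encoder model name and the example pairs are placeholders:

```python
# Sketch of cross-encoder data augmentation: score unlabeled pairs with a
# cross-encoder, then reuse the scores as silver labels for bi-encoder training.
from sentence_transformers import CrossEncoder, InputExample

cross_encoder = CrossEncoder('cross-encoder/stsb-roberta-base')  # example model

unlabeled_pairs = [
    ("A plane is taking off.", "An air plane is taking off."),
    ("A man is playing a flute.", "A man is playing the piano."),
]

# Score each pair with the cross-encoder
silver_scores = cross_encoder.predict(unlabeled_pairs)

# Turn the scored pairs into training examples for the bi-encoder
silver_examples = [
    InputExample(texts=[s1, s2], label=float(score))
    for (s1, s2), score in zip(unlabeled_pairs, silver_scores)
]
print(silver_examples[0])
```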
Interested in actual code examples? Check the docs at sbert.net
On the machine translation side, EasyNMT currently supports the following state-of-the-art models:
- OPUS-MT models from @HelsinkiNLP (individual models for 150+ languages)
- mBART50 many-to-many translation for 50 languages from @facebookai
- M2M_100 many-to-many translation for 100 languages from @facebookai (418M and 1.2B versions)
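All of these sit behind one interface. A minimal sketch, assuming the easynmt package is installed; 'opus-mt' is just one of the supported model choices, and automatic source-language detection may need an extra dependency such as fasttext or langid:

```python
# Minimal EasyNMT sketch: load one supported model and translate sentences.
# 'opus-mt' is one example; 'mbart50_m2m' or 'm2m_100_418M' work the same way.
from easynmt import EasyNMT

model = EasyNMT('opus-mt')

sentences = ["Dies ist ein Satz auf Deutsch.",
             "Ceci est une phrase en français."]

# Source language is detected automatically (requires a language-detection package)
print(model.translate(sentences, target_lang='en'))
```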
Document translation: Transformer-based models have a length limit of 512 / 1024 word pieces.
EasyNMT can translate documents of any length by splitting them into smaller chunks, translating each chunk, and then reconstructing the full document.
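A short sketch of that behavior, again assuming the easynmt package; the long document below is synthetic and the chunking happens inside translate():

```python
# Sketch: translating a document far beyond the 512/1024 word-piece limit.
# EasyNMT splits it into smaller chunks internally, translates them,
# and reassembles the output.
from easynmt import EasyNMT

model = EasyNMT('opus-mt')  # example model choice

long_document = " ".join(["Dies ist ein sehr langer Beispieltext."] * 500)

translated = model.translate(long_document, source_lang='de', target_lang='en')
print(translated[:200])
```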