Maarten Grootendorst
Feb 14, 2023 · 8 tweets
The v0.14 release of BERTopic is here 🥳 Fine-tune your topic keywords and labels with models from @OpenAI, @huggingface, @CohereAI, @spacy_io, and @LangChainAI.

Use models for part-of-speech tagging, text generation, zero-shot classification, and more!

An overview thread👇🧵
Use OpenAI's or Cohere's GPT models to suggest topic labels. Only a single API call is needed per topic, significantly reducing costs by focusing on representative documents and keywords. You can even perform prompt engineering by customizing the prompts.
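As a minimal sketch of what this looks like in code: the exact `OpenAI` constructor arguments and key setup differ between BERTopic versions, and the prompt below is purely illustrative (the `[KEYWORDS]` and `[DOCUMENTS]` placeholders are filled in per topic by BERTopic).

```python
import openai
from bertopic import BERTopic
from bertopic.representation import OpenAI

openai.api_key = "sk-..."  # your own key; newer versions pass a client instead

# Prompt engineering: [KEYWORDS] and [DOCUMENTS] are replaced per topic
prompt = (
    "I have a topic described by the keywords: [KEYWORDS]. "
    "The topic contains these documents: [DOCUMENTS]. "
    "Give a short, human-readable label for this topic."
)

representation_model = OpenAI(prompt=prompt)
topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)  # docs: your list of strings
```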
Use a KeyBERT-inspired model to further fine-tune the topic keywords. It makes use of c-TF-IDF to generate candidate keywords and representative documents from which to extract the improved topic keywords. It borrows many ideas from KeyBERT but optimizes them for topic generation.
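In practice this is a sketch of swapping in a single representation model:

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# Re-rank c-TF-IDF candidate keywords against each topic's representative documents
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)
topics, probs = topic_model.fit_transform(docs)
```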
Apply POS tagging with spaCy to improve the topic keywords. We leverage c-TF-IDF to perform POS tagging on a subset of representative keywords and documents. Customize the POS patterns you are interested in to optimize the extracted keywords.
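A sketch using spaCy Matcher-style patterns; the patterns below are illustrative assumptions that keep adjective–noun phrases, nouns, and adjectives:

```python
from bertopic import BERTopic
from bertopic.representation import PartOfSpeech

# spaCy Matcher patterns describing which keyword shapes to keep
pos_patterns = [
    [{"POS": "ADJ"}, {"POS": "NOUN"}],  # e.g. "neural network"
    [{"POS": "NOUN"}],
    [{"POS": "ADJ"}],
]
representation_model = PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns)
topic_model = BERTopic(representation_model=representation_model)
```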
Use publicly available text-generation models with @huggingface! We pass documents and keywords that are representative of a topic to these models and ask them to generate topic labels. Customizing your prompts can have a huge influence on the output.
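A sketch using a Hugging Face `transformers` pipeline; the model name and prompt are examples, not recommendations:

```python
from transformers import pipeline
from bertopic import BERTopic
from bertopic.representation import TextGeneration

prompt = "I have a topic described by the following keywords: [KEYWORDS]. The topic is about"
generator = pipeline("text2text-generation", model="google/flan-t5-base")

representation_model = TextGeneration(generator, prompt=prompt)
topic_model = BERTopic(representation_model=representation_model)
```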
Diversify the topic keywords with MaximalMarginalRelevance. Although it was already implemented in BERTopic, I felt like it deserved to have its own representation model. It is a quick way to improve the generated keywords!
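Sketch; `diversity` trades off keyword relevance against keyword diversity (0 = no diversification, 1 = maximum diversity):

```python
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance

representation_model = MaximalMarginalRelevance(diversity=0.3)
topic_model = BERTopic(representation_model=representation_model)
```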
We can even chain different models to sequentially fine-tune the topic keywords and/or labels. Here, we first use a KeyBERT-inspired model to create our topics and then diversify the output with MMR. Chain as many models as you want!
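Chaining is done by passing a list of representation models, applied in order; a minimal sketch:

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# Applied left to right: KeyBERT-inspired fine-tuning, then MMR diversification
representation_models = [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)]
topic_model = BERTopic(representation_model=representation_models)
```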
And that is not it! We can perform zero-shot classification on the topic labels, apply LangChain for more LLM customization, and have fun with prompt engineering. Learn more about BERTopic and the new models here: maartengr.github.io/BERTopic/chang…
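As one last sketch, zero-shot classification maps topics onto candidate labels you define yourself; the labels and NLI model below are placeholders:

```python
from bertopic import BERTopic
from bertopic.representation import ZeroShotClassification

candidate_labels = ["space exploration", "sports", "politics"]  # your own labels
representation_model = ZeroShotClassification(candidate_labels,
                                              model="facebook/bart-large-mnli")
topic_model = BERTopic(representation_model=representation_model)
```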

More from @MaartenGr

Feb 11
Did you know we continue to develop new content for the "Hands-On Large Language Models" book?

There's now even a free course available with @DeepLearningAI!
@JayAlammar and I are incredibly proud to bring you this highly animated (and free 😉) course.
There are also guides to common principles like Quantization and Mixture of Experts.
Feb 3
A Visual Guide to Reasoning LLMs 💭

With over 40 custom visuals, explore DeepSeek-R1, the train-time compute paradigm shift, test-time compute techniques, verifiers, STaR, and much more!

Link below
From exploring verifiers for distilling reasoning…
…all the way to DeepSeek-R1(-Zero).
May 31, 2023
Multimodal, multi-aspect, Hugging Face Hub, safetensors, and more in BERTopic v0.15 🔥

Working together with @huggingface on this was a blast!

🤗Blog: huggingface.co/blog/bertopic
🤗Hub Example: huggingface.co/MaartenGr/BERT…
Changelog: maartengr.github.io/BERTopic/chang…

An update thread🧵
Apply textual topic modeling on images with the new update (🖼️ + 🖹, or 🖼️ only)!

Introducing a multimodal CLIP backend that embeds both text and images.

Even when you have only images, you can caption the most representative images of each topic and extract textual representations!
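A sketch of what the multimodal setup could look like; the backend and captioning model names here are assumptions based on the v0.15 announcement, so check the linked changelog for the exact API (`docs` and `images` are your own lists of texts and image paths):

```python
from bertopic import BERTopic
from bertopic.backend import MultiModalBackend
from bertopic.representation import VisualRepresentation

# CLIP backend that can embed both documents and images
embedding_model = MultiModalBackend("clip-ViT-B-32", batch_size=32)

# Caption representative images per topic to get textual representations
representation_model = VisualRepresentation(
    image_to_text_model="nlpconnect/vit-gpt2-image-captioning"
)

topic_model = BERTopic(embedding_model=embedding_model,
                       representation_model=representation_model)
topics, probs = topic_model.fit_transform(documents=docs, images=images)
```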
Easily share your topic models on the Hugging Face Hub! After a great collaboration with @huggingface and inspired by the work at github.com/opinionscience…, you can now load and share pre-trained BERTopic models from the 🤗 Hub.

Try it out yourself:

huggingface.co/MaartenGr/BERT…
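A sketch of saving and loading via the Hub; the repository id below is a placeholder and the serialization options may vary by version:

```python
from bertopic import BERTopic

# Push a trained model to the Hugging Face Hub (repo id is hypothetical)
topic_model.push_to_hf_hub(repo_id="my-username/my-bertopic-model",
                           save_ctfidf=True)

# Anyone can then load it back without retraining
loaded_model = BERTopic.load("my-username/my-bertopic-model")
```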
Dec 28, 2022
Final Preview: Outlier Reduction!

In the upcoming release of BERTopic, it will be possible to perform outlier reduction! Easily explore several strategies for outlier reduction after training your topic model. A flexible and modular approach!

A preview thread👇🧵
Strategy #1
The first strategy to reduce outliers is by making use of the soft-clustering capabilities of HDBSCAN. We find the best matching topic for each outlier document by looking at the topic-document probabilities generated by HDBSCAN.
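A sketch of this strategy, assuming the `reduce_outliers` API as it later shipped; `calculate_probabilities=True` is needed so HDBSCAN's soft-clustering probabilities are available:

```python
from bertopic import BERTopic

topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Re-assign outlier documents (-1) to their most probable topic
# using HDBSCAN's soft-clustering probabilities
new_topics = topic_model.reduce_outliers(docs, topics,
                                         probabilities=probs,
                                         strategy="probabilities")
```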
Strategy #2
The newly added `.approximate_distribution` allows us to generate topic distributions for each document, even outlier documents. As such, we can use those topic distributions to assign outlier documents to non-outlier topics.
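A sketch of this second strategy under the same assumptions:

```python
# Inspect topic distributions per document (works for outlier documents too)
topic_distr, _ = topic_model.approximate_distribution(docs)

# Re-assign outliers based on document-topic distributions
new_topics = topic_model.reduce_outliers(docs, topics, strategy="distributions")
```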
