Marco Giordano
Feb 12, 2022 · 21 tweets · 4 min read
One of the most useful #NLP libraries for #SEO in #Python is certainly BERTopic.

In this thread I will show you its benefits and why it's so powerful and simple to use 🧵
BERTopic is an easy, convenient way to use advanced language models without writing much code.

That's why it's so powerful and reliable.
Although this library wasn't built with SEO in mind, it's clearly super versatile for us.

It flattens the steep learning curve that these topics have.

We're focusing on the implementation itself rather than the theory.
This doesn't mean you can skip studying the models! You should develop a high-level understanding of how they work and of their parameters.

You're very unlikely to get good results without tuning your models.
If you are like me and want to focus on NLP and Data Science, this is the right way to go. Transformers and other recent models are far better than older ones because they can capture the meaning of words, not just their spelling.

This is not possible with a traditional clustering technique.
Some terms you need to know are:

Embeddings - think of them as words written in math language, i.e. as vectors

Topic modelling - identifying topics in a set of documents

Transformers - Deep Learning models based on attention

These are very broad definitions to get you started, so do your own research.

The idea here is to give you the minimum you need to get started with BERTopic.
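To make the "embeddings are vectors" idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration; real embeddings come from a model and have hundreds of dimensions. It only shows why vectors help: related phrases end up pointing in similar directions, which cosine similarity measures.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (NOT produced by any model).
toy_embeddings = {
    "cheap flights": [0.9, 0.1, 0.0],
    "low cost airfare": [0.8, 0.2, 0.1],
    "chocolate cake recipe": [0.0, 0.1, 0.9],
}

related = cosine_similarity(
    toy_embeddings["cheap flights"], toy_embeddings["low cost airfare"]
)
unrelated = cosine_similarity(
    toy_embeddings["cheap flights"], toy_embeddings["chocolate cake recipe"]
)

# The two travel queries share no words, yet their vectors are close.
print(related > unrelated)
```

With real embeddings the principle is the same: queries that share no words can still land close together, and that is what makes semantic clustering possible.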
You can find all you need at this link; just follow the instructions.

Yes, you can apply this idea to GSC (Google Search Console) data too! I am working on it myself. It just takes time to clean the data properly, which is very hard in some niches.

maartengr.github.io/BERTopic/getti…
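For reference, a minimal sketch of what the basic usage looks like, assuming `bertopic` is installed (`pip install bertopic`) and that `docs` is a plain list of strings such as queries or page titles. The `fit_topics` helper and its `min_docs` guard are my own illustration and not part of the library; the threshold of 100 is arbitrary.

```python
def fit_topics(docs, min_docs=100):
    """Fit BERTopic with default settings and return (model, topics).

    `min_docs` is an arbitrary safety guard for this sketch, not a BERTopic
    rule: with only a handful of documents the results are rarely useful.
    """
    if len(docs) < min_docs:
        raise ValueError(
            f"Got {len(docs)} documents, expected at least {min_docs}."
        )

    # Imported here so the guard above works even without the library installed.
    from bertopic import BERTopic

    topic_model = BERTopic()
    # fit_transform returns one topic id per document (-1 means outlier)
    # plus the associated probabilities.
    topics, _probabilities = topic_model.fit_transform(docs)
    return topic_model, topics

# Usage (uncomment once you have a real list of documents):
# topic_model, topics = fit_topics(my_queries)
# print(topic_model.get_topic_info().head())   # one row per topic
# topic_model.visualize_topics()               # interactive topic map
```

The first run downloads an embedding model, so expect it to be slow, especially without a GPU.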
Visualizing topics is a great way to spot similarity among clusters. This is crucial for large websites or when you have no clue what a new domain is about.

Use this info as a hint on what to improve topically and to gauge your topical authority.
However, remember that processing all that GSC data is computationally expensive even for medium websites, let alone big ones!

There are plenty of topic modelling techniques and you have to get a basic understanding of transformers.
You have way too many options at first. Just go through the docs and apply what you can. It will take time but it's totally worth it.

Alternatively, you can check this Medium article by the author.
It walks through a "manual" implementation of some of the features.

towardsdatascience.com/topic-modeling…
You can just use the library and treat the article in the tweet above as practice.

If you're not happy with the models available, you can always load your own.
BERTopic is an excellent way to carry out complex tasks without writing tons of code. Give it a try on Google Colab, totally worth it!
For the pros:

- Variety of models + custom
- Short and neat code
- Plenty of options
This library has no particular cons of its own; the drawbacks mostly come from the algorithms:

- often computationally expensive without a proper setup
- using models without tuning is a waste of time

Again, this is not a problem of the library. Studying models is your task!
BERTopic works well with spaCy too, one of the best and most famous NLP libraries in Python.

Python is a good compromise to create your own tools to validate ideas and automate boring stuff.
This library improves your workflow by adding semantic clustering and the possibility to evaluate content networks (given proper tuning).

Data cleaning is the most important step, be sure to take out the trash!
Using this for keyword research is a great idea too. Sometimes you can switch to querycat, which relies on association rule learning instead.

The reason is that you don't always need to rely on semantics, so don't overcomplicate simple tasks.
I am not the biggest fan of traditional clustering techniques for keywords.

I usually go with either querycat or transformers. As mentioned before, the latter can be super slow with some models and for some datasets.

Be sure to filter out useless keywords!
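Here is a minimal sketch of what "filtering out useless keywords" can look like before clustering. The rules (lowercasing, collapsing whitespace, a minimum length, a blocklist of terms) and the example brand "acme" are assumptions for illustration; what counts as useless depends on your niche.

```python
import re

def clean_keywords(keywords, min_chars=3, blocklist=()):
    """Normalize keywords, then drop duplicates, very short ones and blocked terms."""
    seen = set()
    cleaned = []
    for keyword in keywords:
        # Lowercase, trim, and collapse repeated whitespace.
        keyword = re.sub(r"\s+", " ", keyword.strip().lower())
        if len(keyword) < min_chars:
            continue  # too short to carry meaning
        if any(term in keyword for term in blocklist):
            continue  # e.g. branded or navigational queries
        if keyword in seen:
            continue  # duplicate after normalization
        seen.add(keyword)
        cleaned.append(keyword)
    return cleaned

raw = ["  Cheap Flights ", "cheap   flights", "a", "acme login", "Hotels in Rome"]
print(clean_keywords(raw, blocklist=("acme",)))
# prints ['cheap flights', 'hotels in rome']
```

Fewer, cleaner keywords also mean less work for the slow transformer models.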
You have to play a little bit with how many topics you want and the n-grams.

Some models may perform better in certain scenarios, study what's more suitable and practice.

At least you don't have the burden of writing a lot of code, though.
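As a rough sketch of the knobs mentioned above: the BERTopic constructor takes `nr_topics` (how many topics to keep, or "auto"), `n_gram_range` (the word n-grams used to describe topics) and `embedding_model` (your own model, for example a sentence-transformers model name). The two helper functions and the default values below are my own illustration, not recommendations from the library.

```python
def topic_model_kwargs(nr_topics="auto", n_gram_range=(1, 2), embedding_model=None):
    """Collect and sanity-check the settings before building the model."""
    low, high = n_gram_range
    if not (1 <= low <= high):
        raise ValueError("n_gram_range must satisfy 1 <= min <= max")
    kwargs = {"nr_topics": nr_topics, "n_gram_range": (low, high)}
    if embedding_model is not None:
        # e.g. "all-MiniLM-L6-v2", a sentence-transformers model name
        kwargs["embedding_model"] = embedding_model
    return kwargs

def build_topic_model(**options):
    """Create a BERTopic instance from the checked settings."""
    from bertopic import BERTopic  # requires `pip install bertopic`
    return BERTopic(**topic_model_kwargs(**options))

print(topic_model_kwargs(nr_topics=20, n_gram_range=(1, 3)))
# prints {'nr_topics': 20, 'n_gram_range': (1, 3)}
```

Try a couple of combinations on the same cleaned dataset and compare the topics you get; that is usually quicker than guessing the right values upfront.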
This thread is not supposed to be super technical, although I may well write a long article on the topic in the near future.
