Frédéric Dubut @CoperniX
The slides of our #TechSEOBoost presentation "Search and Spam Fighting in the Age of Deep Learning" are live! All the research papers we referenced should be easy to find through @Bing but feel free to ping me if you need direct links. #SEO #TechnicalSEO slideshare.net/CatalystDigita…
Actually I'm thinking it will be more fun if I share the links directly in thread over the next couple of days, along with some commentary 🙂. Starting with "A Statistical Approach to Mechanized Encoding and Searching of Literary Information" by H. P. Luhn web.stanford.edu/class/linguist…
This paper from 1957 introduces the concept of Term Frequency (a.k.a. TF), the simple (yet powerful) idea that the more often a keyword appears in a document, the more relevant that document is in relation to that keyword. #InformationRetrieval #SEO #TechnicalSEO #HistoryOfSearch
Fun fact: the original definition of TF was not normalized by doc length. In Luhn's world of corporate knowledge bases (he worked at @IBM) it was viable but all modern variations of TF have at least some smoothing, normalization or upper bound. #SEO #TechnicalSEO #HistoryOfSearch
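The two flavors of TF above can be sketched in a few lines of Python. This is a minimal illustration, not any engine's actual implementation; the function name and whitespace tokenization are my own:

```python
from collections import Counter

def term_frequency(term, doc_tokens, normalize=True):
    """Raw count of `term` in the document, optionally
    normalized by document length (the modern variant)."""
    tf = Counter(doc_tokens)[term]
    if normalize and doc_tokens:
        return tf / len(doc_tokens)
    return tf

doc = "the cat sat on the mat with the cat".split()
term_frequency("cat", doc, normalize=False)  # Luhn-style raw count: 2
term_frequency("cat", doc)                   # length-normalized: 2/9
```

With the raw count, a long verbose page trivially outscores a short focused one, which is exactly why modern variants normalize or bound it.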
Now jumping to 1972 and "A statistical interpretation of term specificity and its application in retrieval" by Karen Spärck Jones. This is the introduction of the fundamental concept of Inverse Document Frequency (a.k.a. IDF). citeseer.ist.psu.edu/viewdoc/summar…
IDF captures the idea that the more common a keyword is in the doc corpus, the less predictive of relevance it is. Obvious examples include stopwords (e.g. "the") but it generalizes very well to all words in the corpus. #InformationRetrieval #SEO #TechnicalSEO #HistoryOfSearch
The canonical IDF formula is basically the log of inverse freq (number of docs divided by number of docs that contain the keyword), slightly adjusted for corner cases. #InformationRetrieval #SEO #TechnicalSEO #HistoryOfSearch
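The canonical formula is short enough to sketch directly. The add-one smoothing below is one common way to handle the corner cases (the exact adjustment varies by implementation):

```python
import math

def idf(term, docs):
    """log of inverse document frequency: log(N / df),
    with add-one smoothing to guard the corner cases
    (unseen terms, terms present in every doc)."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)  # docs containing the term
    return math.log((n + 1) / (df + 1))

docs = ["the cat sat".split(), "the dog ran".split(), "the fish swam".split()]
idf("the", docs)  # appears everywhere -> IDF of 0, no predictive value
idf("cat", docs)  # rare -> positive IDF
```

A stopword like "the" that appears in every doc gets an IDF of zero, while a rare keyword gets a large positive weight, which is exactly the intuition above.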
Very few IR concepts and formulas withstood the test of time as well as IDF. 46 years later, Prof. Spärck Jones' findings are still represented in BM25F, which is itself still considered state-of-the-art today. #InformationRetrieval #SEO #TechnicalSEO #HistoryOfSearch
TF and IDF are very complementary to each other. Together they are combined into TFIDF, the legendary #InformationRetrieval formula - the score of a doc for a given search query is the sum over all the query terms of the product of TF and IDF. #SEO #TechnicalSEO #HistoryOfSearch
We are now in 1994 and years of #InformationRetrieval research led to "Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval" by Robertson and Walker, which introduced BM ("Best Match") formulas, leading to BM25. citeseerx.ist.psu.edu/viewdoc/summar…
BM25 keeps the IDF term intact but overhauls the TF term. A major issue with TF is that repetitions increase the score linearly, but the probability that a doc is relevant doesn't double when a keyword appears 10x instead of 5x. #InformationRetrieval #SEO #TechnicalSEO #HistoryOfSearch
BM25 incorporates strongly diminishing returns as the same keyword gets repeated again and again. See graph below - the first occurrence gives 1pt, the next 4 give another pt, but the score is barely 2.5 after 20 occurrences. #InformationRetrieval #SEO #TechnicalSEO #HistoryOfSearch
Another intuition of BM25 is the attempt to capture the subtleties behind doc length. A short doc can be either "thin" or "to the point". A long doc can be either "verbose" or "comprehensive". #InformationRetrieval #SEO #TechnicalSEO #HistoryOfSearch
Putting everything together, this is what the new TF term looks like in BM25. k and b are carefully tuned constants that capture respectively the importance of repetitions and the weight of doc length in the final score. #InformationRetrieval #SEO #TechnicalSEO #HistoryOfSearch
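The saturating TF term is a one-liner. A sketch using the textbook parameterization with common default constants (the exact k1 and b values behind the graph above may differ):

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25's saturating replacement for raw TF.
    k1 caps the benefit of repetitions (score never exceeds k1+1);
    b blends in doc-length normalization (0 = off, 1 = full)."""
    norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * norm)

# For an average-length doc: 1 occurrence -> 1.0, then sharply
# diminishing returns as the keyword repeats.
[round(bm25_tf(n, 100, 100), 2) for n in (1, 5, 10, 20)]
```

Whatever the constants, the score is bounded by k1+1, so keyword stuffing quickly stops paying off — a property the linear TF of 1957 lacked.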
Ten years later, a team at @MSFTResearch published "Simple BM25 Extension to Multiple Weighted Fields", a.k.a. BM25F. The idea is that terms appearing in the title or abstract of a doc are more predictive of relevance than those appearing in the body. citeseerx.ist.psu.edu/viewdoc/summar…
The resulting formula is pretty straightforward: replace the per-doc term frequencies inside the TF term with their weighted sum across fields (title, abstract, body, etc). BM25F is still considered state-of-the-art today. #InformationRetrieval #SEO #TechnicalSEO #HistoryOfSearch
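A simplified sketch of that field-pooling idea (the paper also normalizes each field by its own length, which I omit here; the field weights are made-up examples):

```python
def bm25f_tf(field_tfs, field_weights):
    """Pool per-field term frequencies into one weighted
    pseudo-frequency, as in BM25F."""
    return sum(field_weights[f] * field_tfs.get(f, 0.0) for f in field_weights)

def saturate(tf, k1=1.2):
    """BM25 saturation applied once, to the pooled frequency
    (length normalization omitted for clarity)."""
    return tf * (k1 + 1) / (tf + k1)

# Hypothetical weights: a title hit counts 5x a body hit.
weights = {"title": 5.0, "body": 1.0}
saturate(bm25f_tf({"title": 1}, weights))  # one title occurrence
saturate(bm25f_tf({"body": 1}, weights))   # one body occurrence
```

The key design choice is pooling the fields *before* saturating: one occurrence in the title is worth several in the body, yet repetitions still hit diminishing returns across the whole doc.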
Now switching gears a little bit - search engines started looking at other signals beyond keyword frequencies in order to improve relevance. The most famous: "The PageRank Citation Ranking: Bringing Order to the Web" by Page, Brin et al. citeseerx.ist.psu.edu/viewdoc/summar…
PageRank has been studied and analyzed at length so I won't get too much into details. Here's the canonical formula that captures the idea that pages propagate some of their "authority" to all the other pages they link to. #InformationRetrieval #SEO #TechnicalSEO #HistoryOfSearch
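The formula lends itself to a compact power-iteration sketch. This is the textbook algorithm on a toy link graph, not anything resembling a production implementation (the damping factor 0.85 is the value commonly cited from the paper):

```python
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}.
    Each iteration, every page keeps (1-d)/N baseline authority
    and receives a d-damped share of the rank of each page
    linking to it."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * pr[p] / len(outs)
                for t in outs:
                    new[t] += share
            else:
                # dangling page: spread its rank uniformly
                for t in pages:
                    new[t] += d * pr[p] / n
        pr = new
    return pr

# Toy web: "a" earns links from "c" and "d", so it accumulates authority.
pagerank({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
```

The ranks always sum to 1, and a page with more (or better) inbound links ends up with a higher score — authority propagating through the link graph, as the tweet says.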
Another fundamental work from 2006 is "Improving Web Search Ranking by Incorporating User Behavior" by Agichtein et al. @MSFTResearch The main idea is that users clicking on search results is valuable implicit feedback that should be factored in the model. microsoft.com/en-us/research…