1/ It's not the size, it's the skill - now releasing #Neeva's Query Embedding Model!
Our query embedding model beats @openai’s Curie, which is orders of magnitude bigger and 100,000x more expensive. 🤯
Keep reading to find out how... 📖
2/ Query understanding is the lifeblood of #searchengines. Large search engines have spent millions of SWE hours building signals like synonymy, spelling, term weighting, compounds, etc.
3/ We solve the problem of #query similarity: deciding when 2 user queries are looking for the same information on the web.
Why is this useful? Query-click data for web docs is the strongest signal for search, QA, etc.; solving query equivalence lets us spread that click signal across many more user queries.
4/ Not so obvious? Query equivalence is a suitcase problem. 🛅
Once unpacked, it involves solving many semantic understanding problems.
Most importantly, it involves understanding the myriad ways in which people talk to #searchengines.
5/ We use a #BERT model to encode queries in a 384 dimension space and use dot products in this space to compute a query equivalence score.
We use sentence BERT (sbert.net) as a starting point.
The main question is how do you train this model? 🤔
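The scoring step above can be sketched in a few lines of numpy. The encoder here is a random stand-in for the trained SBERT-style model (a real system would call the model's `encode`), but the geometry is the same: unit-normalize the 384-dim embeddings, and the dot product is the cosine-similarity equivalence score.

```python
import numpy as np

def embed(query: str, dim: int = 384) -> np.ndarray:
    """Stand-in for an SBERT-style query encoder: a deterministic
    pseudo-random unit vector per query (real systems call the model)."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def equivalence_score(q1: str, q2: str) -> float:
    """Dot product of unit vectors equals their cosine similarity."""
    return float(embed(q1) @ embed(q2))

# Identical queries map to the same unit vector, so the score is ~1.0;
# unrelated queries land anywhere in [-1, 1].
same = equivalence_score("cheap flights nyc", "cheap flights nyc")
```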
6/ Answer: We use a trick to generate training data.
We created query pairs that have overlapping results in their top 5 and generated a “soft label” for query-pair similarity = #{overlapping results in top 5}/5 (labels = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0).
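The soft-label trick is simple enough to fit in one function. The result lists below are hypothetical placeholders; the label is just the fraction of shared results in the two queries' top 5.

```python
def soft_label(results_a: list, results_b: list, k: int = 5) -> float:
    """Similarity soft label = |overlap of the two top-k result sets| / k.
    With k=5 this yields exactly the labels 0.0, 0.2, 0.4, 0.6, 0.8, 1.0."""
    return len(set(results_a[:k]) & set(results_b[:k])) / k

# Hypothetical top-5 result lists sharing 3 of 5 results -> label 0.6.
label = soft_label(["a.com", "b.com", "c.com", "d.com", "e.com"],
                   ["c.com", "d.com", "e.com", "f.com", "g.com"])
```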
7/ We then train a biencoder model, minimizing the L2 distance between the soft labels and the cosine similarity of the query embeddings.
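For one query pair, the objective looks like this numpy sketch: squared error between the embeddings' cosine similarity and the overlap soft label. (A real biencoder would backprop this loss through the shared BERT encoder over batches; this just shows the loss itself.)

```python
import numpy as np

def pair_loss(emb_a: np.ndarray, emb_b: np.ndarray, label: float) -> float:
    """Squared (L2) error between the cosine similarity of two query
    embeddings and their result-overlap soft label."""
    cos = float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    return (cos - label) ** 2

# Identical embeddings with label 1.0 incur ~zero loss;
# orthogonal embeddings with label 0.0 likewise.
zero_loss = pair_loss(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0]), 1.0)
```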
8/ Mining hard negatives is one of the trickiest and noisiest aspects of similarity/contrastive learning. Our trick lets us get around this by using soft labels based on web result overlap.
⭐ Bonus! Our predicted similarities end up well calibrated in the [0,1] range.
9/ Training this model on our soft-label data yields a state-of-the-art model. All by using domain-specific knowledge and lots of automatically generated labels.
1/ Google will do just about anything to maintain its monopoly power
Fear-inducing pop-ups with misleading designs to trick users into going back to Google search ✅
What’s a competing search engine to do? The only thing we can…Design and innovate our way out of it! 🧵
2/ By default, Google Chrome comes with Google search – no surprise there.
However, if a user prefers a more privacy-focused search engine, they have to jump through a few hoops to install an extension and set it as the default.
All in all not terribly difficult so far, but…
3/ The last thing Google wants is to lose a user, especially from their cash cow – search.
So, under the guise of security, upon installing a new search extension such as Neeva and attempting your first search from the omnibar, Google deploys a misleading warning prompt.
1/ When someone types “neeva” into search, how do we know they mean “neeva.com” instead of “neevaneevaneeva.com”? After all, the second has 3 times as much neeva!
See how you can do much better than vanilla TF-IDF / cosine similarity for textual relevance!🧵
2/ Textual relevance is only one part of document ranking (alongside signals like centrality, page quality, and click rate).
But it’s one of the most important parts and the one we’ll be covering in today’s thread.
3/ The most popular way to rank documents relative to queries is to use TF-IDF vector representation.
Essentially, the claim is: the more often a term occurs on a page (TF), and the less often it occurs on other pages (IDF), the more likely that term is to be relevant to the page.
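A tiny sketch of why vanilla TF-IDF falls for the "3 times as much neeva" trick, and one standard fix: vanilla TF-IDF is linear in term frequency, while a BM25-style saturated TF gives diminishing returns per repetition. (BM25 saturation is a well-known improvement, not necessarily Neeva's exact method.)

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Vanilla TF-IDF: linear in term frequency, so repeating a
    term 3x triples the score."""
    return tf * math.log(n_docs / df)

def bm25_tf(tf: int, k1: float = 1.2) -> float:
    """BM25-style TF saturation: each extra occurrence adds less,
    so 'neevaneevaneeva.com' cannot triple its score by repetition."""
    return tf * (k1 + 1) / (tf + k1)

# With tf=3, vanilla TF-IDF is exactly 3x the tf=1 score,
# while the saturated TF stays well under 3x.
vanilla_ratio = tf_idf(3, 10, 1000) / tf_idf(1, 10, 1000)
saturated_ratio = bm25_tf(3) / bm25_tf(1)
```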
3/ Building a comprehensive index of the web is a prerequisite to competing in search. The first step for the Neeva search engine is “downloading the Internet” via Neeva’s crawler (Neevabot).
However, many sites only allow Google and Microsoft unfettered access to crawl/collect info