1/ It's not the size, it's the skill - now releasing #Neeva's Query Embedding Model!
Our query embedding model beats @openai’s Curie, which is orders of magnitude bigger and 100,000x more expensive. 🤯
Keep reading to find out how... 📖
2/ Query understanding is the lifeblood of #searchengines. Large search engines have spent millions of SWE hours building signals like synonymy, spelling, term weighting, compounds, etc.
3/ We solve the problem of #query similarity: deciding when 2 user queries are looking for the same information on the web.
Why is this useful? Query-click data for web docs is the strongest signal for search, QA, etc.; solving query equivalence lets us spread that click signal across many more user queries.
4/ Not so obvious? Query equivalence is a suitcase problem. 🛅
Once unpacked, it involves solving many semantic understanding problems.
Most importantly, it involves understanding the myriad ways in which people talk to #searchengines.
5/ We use a #BERT model to encode queries in a 384 dimension space and use dot products in this space to compute a query equivalence score.
We use sentence BERT (sbert.net) as a starting point.
The main question is how do you train this model? 🤔
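The scoring step above can be sketched in a few lines of numpy. The encoder here is a random stand-in for the trained SBERT-style model (a real system would call the model's `encode`), but the geometry is the same: unit-normalize the 384-dim embeddings, and the dot product is the cosine-similarity equivalence score.

```python
import numpy as np

def embed(query: str, dim: int = 384) -> np.ndarray:
    """Stand-in for an SBERT-style query encoder: a deterministic
    pseudo-random unit vector per query (real systems call the model)."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def equivalence_score(q1: str, q2: str) -> float:
    """Dot product of unit vectors equals their cosine similarity."""
    return float(embed(q1) @ embed(q2))

# Identical queries map to the same unit vector, so the score is ~1.0;
# unrelated queries land anywhere in [-1, 1].
same = equivalence_score("cheap flights nyc", "cheap flights nyc")
```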
6/ Answer: We use a trick to generate training data.
We created query pairs that have overlapping results in their top 5 and generated a “soft label” for query-pair similarity = #{overlapping results in top 5}/5 (labels = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0).
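The soft-label trick is simple enough to fit in one function. The result lists below are hypothetical placeholders; the label is just the fraction of shared results in the two queries' top 5.

```python
def soft_label(results_a: list, results_b: list, k: int = 5) -> float:
    """Similarity soft label = |overlap of the two top-k result sets| / k.
    With k=5 this yields exactly the labels 0.0, 0.2, 0.4, 0.6, 0.8, 1.0."""
    return len(set(results_a[:k]) & set(results_b[:k])) / k

# Hypothetical top-5 result lists sharing 3 of 5 results -> label 0.6.
label = soft_label(["a.com", "b.com", "c.com", "d.com", "e.com"],
                   ["c.com", "d.com", "e.com", "f.com", "g.com"])
```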
7/ We then train a biencoder model, minimizing the L2 distance between the soft labels and the cosine similarity of the query embeddings.
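For one query pair, the objective looks like this numpy sketch: squared error between the embeddings' cosine similarity and the overlap soft label. (A real biencoder would backprop this loss through the shared BERT encoder over batches; this just shows the loss itself.)

```python
import numpy as np

def pair_loss(emb_a: np.ndarray, emb_b: np.ndarray, label: float) -> float:
    """Squared (L2) error between the cosine similarity of two query
    embeddings and their result-overlap soft label."""
    cos = float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    return (cos - label) ** 2

# Identical embeddings with label 1.0 incur ~zero loss;
# orthogonal embeddings with label 0.0 likewise.
zero_loss = pair_loss(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0]), 1.0)
```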
8/ Mining hard negatives is one of the trickiest and noisiest aspects of similarity/contrastive learning. Our trick lets us get around this by using soft labels based on web result overlap.
⭐ Bonus! Our predicted similarities end up well calibrated in the [0,1] range.
9/ Training this model on our soft-label data yields a state-of-the-art model. All by using domain-specific knowledge and lots of automatically generated labels.
1/ Google will do just about anything to maintain its monopoly power
Fear-inducing pop-ups with misleading designs to trick users into going back to Google search ✅
What’s a competing search engine to do? The only thing we can…Design and innovate our way out of it! 🧵
2/ By default, Google Chrome comes with Google search – no surprise there.
However, if a user prefers a more privacy-focused search engine, they have to jump through a few hoops to install an extension and set it as the default.
All in all not terribly difficult so far, but…
3/ The last thing Google wants is to lose a user, especially from their cash cow – search.
So, under the guise of security, upon installing a new search extension such as Neeva and attempting your first search from the omnibar, Google deploys a misleading warning prompt.
1/ When someone types “neeva” into search, how do we know they mean “neeva.com” instead of “neevaneevaneeva.com”? After all, the second has 3 times as much neeva!
See how you can do much better than vanilla TF-IDF / cosine similarity for textual relevance!🧵
2/ Textual relevance is only one part of document ranking (alongside signals like centrality, page quality, and click rate).
But it’s one of the most important parts and the one we’ll be covering in today’s thread.
3/ The most popular way to rank documents relative to queries is to use TF-IDF vector representation.
Essentially, the claim is: the more often a term occurs on a page (TF), and the less often it occurs on other pages (IDF), the more likely that term is to be relevant to the page.
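A tiny sketch of why vanilla TF-IDF falls for the "3 times as much neeva" trick, and one standard fix: vanilla TF-IDF is linear in term frequency, while a BM25-style saturated TF gives diminishing returns per repetition. (BM25 saturation is a well-known improvement, not necessarily Neeva's exact method.)

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Vanilla TF-IDF: linear in term frequency, so repeating a
    term 3x triples the score."""
    return tf * math.log(n_docs / df)

def bm25_tf(tf: int, k1: float = 1.2) -> float:
    """BM25-style TF saturation: each extra occurrence adds less,
    so 'neevaneevaneeva.com' cannot triple its score by repetition."""
    return tf * (k1 + 1) / (tf + k1)

# With tf=3, vanilla TF-IDF is exactly 3x the tf=1 score,
# while the saturated TF stays well under 3x.
vanilla_ratio = tf_idf(3, 10, 1000) / tf_idf(1, 10, 1000)
saturated_ratio = bm25_tf(3) / bm25_tf(1)
```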
3/ Building a comprehensive index of the web is a prerequisite to competing in search. The first step for the Neeva search engine is “downloading the Internet” via Neeva’s crawler (Neevabot).
However, many sites only allow Google and Microsoft unfettered access to crawl/collect info