(3/) Building a comprehensive index of the web is a prerequisite to competing in search. The first step for the Neeva search engine is “downloading the Internet” via Neeva’s crawler (Neevabot)
However, many sites give only Google and Microsoft unfettered access to crawl and collect info
(4/) These sites disallow everything else by default in their robots.txt file.
At Neeva, we implement a policy of “crawling a site so long as the robots.txt allows GoogleBot and does not specifically disallow Neevabot”. Despite this workaround, ~30% of the web is inaccessible.
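That policy can be sketched with Python's stdlib robots.txt parser. This is a hypothetical illustration, not Neevabot's actual code: if the file has a group specifically addressed to Neevabot, honor it; otherwise fall back to whatever Googlebot is allowed.

```python
from urllib.robotparser import RobotFileParser

def mentions_agent(robots_txt: str, agent: str) -> bool:
    """True if robots.txt has a User-agent line naming `agent` specifically."""
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("user-agent:"):
            if line.split(":", 1)[1].strip().lower() == agent.lower():
                return True
    return False

def may_crawl(robots_txt: str, path: str) -> bool:
    """Crawl if robots.txt allows Googlebot and does not
    specifically disallow Neevabot."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if mentions_agent(robots_txt, "Neevabot"):
        # The site addressed Neevabot directly: honor its rules.
        return rp.can_fetch("Neevabot", path)
    # Otherwise, crawl whatever Googlebot is allowed to crawl.
    return rp.can_fetch("Googlebot", path)
```

Under a typical "Googlebot-only" default (`User-agent: *` / `Disallow: /` plus a Googlebot allow group), this sketch still crawls; an explicit `User-agent: Neevabot` disallow group is respected.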
(5/) Even when a site lets an upstart search engine crawler in via robots.txt, it often blocks the crawler in other ways, either by throwing various kinds of errors (503s, 429s, …) or by rate throttling.
(6/) Many retailer websites do this.
Many big content aggregators do this.
And many smaller sites do this as well.
If you want to crawl these sites, you have to deploy workarounds like “crawling using a bank of proxy IPs that rotate periodically”, which we are loath to do.
(7/) Additional time, cost, and resources just to be treated equally to Big Tech!!
tl;dr: the web is an actively hostile environment for upstart search engine crawlers like Neevabot, further reducing competition in search.
(9/) Thanks to them, our crawl now runs at hundreds of millions of pages a day, on track to hit billions of pages a day soon.
(10/) Yet, it’s a lot of work when the defaults are “googlebot only”.
And in many cases, impossible.
Moreover, gaining permission to crawl shouldn’t be about who you know.
There should be an equal playing field for anyone competing and following the rules.
(11/) Upstart search engines like @Neeva want a chance to compete with fair and equitable access to the web.
We put a lot of effort into building a well behaved crawler that respects rate limits, and crawls at the minimum rate needed to build a great search engine.
(12/) Meanwhile, Google has carte blanche. It crawls 50B pages per day, visiting every page on the web roughly once every 3 days and taxing network bandwidth on every website.
This is the monopolist’s tax on the Internet.
(13/) Neeva’s crawler is capable of crawling the web at the speed and depth that Google does.
There are no technical limitations here.
Just market forces.
(14/) To be clear, webmasters are doing the best they can.
(15/) Google is a monopoly in search and sites are faced with an impossible choice. Either let Google crawl them, or don’t show up prominently in Google results.
(16/) And it’s too much additional work for them to distinguish bad bots that slow down their websites from legitimate search engines that need to crawl them to serve up relevant results.
(17/) In other words, the Google search monopoly causes the Internet at large to reinforce the monopoly by giving Googlebot preferential access.
(18/) Regulators and policymakers need to step in if they care for competition in search.
The market needs "crawl neutrality", similar to net neutrality.
Don't discriminate between search engine crawlers based on who they are.
(19/) No special deals for Googlebot and a different set of rules for everyone else.
(20/) We aren’t asking for anything more than a fair chance to compete.
1/ When someone types “neeva” into search, how do we know they mean “neeva.com” instead of “neevaneevaneeva.com”? After all, the second has 3 times as much neeva!
See how you can do much better than vanilla TF-IDF / cosine similarity for textual relevance!🧵
2/ Textual relevance is only one part of document ranking (alongside signals like centrality, page quality, and click rate)
But it’s one of the most important parts, and the one we’ll be covering in today’s thread.
3/ The most popular way to rank documents relative to queries is to use a TF-IDF vector representation.
Essentially, this claims that the more often a term occurs on a page (TF), and the less often it occurs on other pages (IDF), the more likely that term is to be relevant to the page.
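Tweet 1's "neeva" example can be reproduced with a minimal, hypothetical sketch (not Neeva's ranking code; the documents, query, and sklearn-style smoothed IDF are assumptions for illustration). Under vanilla TF-IDF + cosine, the page that just repeats "neeva" scores highest:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Vanilla TF-IDF: tf = raw term count, idf = the smoothed
    log((1 + N) / (1 + df)) + 1 variant (chosen here so no
    weight collapses to zero)."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + N) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["neeva is an ad free private search engine",
        "neeva neeva neeva"]
vecs = tf_idf_vectors(docs)
query = {"neeva": 1.0}
# The repetitive page wins under vanilla TF-IDF + cosine --
# exactly the failure mode from tweet 1.
assert cosine(query, vecs[1]) > cosine(query, vecs[0])
```

Because the spam page contains nothing but the query term, its vector points entirely along the "neeva" axis and gets a perfect cosine score, which is why later tweets argue for going beyond vanilla TF-IDF.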