Neeva
Jun 6 20 tweets 9 min read
(1/) Competing in search starts from crawling the web. You can't serve up search results if you can't crawl the web.

Crawling the web is very hard today because of the discriminatory and #anticompetitive nature of how crawlers are treated on the web. 🧵
(3/) Building a comprehensive index of the web is a prerequisite to competing in search. The first step for the Neeva search engine is “downloading the Internet” via Neeva’s crawler (Neevabot).

However, many sites allow only Google and Microsoft unfettered access to crawl and collect information.
(4/) These sites disallow everything else by default in their robots.txt file.

At Neeva, we implement a policy of “crawling a site so long as the robots.txt allows Googlebot and does not specifically disallow Neevabot”. Even with this workaround, ~30% of the web is inaccessible to us.
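A minimal sketch of how such a policy could be checked, using Python's standard `urllib.robotparser`. This is an illustration of the rule as stated in the tweet, not Neeva's actual implementation; the function names `names_agent` and `may_crawl` are hypothetical, and "specifically disallow" is interpreted here as "a User-agent group names Neevabot and its rules block the URL".

```python
from urllib.robotparser import RobotFileParser


def names_agent(robots_txt: str, token: str) -> bool:
    """Return True if any User-agent line in robots.txt mentions `token`."""
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            if token.lower() in value.strip().lower():
                return True
    return False


def may_crawl(robots_txt: str, url: str) -> bool:
    """Hypothetical policy check: crawl if Googlebot is allowed and
    Neevabot is not explicitly named and blocked."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    googlebot_ok = parser.can_fetch("Googlebot", url)
    # Only treat Neevabot as blocked when a group names it directly;
    # a blanket "User-agent: *" disallow does not count under this policy.
    neevabot_blocked = (
        names_agent(robots_txt, "neevabot")
        and not parser.can_fetch("Neevabot", url)
    )
    return googlebot_ok and not neevabot_blocked
```

Under this reading, a site whose robots.txt allows Googlebot and disallows `*` would be crawled, while a site that names Neevabot in a `Disallow` group would not.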
(5/) Even when a site allows in an upstart search engine crawler via robots.txt, they block it in other ways, either by throwing various kinds of errors (503s, 429s, …) or rate throttling.
(6/) Many retailer websites do this.

Many big content aggregators do this.

And many smaller sites do this as well.

If you want to crawl these sites, you have to deploy workarounds like “crawling using a bank of proxy IPs that rotate periodically”, which we are loath to do.
(7/) Additional time, cost, and resources just to be treated equally to Big Tech!!

TL;DR: the web is an actively hostile environment for upstart search engine crawlers like Neevabot, which further reduces competition in search.
(8/) Fortunately, a set of well-wishers, webmasters, and well-meaning publishers have helped us get whitelisted. (Thanks to @Cloudflare and our friends at @Quora, @LinkedIn, @Reddit, @Medium, @YouTube, @GitHub, @Amazon, @Meta, @Twitter, and many other sites.)
(9/) Thanks to them, our crawl now runs at hundreds of millions of pages a day, on track to hit billions of pages a day soon.
(10/) Yet it’s a lot of work when the defaults are “Googlebot only”.

And in many cases, impossible.

Moreover, gaining permission to crawl shouldn’t be about who you know.

There should be a level playing field for anyone competing and following the rules.
(11/) Upstart search engines like @Neeva want a chance to compete with fair and equitable access to the web.

We put a lot of effort into building a well-behaved crawler that respects rate limits and crawls at the minimum rate needed to build a great search engine.
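As a rough sketch of what respecting rate limits can look like, the function below decides how long a crawler should wait after one of the throttling responses mentioned earlier in the thread (429 or 503). It is an assumption-laden illustration, not Neevabot's actual logic: the name `backoff_delay`, the one-second base, and the five-minute cap are all hypothetical choices. It honors a numeric `Retry-After` header when the server sends one and otherwise falls back to capped exponential backoff.

```python
from typing import Optional


def backoff_delay(
    status: int,
    retry_after: Optional[str],
    attempt: int,
    base: float = 1.0,   # hypothetical starting delay in seconds
    cap: float = 300.0,  # hypothetical upper bound on self-chosen delays
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if no backoff is needed."""
    if status not in (429, 503):
        return None
    # Prefer the server's own instruction when it is a plain number of seconds.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    # Otherwise double the wait on each attempt, up to the cap.
    return min(base * (2 ** attempt), cap)
```

The server-supplied value is deliberately not capped here, on the assumption that a polite crawler should wait at least as long as the site asks.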
(12/) Meanwhile, Google has carte blanche. It crawls the web at 50B pages per day. It visits every page on the web once every 3 days, and taxes network bandwidth on all websites.

This is the monopolist’s tax on the Internet.
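For a sense of scale, here is a back-of-the-envelope calculation using only the figures quoted in this tweet (50B pages per day, a 3-day revisit cycle). The helper names are hypothetical, and the results are simple arithmetic on the thread's numbers, not independently measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_second(pages_per_day: float) -> float:
    """Average fetch rate implied by a daily crawl volume."""
    return pages_per_day / SECONDS_PER_DAY


def implied_pages_covered(pages_per_day: float, revisit_days: float) -> float:
    """Pages covered if every page is revisited once per `revisit_days`."""
    return pages_per_day * revisit_days


google_daily = 50e9  # figure quoted in the thread
print(f"{pages_per_second(google_daily):,.0f} pages per second")
print(f"{implied_pages_covered(google_daily, 3):,.0f} pages in a 3-day cycle")
```

Taken at face value, the quoted figures imply an average of roughly 580,000 page fetches per second and about 150B pages covered per cycle.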
(13/) Neeva’s crawler is capable of crawling the web at the speed and depth that Google does.

There are no technical limitations here.

Just market forces.
(14/) To be clear, webmasters are doing the best they can.
(15/) Google is a monopoly in search and sites are faced with an impossible choice. Either let Google crawl them, or don’t show up prominently in Google results.
(16/) And it’s too much additional work to distinguish bad bots that slow down their websites from legitimate search engines that need to crawl them to serve up relevant results.
(17/) In other words, the Google search monopoly causes the Internet at large to reinforce the monopoly by giving Googlebot preferential access.
(18/) Regulators and policymakers need to step in if they care about competition in search.

The market needs "crawl neutrality", similar to net neutrality.

Don't discriminate between search engine crawlers based on who they are.
(19/) No special deals for Googlebot and a different set of rules for everyone else.
(20/) We aren’t asking for anything more than a fair chance to compete.

To learn more about us, follow @Neeva and visit neeva.com.
