(3/) Building a comprehensive index of the web is a prerequisite to competing in search. The first step for the Neeva search engine is “downloading the Internet” via Neeva’s crawler (Neevabot)
However, many sites give only Google and Microsoft unfettered access to crawl and collect info
(4/) These sites disallow everything else by default in their robots.txt file.
At Neeva, we implement a policy of “crawling a site so long as the robots.txt allows GoogleBot and does not specifically disallow Neevabot”. Despite this workaround, ~30% of the web is inaccessible.
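That policy can be sketched with Python's stdlib robots.txt parser. This is a hypothetical illustration, not Neevabot's actual code: if the file has a group specifically addressed to Neevabot, honor it; otherwise fall back to whatever Googlebot is allowed.

```python
from urllib.robotparser import RobotFileParser

def mentions_agent(robots_txt: str, agent: str) -> bool:
    """True if robots.txt has a User-agent line naming `agent` specifically."""
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("user-agent:"):
            if line.split(":", 1)[1].strip().lower() == agent.lower():
                return True
    return False

def may_crawl(robots_txt: str, path: str) -> bool:
    """Crawl if robots.txt allows Googlebot and does not
    specifically disallow Neevabot."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if mentions_agent(robots_txt, "Neevabot"):
        # The site addressed Neevabot directly: honor its rules.
        return rp.can_fetch("Neevabot", path)
    # Otherwise, crawl whatever Googlebot is allowed to crawl.
    return rp.can_fetch("Googlebot", path)
```

Under a typical "Googlebot-only" default (`User-agent: *` / `Disallow: /` plus a Googlebot allow group), this sketch still crawls; an explicit `User-agent: Neevabot` disallow group is respected.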
(5/) Even when a site lets an upstart search engine crawler in via robots.txt, it often blocks the crawler in other ways, either by throwing various kinds of errors (503s, 429s, …) or by rate throttling.
(6/) Many retailer websites do this.
Many big content aggregators do this.
And many smaller sites do this as well.
If you want to crawl these sites, you have to deploy workarounds like “crawling using a bank of proxy IPs that rotate periodically”, which we are loath to do.
(7/) Additional time, cost, and resources just to be treated equally to Big Tech!!
tl;dr: the web is an actively hostile environment for upstart search engine crawlers like Neevabot, further reducing competition in search.
(9/) Thanks to them, our crawl now runs at hundreds of millions of pages a day, on track to hit billions of pages a day soon.
(10/) Yet, it’s a lot of work when the defaults are “googlebot only”.
And in many cases, impossible.
Moreover, gaining permission to crawl shouldn’t be about who you know.
There should be an equal playing field for anyone competing and following the rules.
(11/) Upstart search engines like @Neeva want a chance to compete with fair and equitable access to the web.
We put a lot of effort into building a well behaved crawler that respects rate limits, and crawls at the minimum rate needed to build a great search engine.
(12/) Meanwhile, Google has carte blanche. It crawls 50B pages per day, visiting every page on the web roughly once every 3 days and taxing network bandwidth on every website.
This is the monopolist’s tax on the Internet.
(13/) Neeva’s crawler is capable of crawling the web at the speed and depth that Google does.
There are no technical limitations here.
Just market forces.
(14/) To be clear, webmasters are doing the best they can.
(15/) Google is a monopoly in search and sites are faced with an impossible choice. Either let Google crawl them, or don’t show up prominently in Google results.
(16/) And it’s too much additional work for them to distinguish bad bots that slow down their websites from legitimate search engines that need to crawl them to serve up relevant results.
(17/) In other words, the Google search monopoly causes the Internet at large to reinforce the monopoly by giving Googlebot preferential access.
(18/) Regulators and policymakers need to step in if they care for competition in search.
The market needs "crawl neutrality", similar to net neutrality.
Don't discriminate between search engine crawlers based on who they are.
(19/) No special deals for Googlebot and a different set of rules for everyone else.
(20/) We aren’t asking for anything more than a fair chance to compete.
1/ When someone types “neeva” into search, how do we know they mean “neeva.com” instead of “neevaneevaneeva.com”? After all, the second has 3 times as much neeva!
See how you can do much better than vanilla TF-IDF / cosine similarity for textual relevance!🧵
2/ Textual relevance is only one part of document ranking (alongside signals like centrality, page quality, and click rate)
But it’s one of the most important parts, and the one we’ll be covering in today’s thread.
3/ The most popular way to rank documents relative to queries is to use a TF-IDF vector representation.
Essentially, this claims that the more often a term occurs on a page (TF), and the less often it occurs on other pages (IDF), the more likely that term is to be relevant to the page.
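Tweet 1's "neeva" example can be reproduced with a minimal, hypothetical sketch (not Neeva's ranking code; the documents, query, and sklearn-style smoothed IDF are assumptions for illustration). Under vanilla TF-IDF + cosine, the page that just repeats "neeva" scores highest:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Vanilla TF-IDF: tf = raw term count, idf = the smoothed
    log((1 + N) / (1 + df)) + 1 variant (chosen here so no
    weight collapses to zero)."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + N) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["neeva is an ad free private search engine",
        "neeva neeva neeva"]
vecs = tf_idf_vectors(docs)
query = {"neeva": 1.0}
# The repetitive page wins under vanilla TF-IDF + cosine --
# exactly the failure mode from tweet 1.
assert cosine(query, vecs[1]) > cosine(query, vecs[0])
```

Because the spam page contains nothing but the query term, its vector points entirely along the "neeva" axis and gets a perfect cosine score, which is why later tweets argue for going beyond vanilla TF-IDF.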