Sunday, April 5, 2015

Is that Googlebot user agent really from Google?

How do you know whether a client identifying itself as Googlebot is actually from Google? And Bingbot? And all the others? On GitHub there's a newly published library, written in Java and built with Maven, that performs the checks for you. And why would you care?

There are different kinds of visitors to your website. Some webmasters don't mind any of the traffic, don't discriminate, and let them all pass. Others block, limit, or throttle some of the requests. But why?
In the end it all boils down to the visitors you want, and those that you don't want: those that bring you potential benefit, and those that only cost you traffic and, possibly worse, duplicate your site content, harvest email addresses, or associate your site with a bad reputation.

There are two kinds of users visiting your site: humans and robots. This article (and the software) is about the bots only.

The robots can be classified as follows:
- search engine crawlers: googlebot, bingbot, baiduspider, ...
- other purpose crawlers: archive.org_bot, ahrefsbot, ...
- custom crawlers: data scrapers, email harvesters, offline readers, ...

Search engines

Search engines operate crawlers to feed their indexes. You generally want to let these through because they drive visitors to your site. The more pages they index, the better for you.

However, there may be exceptions. Dan Birken has an interesting read on this: Baidu is a greedy spider that possibly gives you nothing in return, depending on your target audience.

And these days search engines don't only operate text search; there are also image search, video search, and more. Some sites have no interest in appearing in Google's image search, and therefore lock out the Googlebot-Image crawler.
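For example, a robots.txt rule like the following asks Google's image crawler to stay out entirely (which only well-behaved bots will honor):

    User-agent: Googlebot-Image
    Disallow: /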

Other purpose crawlers

The web archive is a nice thing. Some webmasters choose not to have a public history of their site, and block the bot.

Common Crawl is another one that some don't mind and others forbid, depending on their site content.

Blocking bloodsuckers such as the ahrefsbot, a6, ADmantX, and more has a larger fan base.

And finally blocking email harvesters should be on every webmaster's menu.

Custom crawlers

This is off-the-shelf or even custom-tailored software used to grab parts of your website, or all of it. These bots are not run from a specific website or service provider; any script kiddie can fire them up. If you operate a blog on shared hosting with unlimited traffic, you probably don't mind. In the best case you end up being listed in some blogging directory. In a worse case, parts of your content appear remixed on an SEO index spammer's fake portal, aligned with links to sites of bad reputation.

The problem: impersonators

As long as all visitors honestly identify themselves with the correct user-agent, the task is simple. You define in your /robots.txt which crawlers are allowed. Task accomplished. But as we all know (that's where you nod), the user-agent is just an HTTP header field that anyone can modify. Even and especially from PHP. Pun intended.
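To illustrate how cheap the lie is, here is a minimal Java snippet (the URL is just example data) that sends a request claiming to be Googlebot:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FakeGooglebot {
        public static void main(String[] args) throws Exception {
            // Any client can claim to be Googlebot simply by setting the header.
            URL url = new URL("https://example.com/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent",
                    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");
            System.out.println("Response code: " + conn.getResponseCode());
        }
    }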

Some masquerade as a common web browser, e.g. Chrome or Firefox. It's a bit more tricky to identify these, but there are ways.

The well-known spider trap trick can help: a link in the HTML to /dont-go-here that humans can't see thanks to CSS, and that nice bots won't follow because you forbid that area in your robots.txt. Then, if someone hits that page, it's either a vampire bot, a curious web developer who views the source, or a blind person whose reader ignores the CSS.
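A minimal sketch of such a trap, reusing the /dont-go-here path from above (the hiding technique and link text are just one possible choice):

    # robots.txt: polite bots are told to stay away
    User-agent: *
    Disallow: /dont-go-here

    <!-- in the page markup: a link no human will see or click -->
    <a href="/dont-go-here" style="display:none">archive</a>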

Other ideas include limiting the amount of traffic per user (IP, network, session). Some sites display a Captcha after too many requests within a given time.
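Such a rate limit can be as simple as a per-IP counter that is reset periodically. The sketch below uses a made-up limit; a production version would also expire old entries and consider networks and sessions:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // A minimal per-IP request counter.
    public class RequestThrottle {

        private static final int MAX_REQUESTS_PER_WINDOW = 300; // made-up default, tune per site
        private final ConcurrentHashMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();

        // Returns true while the client is still under the limit for the current window.
        public boolean allow(String ip) {
            AtomicInteger count = counts.computeIfAbsent(ip, k -> new AtomicInteger());
            return count.incrementAndGet() <= MAX_REQUESTS_PER_WINDOW;
        }

        // Call periodically (e.g. from a scheduled task) to start a new counting window.
        public void resetWindow() {
            counts.clear();
        }
    }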

Other custom crawlers impersonate one of the well-known spiders, usually Googlebot. Chances are there are no request limits in place for it; after all, who wants to risk being dropped from the Google index?

The solution: spider identification

That's where this software library comes in. Clients that identify as one of the top search engines' crawlers need to be verified.

For a long time, identification was done by maintaining IP lists. Good news: for many years the major players have supported and advocated a zero-maintenance system. It works by validating the IP with a reverse DNS lookup followed by a forward DNS lookup.
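The check itself is straightforward: the reverse DNS name of the client IP must lie in the search engine's domain, and a forward lookup of that name must resolve back to the same IP. The following is a minimal, unoptimized Java sketch of that technique for Googlebot; it is not the library's API:

    import java.net.InetAddress;

    public class DnsCrawlerCheck {

        // Verifies a client claiming to be Googlebot:
        // 1. the reverse DNS name of the IP must end in googlebot.com or google.com,
        // 2. a forward DNS lookup of that name must resolve back to the same IP.
        public static boolean isRealGooglebot(String ip) throws Exception {
            InetAddress addr = InetAddress.getByName(ip);
            String host = addr.getCanonicalHostName(); // reverse lookup; yields the IP literal if none exists
            if (!host.endsWith(".googlebot.com") && !host.endsWith(".google.com")) {
                return false; // wrong (or no) domain: treat as spoofed
            }
            for (InetAddress forward : InetAddress.getAllByName(host)) { // forward lookup
                if (forward.getHostAddress().equals(ip)) {
                    return true; // the round trip matches
                }
            }
            return false;
        }

        public static void main(String[] args) throws Exception {
            // 66.249.66.1 was a Googlebot address at the time of writing; results depend on your DNS.
            System.out.println(isRealGooglebot("66.249.66.1"));
        }
    }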

For those search engines and sites that don't support this method, identification still needs IP lists.
History has shown that this approach is problematic: the IP ranges change, and the lists always lag behind. Search for such lists on Google and you'll find totally outdated ones among the top results; lists that still mention msnbot (now bingbot) and Yahoo! Slurp (now defunct).

A library, used as a dependency by many developers and hosted on GitHub where it's easy to contribute, hopefully solves that problem.

Actions taken on spoofers

What action you take if verification returns negative is up to you: block the user, throttle the connection, display a Captcha, or alert the webmaster.

You can run the verification check live as the request comes in; there's an initial price you pay on the first request for the DNS lookups. Or you can run a watchdog on your web server's log files and take action with a slight delay.
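As a sketch of the live variant, a servlet filter can run the check on the first request from each claimed Googlebot IP and cache the verdict, so the DNS cost is paid only once per address. The CrawlerVerifier interface below is a made-up placeholder, not the library's actual API:

    import java.io.IOException;
    import java.util.concurrent.ConcurrentHashMap;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical interface standing in for whatever verification component you plug in.
    interface CrawlerVerifier {
        boolean isGenuine(String userAgent, String ip);
    }

    public class CrawlerCheckFilter implements Filter {

        private final CrawlerVerifier verifier = (ua, ip) -> true; // replace with a real check
        private final ConcurrentHashMap<String, Boolean> verdictCache = new ConcurrentHashMap<>();

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            String ua = request.getHeader("User-Agent");
            if (ua != null && ua.contains("Googlebot")) {
                String ip = request.getRemoteAddr();
                // The DNS cost is paid once per IP; later requests hit the cache.
                boolean genuine = verdictCache.computeIfAbsent(ip, k -> verifier.isGenuine(ua, k));
                if (!genuine) {
                    ((HttpServletResponse) res).sendError(HttpServletResponse.SC_FORBIDDEN);
                    return;
                }
            }
            chain.doFilter(req, res);
        }

        @Override public void init(FilterConfig config) {}
        @Override public void destroy() {}
    }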

Call to action

Head over to https://github.com/optimaize/webcrawler-verifier and replace your custom Googlebot verification hack based on outdated IP lists with this dependency.