
Sunday, April 5, 2015

Is that Googlebot user agent really from Google?

How do you know whether a client identifying as Googlebot is really from Google? And Bingbot? And all the others? On GitHub there's a newly published library, written in Java and built with Maven, that performs the checks for you. And why would you care?

There are different kinds of visitors to your website. Some webmasters don't mind any of the traffic, don't discriminate, and let them all pass. Others block, or at least limit or throttle, some of the requests. But why?
In the end it all boils down to the visitors you want, and those that you don't want. Those that bring you potential benefit, and those that only cost you traffic, or possibly worse: duplicate your site content, harvest email addresses, or associate your site with a bad reputation.

There are two kinds of users visiting your site: humans and robots. This article (and the software) is about the bots only.

The robots can be classified as follows:
- search engine crawlers: googlebot, bingbot, baiduspider, ...
- other purpose crawlers: archive.org_bot, ahrefsbot, ...
- custom crawlers: data scrapers, email harvesters, offline readers, ...

Search engines

Search engines operate crawlers to feed their indexes. You generally want to let these through because they drive visitors to your site. The more pages they index, the better for you.

However, there may be exceptions. Dan Birken has an interesting read on this: Baidu is a greedy spider that possibly gives you nothing in return, depending on your target audience.

And these days search engines don't only operate text search; there are also images, videos, and more. Some sites may have no interest in appearing in Google image search, and therefore lock out the Googlebot-Image crawler.

Other purpose crawlers

The web archive is a nice thing. Some webmasters choose not to have a public history of their site, and block the bot.

Common Crawl is another one that some don't mind and others forbid, depending on their site content.

Blocking bloodsuckers such as the ahrefsbot, a6, ADmantX and more has a larger fan base.

And finally blocking email harvesters should be on every webmaster's menu.

Custom crawlers

This is off-the-shelf or even custom-tailored software built to grab parts of, or your entire, website. These bots are not run from a specific website or service provider; any script kiddie can fire them up. If you operate a blog on shared hosting with unlimited traffic, you probably don't mind. In the best case you end up being listed in some blogging directory. In a worse case, parts of your content appear remixed on an SEO spammer's fake portal, aligned with links to sites of bad reputation.

The problem: impersonators

As long as all visitors identify with the correct and honest user-agent, the task is simple. You define in your /robots.txt which crawlers are allowed. Task accomplished. But as we all know (that's where you nod), the user-agent is just an HTTP header field that anyone can modify. Even and especially from PHP. Pun intended.

Some masquerade as a common web browser, e.g. Chrome or Firefox. It's a bit more tricky to identify these, but there are ways.

The well-known spider trap trick can help: a link in the HTML to /dont-go-here that humans can't see thanks to CSS, and that nice bots won't follow because you forbid the area in robots.txt. Then, if someone hits that page, it's either a vampire bot, a curious web developer who views the source, or a blind person whose browser ignores the CSS.
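As a sketch of such a trap (the /dont-go-here path is the example from above; any real site would pick its own), the robots.txt entry could look like this:

```
# robots.txt: well-behaved bots respect this and never request the trap path
User-agent: *
Disallow: /dont-go-here
```

In the page's HTML, the corresponding link would be hidden from humans, e.g. `<a href="/dont-go-here" style="display:none">stats</a>`. Any client that requests the path anyway has ignored robots.txt and is a candidate for blocking.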

Other ideas are limiting the amount of web traffic per user (IP/network, session). Some sites display a Captcha after too many requests per time window.
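A per-client request limit of that kind can be sketched as a sliding window over recent request timestamps. This is an illustrative toy (names and numbers are made up; production setups usually do this in the load balancer or a webserver module), not part of the library discussed here:

```python
from collections import deque
from time import monotonic


class RateLimiter:
    """Allow at most `limit` requests per `window` seconds, per client key."""

    def __init__(self, limit, window, clock=monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock   # injectable clock, handy for testing
        self.hits = {}       # key (e.g. an IP address) -> deque of timestamps

    def allow(self, key):
        now = self.clock()
        q = self.hits.setdefault(key, deque())
        # Drop timestamps that have fallen out of the window
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # time to throttle, block, or show a Captcha
        q.append(now)
        return True
```

For example, with `RateLimiter(limit=100, window=60)` the 101st request from the same IP within a minute would be rejected.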

Other custom crawlers impersonate one of the well-known spiders, typically Googlebot. Chances are there are no request limits in place for it; after all, who wants to risk being dropped from the Google index?

The solution: spider identification

That's where this software library comes in. Visitors that identify as one of the top search crawlers need to be verified.

For a long time, identification used to work by maintaining IP lists. Good news: for many years the major players have been supporting and advocating a zero-maintenance system. It works by validating the IP with a reverse DNS lookup, followed by a forward DNS lookup on the resulting hostname.
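The check can be sketched in a few lines of Python (this is an illustration of the scheme, not the Java library's API; the domain suffixes below are assumptions, so check each search engine's documentation for the authoritative list):

```python
import socket

# Hostname suffixes each crawler is documented to use (assumption:
# verify against the search engines' current documentation).
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def hostname_matches(hostname, suffixes):
    """Pure check: does the reverse-DNS hostname belong to the crawler's domain?"""
    return hostname.endswith(suffixes)


def verify_crawler(ip, suffixes):
    """Reverse-resolve the IP, check the domain, then forward-resolve the
    hostname and confirm it maps back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
    except socket.herror:
        return False
    if not hostname_matches(hostname, suffixes):
        return False
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)  # forward DNS
    except socket.gaierror:
        return False
    return ip in addresses
```

The forward lookup is what makes this spoof-proof: anyone can publish a reverse DNS record claiming to be googlebot.com, but only Google can make a googlebot.com hostname resolve back to that IP. Results should of course be cached per IP.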

For those search engines and sites that don't support this method, identification still needs IP lists.
History has shown that this way is problematic: the IP ranges change, and the lists always lag behind. Search for such lists on Google and you'll find totally outdated ones in the top results. Lists that still mention msnbot (now bingbot) and Yahoo! Slurp (now defunct).

A library, used as a dependency by many developers, hosted on GitHub where it's easy to contribute, hopefully solves that problem.

Actions taken on spoofers

What action you take when verification comes back negative is up to you: block the user, throttle the connection, display a Captcha, or alert the webmaster.

You can run the verification check live as the request comes in; there's an initial price you pay on the first request for the DNS lookups. Or you can run a watchdog on your webserver's logfiles and take action slightly delayed.

Call for action

Head over to https://github.com/optimaize/webcrawler-verifier and replace your custom Googlebot verification hack, based on outdated IP lists, with this dependency.

Monday, January 28, 2013

When to use Gmail's SMTP in your app and when not



This post is about the very legitimate automated emails generated by any application, such as for transactions and signups. Getting these delivered to the inbox (instead of spambox or nirvana) is the goal. 

The short answer: 

Sooner or later you may hit the sending limit. It won't be when you deploy, and maybe not while beta testing. So better to know the limits, risks and alternatives beforehand.

The longer answer:

If your sender address is hosted on Gmail then using their SMTP server is the obvious choice because:
  • You already have access to it
  • High availability
  • Unlikely that Gmail's smtp servers get added to block lists such as this one

How to connect to Gmail SMTP

All you need is the following information, and an SMTP library for your programming language:
  • host: smtp.gmail.com
  • port: 465
  • ssl: yes
  • user:
  • password:
The email address you use to log in must be a real account, not an alias. If you prefer to send from a different address, use the "from" field. That said, not modifying the "from" and "reply-to" increases your chances of getting delivered to the inbox.
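Putting those settings together, a minimal sketch with Python's standard library could look like this (addresses and credentials are placeholders):

```python
import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.gmail.com"
SMTP_PORT = 465  # implicit SSL


def build_message(sender, recipient, subject, body):
    """Assemble a plain-text message; keep From identical to the login
    account where possible, per the advice above."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send(user, password, msg):
    # SMTP_SSL opens the TLS connection right away (the port-465 style)
    with smtplib.SMTP_SSL(SMTP_HOST, SMTP_PORT) as server:
        server.login(user, password)
        server.send_message(msg)
```

Usage would be `send("me@example.com", "app-password", build_message("me@example.com", "user@example.com", "Welcome", "Thanks for signing up!"))`.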

You test it, email delivers, so you deploy your app, done.

Errors to expect

Soon after, your beta testers report that they don't receive any email. Checking your logs reveals errors such as "550 5.4.5 Daily sending quota exceeded." or "535, response: 5.7.1 Please log in with your web browser and then try again.". What happened?

3 types of Gmail accounts

There are three types of Gmail addresses:
  1. The common Gmail domains: anything @ gmail
  2. Google Apps for free: yourname @ yourdomain
  3. Google Apps pro: yourname @ yourdomain
If you go and sign up for a new Gmail address from your desktop, then try sending mail from your (remote) server location, you'll get the 535 error quickly (it's the typical spammer pattern). You need to verify your account by SMS, and mailing goes on. For a short moment. It appears that such accounts can only send to a handful of distinct email addresses per day. I was not able to find official statements and numbers. The limit is probably this low only for new accounts, so if you have an established one it may work longer.

If you have your own domain set up for Gmail, then the limit is higher. That makes sense, since you have a public whois record. Google disabled signups for the free Apps service a couple of weeks ago; that's probably why I cannot find official information about the limits. The number of recipients per day is quoted as 500 around the internet. If you have such an account already, you can continue using it.

For the paid account the official page says 2000 unique, external recipients per day. 

Other risks

The official page has another fact:
"The value of these limits may change without notice in order to protect Google’s infrastructure"
Also, I've found unofficial/unverified information about Google lowering the daily send limit on high bounce rates. This makes perfect sense; spammers have high bounce rates. This is an open door to malicious users of your app: sign up with a couple invalid addresses, and your email system may be interrupted for a while.


Using your own SMTP

If you decide now that Gmail SMTP is not for you, there are some things to consider with your own.

If you use your provider's SMTP, you may face similar limits there. After all, your provider has to make sure their customers don't spam, so that the whole server doesn't get blacklisted. But this can and does happen nevertheless: either because one of the other users spammed, or because an account was hijacked and abused. As a result, your email messages may be accepted by the SMTP server but never make it to their destination. Maybe you get bounces, maybe not.

Be sure to create a Sender Policy Framework record.
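An SPF policy is published as a DNS TXT record on your sending domain. If you relay through Google's servers, for example, the record would include them (the domain name here is a placeholder):

```
example.com.  IN  TXT  "v=spf1 include:_spf.google.com ~all"
```

`include:_spf.google.com` authorizes Google's outgoing mail servers; `~all` marks mail from any other server as a softfail.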

Conclusions

A combined approach

I still believe that using Gmail's outgoing mailserver has its advantages. They are reliable, and in case of denial they return clear status codes. A solution with Gmail as primary, and your server as fallback, sounds like a good idea to me.

Further reading

Google's Bulk Senders Guidelines has more useful information to get delivered.

Not for marketing

Given the limits and risks, I'd definitely not use Gmail for sending anything that could be marked as spam by the receivers: marketing, newsletters, even if the user at some point actively asked for it. Only send high-priority mail such as transaction confirmations through Gmail SMTP.

Thursday, November 1, 2012

Google: from Innovator to Average Quality Copycat?

Even though Google keeps shutting down products that did not take off or that don't pay off, they still pursue way too many new businesses and new products to become great at any one of them. Well, some are great. A few. But more and more are becoming just average.

Why is that?

Apparently Google has reached a critical size, and maintaining the existing stack demands huge resources. For developing new products, such as Google Drive, there just aren't enough a-list candidates available. And there is competition on the job market. While 10 years ago any candidate would choose Google over Microsoft, there are far more attractive players on the market now. Plus: starting your own business has never been cheaper.

At the same time other, agile players appear on the market, focusing on just one product.

Why do I use Google products even though there are viable alternatives?

  • Almost 100% uptime.
  • It's usually above average in functionality. 
  • The free version is not crippled beyond usefulness.
  • It's integrated with all the other products, same login, 1 click away.
  • "Don't be evil" motto. They don't run out of money and need to find new sources of income.
  • I don't want to learn a new product every 2 years. A new player might disappear quickly when running out of funding.

3 product categories

Products, not just for software, can be grouped into 3 categories:
  1. Those that make you freak out from time to time.
  2. Those that just work as they should.
  3. Those that, besides working, have some awesome things that you wouldn't even think of, but once you see them, instantly understand and love.
Microsoft products used to be in group 1. Your MS Word crashed and you didn't have the document stored? Windows blue screen?

Google was a breeze. Google products kept coming out with those category 3 features. Lots of innovation. Page rank. Text ads. Chrome is full of them. The software update process for example. Or how chrome opens tabs next to the current one instead of at the end. It took one instant, one click, and my thought was "Omg, why did no one think of this before!".

Enter Google Drive

Recently, Google products dropped to category 2. And the reason for this blog post is Google Drive, which is a clear category 1 product and should be labeled Beta at most. But the Windows app's "about" message gives it a stable version 1.5, and Wikipedia labels it "stable". It only came out 5 months ago, but because Google entered that profitable market late, they apparently can't afford to label an "online backup service" as beta at this time.

Google Docs was good. My experience with the Google Drive Windows tool has been bad so far. On the first day, after installing it and copying 2 GB of data into the folder, it could not synchronize all files (I had to retry several times), gave useless error messages ("an unknown issue occurred"), and crashed twice.
Now, a couple of weeks later, I added a new folder with a 50 MB file in it. Then I wanted to rename that folder, and Windows Explorer told me I couldn't because the file was locked (that must have been Google Drive). So instead (I did not wait) I created a new folder and just moved the file itself (that worked). Job done, I thought. The next day, after waking Windows up from sleep mode, Google Drive told me it had crashed. It doesn't restart automatically. So I'd better check whether my files are synchronized. As I drill down the folders, I notice that they all have the green checkmark icon (synchronized). Arriving at the leaf, surprise surprise, the file is not synchronized.

Only now do I realize that the web is full of articles about how Google Drive sucks, and of people claiming that they lost files. A 26% hate rate is devastating (whereas Dropbox reaches an almost 100% love rate from 600k people).

My experience with Dropbox

Way better. I've been using it for almost 2 years. Transfer rates are much slower, but that's not an important criterion for me. While using it almost daily, it never crashed, and I experienced only the following 2 issues:
  1. Sometimes, when opening an MS Excel document (yes I still have some of those), Dropbox updates the change timestamp. This sucks, I don't want an entry in the changelog.
  2. One time, after waking up another device that had not synchronized in weeks, Dropbox thought that some files were conflicting, even though no one had touched them in the meantime, and their content was identical too. What Dropbox did was keep both "versions", one with a different file name (conflict and timestamp). Annoying, but no data lost.

My conclusion: don't use Google Drive as a backup solution; have another backup somewhere. Or better yet, don't use Google Drive except for the Google Docs online documents.

A word to Google 

Please focus on the important products, innovate, and don't shut down the ones I'm using. And keep working on the robot car, I can't wait, the world will be a better place.

A word to Google Skeptics

This is good news for those who were concerned that Google would grow too big, take over the world, and know all about you. Nah. Remember the talks about breaking up Microsoft because it was too powerful, 12 years ago?