Sunday, April 5, 2015

Is that Googlebot user agent really from Google?

How do you know if the client identifying as Googlebot is actually from Google? And Bingbot? And all the others? On GitHub there's a newly published library, written in Java and built with Maven, that performs these checks for you. And why would you care?

There are different kinds of visitors to your website. Some webmasters don't mind any of the traffic, don't discriminate, and let them all pass. Others block or at least limit or throttle some of the requests. But why?
In the end it all boils down to the visitors you want, and those you don't. Those that bring you potential benefit, and those that only cost you traffic, or worse: duplicate your site content, harvest email addresses, or associate your site with bad neighborhoods.

There are two kinds of users visiting your site: humans, and robots. This article (and the software) is about the bots only.

The robots can be classified as follows:
- search engine crawlers: googlebot, bingbot, baiduspider, ...
- other purpose crawlers: archive.org_bot, ahrefsbot, ...
- custom crawlers: data scrapers, email harvesters, offline readers, ...

Search engines

Search engines operate crawlers to feed their indexes. You generally want to let these through because they drive visitors to your site. The more pages they index, the better for you.

However, there may be exceptions. Dan Birken has an interesting read on this: Baidu is a greedy spider that possibly gives you nothing in return, depending on your target audience.

And these days search engines don't only operate text search; there are also images, videos, and more. Some sites may have no interest in appearing in Google image search, and therefore lock out the Googlebot-Image crawler.

Other purpose crawlers

The web archive is a nice thing. Some webmasters choose not to have a public history of their site, and block the bot.

Common Crawl is another one that some don't mind and others forbid, depending on their site content.

Blocking bloodsuckers such as the ahrefsbot, a6, ADmantX and more has a larger fan base.

And finally blocking email harvesters should be on every webmaster's menu.

Custom crawlers

This is off-the-shelf or even custom-tailored software to grab parts of, or your entire, website. These bots are not run from a specific website or service provider; any script kiddie can fire them up. If you operate a blog on shared hosting with unlimited traffic, you probably don't mind. In the best case you end up being listed in some blogging directory. In a worse case, parts of your content appear remixed on an SEO spammer's fake portal, interspersed with links to sites of bad reputation.

The problem: impersonators

As long as all visitors identify with the correct and honest user-agent, the task is simple. You define in your /robots.txt which crawlers are allowed. Task accomplished. But as we all know (that's where you nod), the user-agent is just an HTTP header field that anyone can modify. Even and especially from PHP. Pun intended.

Some masquerade as a common web browser, e.g. Chrome or Firefox. It's a bit trickier to identify these, but there are ways.

The well-known spider trap trick can help: a link in the HTML to /dont-go-here that humans can't see thanks to CSS, and that nice bots won't follow because you forbid the area in the robots.txt. Then, if someone hits that page, he's either a vampire bot, a curious web developer who views the source, or a blind person agnostic to CSS.
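A minimal sketch of such a trap, assuming a hypothetical /dont-go-here path:

```
# robots.txt: honest bots read this and stay away
User-agent: *
Disallow: /dont-go-here
```

And somewhere in the HTML, a link no human will see:

```html
<a href="/dont-go-here" style="display:none">don't go here</a>
```

Anything requesting /dont-go-here despite both hints goes on the block list.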

Other ideas include limiting the amount of traffic per user (IP/network, session). Some sites present a Captcha after too many requests within a time window.
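A minimal per-client throttle could look like the sketch below; the class name and limit are made up, and a real implementation would also reset the counters periodically to form the time window:

```java
import java.util.HashMap;
import java.util.Map;

public class RequestThrottle {

    private final int maxRequestsPerWindow;
    private final Map<String, Integer> counts = new HashMap<>();

    public RequestThrottle(int maxRequestsPerWindow) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
    }

    /** Counts the request and tells whether the client is still within the limit. */
    public boolean allow(String ip) {
        int count = counts.merge(ip, 1, Integer::sum);
        return count <= maxRequestsPerWindow;
    }
}
```

Once allow() returns false, the site can throttle, block, or show the Captcha.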

Other custom crawlers impersonate one of the well-known spiders: Googlebot. Chances are there are no request limits in place for it. After all, who wants to risk being deleted from the Google index?

The solution: spider identification

That's where this software library comes in. Clients that identify as one of the top search engines' crawlers need to be verified.

For a long time, identification meant maintaining IP lists. Good news! For many years the major players have supported and advocated a zero-maintenance system. It works by validating the IP with a reverse and then a forward DNS lookup.
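In Java, the two lookups can be sketched with the JDK's InetAddress; the class and method names here are illustrative, and the accepted domain suffixes are the ones Google documents for Googlebot:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

public class GooglebotVerifier {

    /** Verified Googlebot hosts resolve to a name under googlebot.com or google.com. */
    static boolean isGoogleHostName(String host) {
        return host.endsWith(".googlebot.com") || host.endsWith(".google.com");
    }

    /**
     * Reverse-resolves the IP to a host name, checks the domain, then
     * forward-resolves that host name and requires the original IP among the results.
     */
    static boolean isRealGooglebot(String ip) {
        try {
            InetAddress addr = InetAddress.getByName(ip);
            String host = addr.getCanonicalHostName();
            if (!isGoogleHostName(host)) {
                return false; // reverse lookup failed or points elsewhere
            }
            return Arrays.asList(InetAddress.getAllByName(host)).contains(addr);
        } catch (UnknownHostException e) {
            return false;
        }
    }
}
```

Note that a failed reverse lookup makes getCanonicalHostName() return the textual IP, which fails the suffix check, so spoofers drop out at the first step.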

For those search engines and sites that don't support this method, identification still needs IP lists.
History has shown that this approach is problematic: the IP ranges change, and the lists always lag behind. Search for such lists on Google and you'll find totally outdated ones among the top results. Lists that still mention msnbot (now bingbot) and Yahoo! Slurp (now defunct).

A library, used as a dependency by many developers, hosted on Github where it's easy to contribute, hopefully solves that problem.

Actions taken on spoofers

What action you take if verification returns negative is up to you: block the user, throttle the connection, display a Captcha, or alert the webmaster.

You can run the verification check live as the request comes in; there's an initial price to pay on the first request for the DNS lookups. Or you can run a watchdog on your web server's logfiles and take action slightly delayed.

Call for action

Head over to and replace your custom Googlebot verification hack, based on outdated IP lists, with this dependency.

Thursday, January 29, 2015

Manual PHP Installation on Windows: for Masochists only?

Yesterday I needed to quickly apply some changes to a PHP library hosted on GitHub. I haven't used PHP since I switched to a new PC, so there's some setup to do. But for me as a former PHP developer it won't take too long, will it?

This article won't teach you how to install PHP on Windows. It illustrates my experience, the troubles I went through, and suggests improvements to the process.

The requirements: PHP and PHPUnit

Because I should test my code before committing, I need a runtime environment. IntelliJ IDEA (the big brother of PhpStorm) is here already. What's left is PHP and PHPUnit. The library I want to lay my hands on doesn't require the web: no web server, just scripting.

Step 1:  Install PHP


Choosing the right version

Visiting, I learn that the latest releases are 5.4.37, 5.6.5 and 5.5.21. Hrm. What to choose? The downloads page tells me that the current stable is 5.6.5, so probably that. Just to be sure, I'd like to know a bit more about the compatibility and feature sets of these versions. On the PHP page, including the documentation, I don't find that kind of data. Wikipedia is my friend: so the 5.6 branch it is.

There are more choices to be made on

32 or 64 bit? Thread safe or not? Zip or Debug Pack? Or download source code?
Ah, there's a helping hand on the left "Which version do I choose?"
Since there isn't an easy answer, I go with the rule-out logic to limit the choices.
  1. IIS? no.
  2. Apache? no.
  3. VC9 or VC11? There's only VC11 for PHP 5.6.5, so I'll need to install that runtime.
  4. Regarding TS vs NTS: since I want the CLI only, I'll go with NTS.
  5. x86 vs x64: I don't want experimental, so x86.
I guess we have a winner:

Installing it

After extracting the archive, the novice user might look for a setup or install exe. I also checked quickly, because years had passed since I last installed this thing. But no luck. So the good old install.txt file needs to be consulted. And by old I mean old. It seems that this file is an accumulated pile of information that grew in all directions. It's almost 2000 lines long. It explains how to handle PHP 4 in 26 places throughout the file. PHP 4 support ended in 2008, 7 years ago. It's a lot to read.

Now at this point I wish there was an installer. There are installers. I've used them in the past. But I don't want web server integration. Just a simple scripting environment. That's what PHP is, according to the first sentence of the front page:

PHP is a popular general-purpose scripting language that is especially suited to web development.

And the install instructions clearly advise me not to use an installer:


There are several all-in-one installers over the Internet, but none of
those are endorsed by, as we believe that the manual
installation is the best choice to have your system secure and
optimized.

Secure and optimized... I need a development environment.  

After some ini file renaming, environment variable modification, and installing the VC11 redistributable (twice; I took the wrong version on the first attempt), the php command would finally respond.

Step 2: Installing PHPUnit

There are instructions at and at and as always there isn't just one way to Rome.

I was successful with the PEAR way in the past, and JetBrains still lists that as the first option. But as of December 2014, PHPUnit stopped supporting it. Dead end.

After some trial and error with Composer, I downloaded the latest phpunit.phar and told IntelliJ to use it. But tests wouldn't run:
PHP Fatal error: Class IDE_PHPUnit_Framework_TestListener contains 1 abstract method and must therefore be declared abstract or implement the remaining methods
According to I can patch my IntelliJ (ugh!), but I'd rather try an older PHPUnit, as written here

Finding that was a bit of a challenge too, since has no link to older downloads. Here they are:

And finally the library's unit tests would execute in my IDE.



Such installation and configuration procedures are the reason why I hesitate each time before finally replacing my PC, or installing a newer OS. 

All in all it took me, a former PHP developer, 3 hours. Too many brain cycles were burned, too much frustration hit the desk. That was more than a flow stopper - it marked the end of the day.

I'm sure that experienced PHP people can't relate. They know exactly how it works, and are done setting up a new machine in no time. And that's the reason for my writing... try to put yourself into the shoes of a noob.

Does it have to be that complicated? Other scripting languages think not.

PHP used to be the easy alternative. Looking at this mess of installation, and putting it in relation to what one expects these days and what the alternatives offer, I have some serious doubts that some beginners even manage to get started.

But then again maybe it's good the way it is. (If you know what I mean. That's a sarcastic comment in case you don't.)



For manual installation, I believe that the install instructions should be cleaned up completely. A new, shortish, simple file, without the burden of all historical facts of how things once were. But that's just applying patches and fixes.

It's time for the PHP Group to provide or promote an official Windows installer. That's what would really fix the situation.

Wednesday, December 10, 2014

Graphical Visualizations in JavaDoc

Having code documentation is crucial for software. It helps developers understand the functions and APIs. Code that is well understood is easier to improve, and therefore more likely to live on instead of being thrown away in favor of a complete rewrite. It serves as the specification in case the code contradicts the text, and helps identify possible bugs.

Sometimes explaining a situation only by words is difficult and results in lengthy, hard to comprehend blathering. What options for visualizations are there?

Option 1: Images

Although most Javadoc is plain text, it permits HTML formatting, so images can be embedded. Put them into a doc-files folder and link to them:

 * Visualization:
 * <p><img src="doc-files/my-uml-diagram.png"/></p>

There is some cost involved in keeping this up to date: the next programmer touching the code needs access to the originals, and must have the graphing software (UML tool, image editor, ...) installed. This takes time.

Also, chances are your IDE will not show the image. 

Javadoc history and today's use

When Javadoc was created in the early 90s, source code comments were nothing new. The pioneering feature was the auto-generation of a technical documentation directly from the source code. Prior to such documentation generators, the technical documentation used to be (badly) maintained separate from the source code. Ouch. HTML was brand new, and the build putting those files on a web server was cutting edge technology.

Now, Javadoc is over 20 years old. At the time it was created, virtually all business software was closed source. Today, all software libraries we are using at the company are open source.
Software projects reorganized from lengthy release cycles to release-early-release-often. And software is written in many smaller, modular pieces instead of monolithic systems.

Fact is, I rarely look at Javadoc in the web browser. Most often I see it directly in the IDE. Whether it's one of our own ~40 software projects, or a library checked out from GitHub or from Maven central, or code from the JDK itself. It's just so much better to have a complete view; up to date source code and Javadoc.

Option 2: ASCII Art

An alternative to images is good old ASCII art. I personally use and recommend asciidraw.

Here's an example:

 * <pre>
 *                 +---------gn-hypo-+                +--------------gn-+
 *                 |                 | +------------> |                 |
 *                 | Osquitar        |                | Óscar           |
 *                 |                 |       +------> |                 |
 *                 +-----------------+       |        +-----------------+
 *                                           |
 *                                           |
 *                                           |
 *                 +---------gn-hypo-+       |
 *                 |                 | +-----+
 *                 | Osqui           |                +--------------gn-+
 *                 |                 | +-------+      |                 |
 *                 +-----------------+         +----> | Osgar           |
 *                                                    |                 |
 *                                                    +-----------------+
 * </pre>

The advantages:
  1. It's quick, create a drawing within minutes.
  2. It's super simple, no special knowledge required.
  3. No "file originals" nor special software required.
  4. Every developer can maintain it.
  5. Any IDE or editor is guaranteed to display it in line.
  6. Because of the technical limitations, one is forced to keep it simple and focus on the main components.


You may want to consider ascii visualizations for your Javadoc the next time you're struggling with explaining a circumstance. It works great for us.

Monday, September 22, 2014

Java SLF4J: Dynamic Log Level

In short

Problem: SLF4J doesn't support setting the log level at runtime.
Solution: use the Lidalia Extensions to SLF4J.

The whole story

My Java projects that aren't older than 2008 all use the same logging setup: SLF4J for the interface, Logback for the implementation. It's a widely accepted standard with benefits.

Today I needed to set the log level dynamically at run time, just like Log4j's logger.log(priority, message). But there's no such or similar method in the API. With one eyebrow raised I started typing into Google:

Ah, good, apparently I'm not the only one. There must be a simple solution. Stackoverflow already has the matching question. But this time the answer is a surprise:
"There is no way to do this with slf4j."
That's not a statement a programmer comes across often. I shared my reaction with this popular user comment:
"WTF? SLF4J is heralded as the future of java's logging, and this simply use case is not supported?"
The SLF4J API team has arguments not to include these methods.

I wasn't too much interested in the architectural design decisions. I just needed a quick and clean solution. And here is a nice one:

Lidalia Extensions to SLF4J "An extension to SLF4J allowing logging at a level determined at run, rather than compile, time via a Level enum."

It's small, available from Maven central, adds exactly the missing functionality, and works.




import uk.org.lidalia.slf4jext.Level;
import uk.org.lidalia.slf4jext.Logger;
import uk.org.lidalia.slf4jext.LoggerFactory;

public class Example {

    private static final Logger logger = LoggerFactory.getLogger(Example.class);

    public static void main(String[] args) {
        Level level = Level.valueOf(args[0]);
        logger.log(level, "Logged at a level configured at runtime");
    }
}

Tuesday, May 20, 2014

If your project is still hosted on SourceForge, I automatically assume it's dead.

SourceForge was the first free project hosting platform for open source software. It was immensely popular. It was the hub during the early open source years on the web - it was the GitHub of today.

Because of its historical popularity, there is still a large number of projects hosted there. Some are popular, some are dead - just like in any source code repository. Many developers have since moved their code to more modern platforms: at first to Google Code, now to GitHub.

Here are 3 reasons why you should move away from SourceForge.

1) SourceForge feels dead. Many active projects moved away. 

When I look for software, I don't mind too much where it's hosted. Maturity, stability, community size and active development are the factors I look for.

I don't advocate moving your whole project, with source code and bugtracker, every couple of years to the next hip kid on the block.

But the fact that so many moved on means that what's left is probably abandoned. When I search for software and both a GitHub and a SourceForge project come up, I tend to click the GitHub one first. Also, because of my search history, I probably get to see the GitHub links first. I'm certainly not alone here.

2)  Developers have GitHub accounts

Open Source needs contributors and users. And the easiest way for developers to pitch in or file a bug report is when they already know the system and have an account. If it's complicated, I pass.

3) Scam ads: this stinks!

Here's the reason for my blog post.

Once again I was downloading a (totally outdated and abandoned, but familiar and working) software that is still on SourceForge, and then I got to see this ad:

It leads to a suspicious looking website, with Microsoft Partner logo and Windows logo, offering Windows driver updates. The kind my mother would click, thinking it's from Microsoft.

And here's the profile on web of trust:

The site even manages to get a top rating on WoT by spamming good ratings, even though all the important comments say it's a scam. WoT, please fix.
"they say I need over a dozen drivers updated. Confirming with the manufacturer, I am up to date. ..." --pmheart6
"why do all the positive comments use poor english with typical chinese mistakes?? A dishonest sales trap. It's difficult to uninstall and must be done manually." --KITRVL

SF changed ownership in 2012, and as the Wikipedia article writes "More recently additional revenue generation schemes, such as bundleware models have been trialled, with the goal of further improving sourceforge's revenue."

This looks to me like beating a dead horse. There's only downhill from here.

Please spare other people from such frustration. Even if the platform was good once, now is the time to move the code away. To BitBucket or Google Code or GitHub or wherever. Thanks.

Tuesday, April 29, 2014

Java: .equals() or == on enum values?

Both do pretty much the same thing: checking whether two enum values are the same. And if you search your code base, you'll probably find both kinds. There are some subtle differences. What are they, and which syntax should you use?

Same same - but different

Enum values are guaranteed singletons - and thus the == reference comparison is safe. The enum itself just delegates to Object.equals(), which does exactly that.

So if they execute the same code, why should I bother?

Learning the hard way - actual bug

I produced a bug which broke the build. Luckily we did have a unit test in place that caught it. But still I would have seen it earlier, had I used the "correct" syntax.

So which one is it?

And the winner is: ==

It provides compile-time safety. While MyColor.BLACK.equals(acceptsAnything) gives a warning at most, the check MyColor.BLACK == something won't even compile if something is not an instance of MyColor.

This makes it refactoring-safe. If you go - like me - and change the type of something later on, you'll notice instantly.

Another difference is the support for null. The check with == allows null on both sides; it doesn't throw, but it may also "hide" a null value where there shouldn't be one.
The check with equals() allows null only on the right side, and throws a NullPointerException if the left side is null. That may be desired sometimes. Anyway, I vote for not letting null slip into this comparison at all; it should be checked/handled before, instead of being smart. (See my article about null handling.)
So the comparison with == is "safer" at run-time in that it never throws an NPE, but it possibly hides a bug when the left side should never have been null.
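A compact illustration of both properties, using a hypothetical enum:

```java
public class EnumCompare {

    public enum Color { BLACK, WHITE }

    public static void main(String[] args) {
        Color maybeNull = null;

        // == is null-safe: it evaluates to false instead of throwing
        System.out.println(maybeNull == Color.BLACK); // prints false

        // maybeNull.equals(Color.BLACK) would throw a NullPointerException here

        // and mixing types doesn't even compile:
        // Color.BLACK == "black"  ->  error: incomparable types
    }
}
```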

Then there's the argument of performance. It's irrelevant. I'm not going there.

And the visual aspect. Which one looks nicer?
When I can have compile-time safety, I don't care about the looks. I was used to .equals() simply because that's how Strings are compared. But in retrospect that's a pretty lame explanation.
On this StackOverflow question, which is exactly about this topic, Kevin Bourrillion from Guava commented that "== may appear incorrect to the reader until he looks at the types" and concludes that "In that sense, it's less distracting to read ".equals()". Au contraire! When I see CONSTANTS around == I instantly know they're either enums or primitives of matching types. If not, the code is either red because it doesn't compile, or carries a yellow warning from my intelligent IDE saying that .equals() should be used, for example with Strings or primitive wrappers like Integer.

After that incident at the company we've decided to go with reference equality comparison ==.

Tuesday, April 22, 2014

Java: Evolution of Handling Null References

Using null in programming is playing with fire. It's powerful and sometimes the right choice, but the infamous NullPointerException can sneak in quickly. That's why I advocate for every software project to include a section about handling null in the coding guidelines.

The first time I've seen a Java exception was in the 90s in the web browser's status bar when surfing the net: a NPE caused by an applet. I had no idea what it meant and what I had done wrong. Today I know that the programmer let a null reference slip in where it wasn't expected to happen.

When looking at open source software written in Java I often come across code where null references are used but not documented. Sometimes preconditions are in place. When stepping deeper it gets really tricky to figure out which variable is now allowed to be null and which is not. For the author it was obvious at the time of writing the code... but software becomes better when written and looked over by many, and thus it is important to make it clear for everyone, everywhere.

Software is constantly shipped with NPEs detected later on. Redeployments and bugfix releases are expensive - and it's a pity because this kind of bug could be eliminated almost completely.

Here's my personal evolution of dealing with null in Java.

Level 1: No Plan

Using it wherever, no information about it in the Javadoc.
Fixing NPEs as they occur, seeing no problem in that.
public class Foo {
    public String getText() {
        return null;
    }
}

Level 2: Document Null - Sometimes

Realizing that NPEs are a problem that could and should be avoided. Detecting them late is expensive, especially after deployment.
Starting to document variables, arguments and return values that can be null... incomplete.
public class Foo {
    /**
     * @return the text, or null
     */
    public String getText() {
        return null;
    }
}

Level 3: Add "Null" to Method Names

Realizing it's still a problem. Having Javadoc is nice, but useless when not read.
Starting to name methods with null such as getFooOrNull() to force it to be seen.
public class Foo {
    /**
     * @return the text, or null
     */
    public String getTextOrNull() {
        return null;
    }
}

Level 4: Code Annotations

Using Jetbrains' null annotations: Annotating all variables with @Nullable and @NotNull. This is a huge step forward. The crippled method name pattern 'OrNull' is obsolete. The code is automatically documented.
public class Foo {
    @Nullable
    public String getText() {
        return null;
    }
}
And the best part: the IDE checks the code and warns on a mistake.

Using these annotations strictly I don't remember causing a single NPE in years. The drawback is more typing, more text on the screen. But then we use static typing, and not defining null is just incomplete.

Level 5: Guava's Optional

The concept: Instead of using the null reference directly, wrap it in another object that permits null. The getter methods on it myOptional.get(), myOptional.orNull() and myOptional.or(alternative) force the user to think about what to do when it's null. The API becomes extremely clean.

If you don't know about it yet, read the Guava page about using and avoiding null.

It took me a couple of days to get used to this. And I did produce a handful of bugs initially because I've used Optional.of() instead of Optional.fromNullable() by mistake.

Although step 4 with annotations already got rid of NPEs and improved the code confidence, this was another big step forward. The API of the Optional class is clean, and the API of code using it is consistent and clear. A developer only has to learn the concept once, and because it's from Guava, sooner or later everyone will know it.

The Guava team uses @Nullable annotations for the few places where null is permitted. Anywhere else there are no annotations.

Level 6: Java 8 has Optional built in

Oracle has a nice article on it. The API is slightly different, and I have not made the transition to Java 8 yet.

Finger Pointing

The bugtracker for the Glassfish software currently has 1309 matches for NullPointerException. Wow. (Go to the issue navigator, select the 11 projects on the left starting with "glassfish", and add the query NullPointerException, hit enter. The software is session-based, can't paste a link...)

The Grizzly project has 69.

This is a codebase that's still on level 1 regarding null. Exactly 2 years ago I had recommended to improve the Javadoc and start using null annotations. The task was accepted, got priority "Major" assigned, but is still open and things haven't changed.

I had also mentioned in that task the undocumented method int getQueueLimit() which would return the magical -1 for 'no limit'. Nowadays - being a level 5 guru ;-) - I'd instantly turn this into a self-documenting Optional. This forces users to think about the exceptional case. No horrible bugs from computations with -1 can occur. Users would then just do something like getQueueLimit().or(Integer.MAX_VALUE), or whatever suits their case - short, clear and safe.
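Sketched with Java 8's java.util.Optional (Guava's or()/orNull() behave analogously); the class here is a made-up stand-in for the Grizzly code, only the getter name comes from the task:

```java
import java.util.Optional;

public class QueueConfig {

    private final Integer queueLimit; // null means: no limit configured

    public QueueConfig(Integer queueLimit) {
        this.queueLimit = queueLimit;
    }

    /** The former "int, -1 means unlimited" getter, now self-documenting. */
    public Optional<Integer> getQueueLimit() {
        return Optional.ofNullable(queueLimit);
    }

    /** Callers decide explicitly what "no limit" means for them. */
    public int effectiveLimit() {
        return getQueueLimit().orElse(Integer.MAX_VALUE);
    }
}
```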

My Recommendations

1) Avoid Null when possible.

Consider the null object pattern.

Return empty collections, not null.

Throw UnsupportedOperationException() instead of returning null.
Don't do this:
public Object foo() {
     return null; //todo
}
It is done sometimes as a quick way to satisfy the compiler. It's the no-impl pattern.

Do this instead:
public Object foo() {
     throw new UnsupportedOperationException(); //todo
}
If you return null then it (null) can go a long way until it throws a nasty NullPointerException somewhere. By throwing UOE directly the cause is clear in the stack trace.

2) Use Guava's Optional.

Wherever you would use null, use Optional. 
Except in very low level code.

3) Use Jetbrains's Null Annotations.

If allowing null then use the @Nullable annotation. This is a must. It automatically tells the user and the IDE that null is possible/permitted.

Use the @NotNull annotation everywhere else. This is optional. The Guava people do not, I do.

To get the annotations using Maven:
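The coordinates below are a sketch from memory; check Maven Central for the current artifact and version (older releases were also published as com.intellij:annotations):

```xml
<dependency>
    <groupId>org.jetbrains</groupId>
    <artifactId>annotations</artifactId>
    <version>13.0</version>
</dependency>
```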