Friday, September 11, 2015

IntelliJ Feature Request: Find Escaped Strings


IntelliJ has some automatic String escaping built in.

IntelliJ automatic String escaping on copy/paste


Example: pasting TO IntelliJ
Write some code:
String s = "";
Then paste this sequence between the double quotes: <li><a href="#">
What you have is:
String s = "<li><a href=\"#\">";
IntelliJ intelligently escaped the double quotes for you.

Example: copying FROM IntelliJ
Select the string literal above, including the surrounding double quotes ("<li><a href=\"#\">"), copy it, and paste it into Notepad. What you get is exactly that.
Select only the string content, excluding the double quotes (<li><a href=\"#\">), copy it, and paste it into Notepad. What you get is the string unescaped: <li><a href="#">

That's pretty smart, and almost always what I want.


IntelliJ automatic String escaping on search


Example: ctrl-alt-f searching for <li><a href="#"> does not find our code String s = "<li><a href=\"#\">";



Here's an idea for the IDEA team: how about a checkbox "Find escaped strings"?
Of course, in practice it'll get a bit more complicated than just handling the double quotes, depending on the programming language: JavaScript has single quotes, there are regular expression and SQL escapes, etc. And I'd want to be careful with the search-and-replace functionality (or just not offer this feature there).
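
To make the idea more concrete, here is a small sketch (my own illustration, not any actual IDEA API) of what the checkbox could do for the Java case: besides the plain term, the search would also try the term as it appears inside a string literal.

public class EscapedSearchSketch {

    // Turns a plain search term into its Java-string-literal escaped form.
    static String escapeForJavaLiteral(String plain) {
        return plain
                .replace("\\", "\\\\")   // escape backslashes first
                .replace("\"", "\\\"");  // then escape double quotes
    }

    public static void main(String[] args) {
        // The checkbox would make the search also match this variant:
        System.out.println(escapeForJavaLiteral("<li><a href=\"#\">"));
        // prints: <li><a href=\"#\">
    }
}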


Conclusion

The Java editor has already proven that it can handle the escapes to the advantage of the developer. So why not extend that to search?

Wednesday, June 17, 2015

French OVH vs German Hetzner: the Stereotypes Are Alive

This is an open letter describing my experiences and offering some suggestions to the French hosting company OVH.

I have had two cheap VPSes at OVH for a year now. (And services from Hetzner for longer.)

Every once in a while, one of the virtual private servers at OVH is unreachable. Today was such a case: no information from OVH announcing the downtime, no alert while it was offline. My own monitoring shows one VPS unreachable for 1.5 hours in the afternoon. 1.5 hours. I wonder what takes so long to fix.

So let's check the OVH status:

 
D'Accord.

I prefer how the competing German ISP Hetzner keeps its status page on a completely separate domain, http://www.hetzner-status.de/, because maybe something is wrong with the primary domain or DNS record or whatever, and then all hell breaks loose.

Quite a while later I was able to get that page, here's how it looks:



The colors look scary, issues everywhere. But that's probably explained by the sheer size of the ISP.

Although you can't see it in the screenshot, I have clicked the VPS category and therefore filtered by it.

Unfortunately, with the information given in the table, I am not able to identify which of the many incidents could affect my VPS. So let's log in to the admin panel of my hosting account.

There is no alert message here either. After a while I'm able to find charts that prove the downtime, and they indicate a restart:



The CPU was off, then there's a bump (the boot). RAM was off too, and traffic was zero.

Recommendation to OVH:

  1. The backend tells me the VPS is in zone vz-rbx2-005. If you listed that in the status page table, I'd have a chance of identifying the issues that affect me.
  2. Even better, show the issue in the customer control panel overview, and on the VPS overview page. With a nice history of all past issues.
  3. If it was a predictable/planned downtime, inform customers beforehand. Hetzner does this. The last notice was about replacing a switch. It came one week (!) in advance. And in the end there was zero downtime. German planning.
  4. During downtime, send a notification pointing to the task. At least if it is not resolved quickly.
  5. After prolonged downtime (1.5 hours!) and an OS reboot (!), how about sending a notification? Saying that it happened, what the cause was, and what you intend to do to prevent it in the future. After finishing your croissant, maybe.

Regarding email, OVH used to send alerts. There was a period with annoying alerts about high CPU load caused by Rsyslog on Ubuntu. OVH fixed it and they are gone. Other than that, I have not received any alerts.

Out of curiosity, let's check the uptime of my second OVH virtual server.

> uptime
> up 2 days, 54 min,  1 user,  load average: 0.08, 0.03, 0.01

Quelle surprise! Considering that I have not touched this server in months...


Final words

I have no affiliation with either of the two providers, other than being a customer of both.

At OVH, everything feels a bit "French". The admin panels: confusing, one redirecting to the next and back. Parts are translated, parts appear in Français. The website. The payment process. I would not host anything uptime-critical there.

Hetzner is not perfect. In a subjective classification I'd give them a "good to very good" - for my needs. Great value for the price. And I don't see anyone else on the horizon with a better service offering at competitive costs.

If you need a place for a game server, or other leet services, please choose OVH with their DDoS protection. I'd rather have you not too close to my important machines ;-)

If anyone else asks, then I personally recommend Hetzner.

Tuesday, May 5, 2015

Interpretations of end-of-file, and Linux ad hoc open heart surgery

Today's occasional Linux sysadmin task for a Java developer (me) is to loop over all lines in a text file and run a task based on each. Can't be that hard, can it?

Off we go the usual route:

1) Google the task. Aha, looks like a common requirement.

2) Find the usual stackoverflow, askubuntu, superuser and some other forums and blog posts.

And the offered snippets look simple and clear:

while read line; do
    echo "line is: $line"
done </my/file.txt

or alternatively

cat /my/file.txt | while read line; do
    echo "line is: $line"
done

And it works! Mostly. Occasionally the last line is missing. This happens when the file does not end with a newline character. A detailed explanation of the problem is here: http://stackoverflow.com/questions/12916352/shell-script-read-missing-last-line

So finally here's a snippet that works for all files:

# the extra [ -n "$line" ] test keeps the last line even when
# the file does not end with a newline character
cat /tmp/testfile | while read line || [ -n "$line" ]; do
    echo "line is: $line"
done

 

Fragile solutions


This most recent experience is very much like so many others I've had with Linux system administration tasks. Faced with a seemingly simple problem, I find many solutions, but there are hidden pitfalls. An implemented solution must be stable, and not depend on assumptions and best practices and current conditions. That's what we practice in programming. Why not in system administration?

In this case one can argue that a text file must end with a newline character. Some pro arguments:
  • Apparently there's some C specification (from a time before you were born).
  • When saving a text file in the Linux editor vi, the newline is appended automatically. (vi also being from the 70s)
  • When writing a file from code there's usually a println(string) method. This also results in a trailing newline.
But there's also contra:
  • Files can come from other sources, for example from Notepad++ on Windows, where the trailing newline can be missing.
  • Or if the file is generated by a program, it's easy for a developer to change some logic and remove the trailing newline, without being aware of the consequences.
  • In PHP, if you include a script file that ends with a newline after the closing "?>" end tag, it sends white space to the client, and prevents you from adding header()s. You must break that C standard. (PHP is written in C you see ...)
  • It just sounds like a silly specification, way too easy to break, and not necessary.
Ask yourself: if you were writing the specification for where a file ends, would you choose "at the last newline character" or "where the last bit ends"?


Stackoverflow helps but pay attention to the comments


Another case I recently ran into was with a Linux cleanup process. The task definition was simple again: "delete all but the most recent 5 files from a folder matching a file name pattern". An accepted solution is here http://stackoverflow.com/questions/25785/delete-all-but-the-most-recent-x-files-in-bash with plenty of upvotes. The problems, which are not mentioned in the answer:
  • If there are only 5 or fewer files, it deletes them all
  • It doesn't work if there are folders present
  • It doesn't work with file names containing spaces
Over time I've learned to pay attention to the user comments. One saying "This one fails if there are no files to delete." made me suspicious. Again, it doesn't fail; the command goes through like a knife through soft butter... no complaint, it just removes them all.

Imagine this scenario (mine): it's the backup folder, and the last 5 backups should be kept. For some reason the backup process fails, and no new files are created. The separate cleanup script still runs nightly and removes all but the last 5 files. With this little glitch in the script you'll end up with no backup at all.
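
For comparison, here is how the same cleanup could look as a small program instead of a one-liner. It is only an illustrative sketch (the directory, file name prefix and threshold are made up), but it shows the checks the shell command lacks: folders are skipped, names with spaces are fine, and with 5 or fewer files nothing gets deleted.

import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BackupCleanup {

    static void keepNewest(File dir, String prefix, int keep) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            throw new IllegalStateException("Not a readable directory: " + dir);
        }
        List<File> files = new ArrayList<>();
        for (File f : entries) {
            if (f.isFile() && f.getName().startsWith(prefix)) { // skip folders
                files.add(f);
            }
        }
        files.sort(Comparator.comparingLong(File::lastModified)); // oldest first
        // Delete only the files beyond the newest 'keep'; with <= keep files this loop does nothing.
        for (int i = 0; i < files.size() - keep; i++) {
            if (!files.get(i).delete()) {
                throw new IllegalStateException("Could not delete " + files.get(i));
            }
        }
    }

    public static void main(String[] args) {
        keepNewest(new File("/backups"), "backup-", 5);
    }
}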


All ad hoc, open heart surgery


Linux sysadmin is not in my job description. I see the situation from a bit of distance. What I realize is that programming has come a long way in the last 20 years. Open source libraries are used instead of quickly hacked untested functions. Everything is version controlled. Nothing gets deployed without thorough testing. Pair programming. Code reviews.

System administration? Still the same. Copy pasting some commands found on the internet. On production machines. Works? No? Try another one. Works? Fine. Document changes? Nah.

Any sysadmin of 5+ years who hasn't locked himself out of an SSH server by misconfiguring sshd or iptables in da house, please stand up.


Conclusions


It's easier these days to find answers and solutions on the internet, thanks to the Q&A format of Stack Exchange. But even more than in programming, in sysadmin it's important to read the comments and the secondary answers.

At the company we stick to some simple rules:
  • Changes made to machines must be documented. Redmine works well for that, for tasks and especially the wiki.
  • Mission critical machines have a sibling.
  • System changes on production machines must be applied one at a time, with a few days in between.
  • Most apps run on VMs. It's not like git but it's as good as it gets for the time being.


End-of-file Marker


Back to the initial task of looping over a file to the end: when reading mission-critical, changeable files in Java, we use an end-of-file marker "eof". If the file ends with that line, we can be sure the file was read completely. If not, it could be broken, and the program throws an exception.
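
Here is a minimal sketch of that idea; the class and method names are made up for this post, it is not our production code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Reads a file and only returns its lines if the last line is the literal "eof" marker.
public class EofMarkerReader {

    static List<String> readComplete(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file);
        if (lines.isEmpty() || !lines.get(lines.size() - 1).trim().equals("eof")) {
            throw new IllegalStateException(file + " does not end with the eof marker, refusing to process it");
        }
        return lines.subList(0, lines.size() - 1); // the content, without the marker line
    }

    public static void main(String[] args) throws IOException {
        for (String line : readComplete(Paths.get("/tmp/testfile"))) {
            System.out.println("line is: " + line);
        }
    }
}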





Sunday, April 5, 2015

Is that Googlebot user agent really from Google?

How do you know if the client identifying as Googlebot is actually from Google? And Bingbot? And all the others? On GitHub there's a newly published library, written in Java with Maven, that performs the checks for you. And why would you care?

There are different kinds of visitors to your website. Some webmasters don't mind any of the traffic, don't discriminate, and let them all pass. Others block or at least limit or throttle some of the requests. But why?
In the end it all boils down to the visitors you want, and those that you don't want. Those that bring you potential benefit, and those that only cost you traffic and, possibly worse, duplicate your site content, collect email addresses, or associate your site with a bad reputation.

There are two kinds of users visiting your site: humans, and robots. This article (and the software) is about the bots only.

The robots can be classified as follows:
- search engine crawlers: googlebot, bingbot, baiduspider, ...
- other purpose crawlers: archive.org_bot, ahrefsbot, ...
- custom crawlers: data scrapers, email harvesters, offline readers, ...

Search engines

Search engines operate crawlers to feed their indexes. You generally want to let these through because they drive visitors to your site. The more pages they index, the better for you.

However, there may be exceptions. Dan Birken has an interesting read on this: Baidu is a greedy spider that possibly gives you nothing in return, depending on your target audience.

And these days search engines don't only operate text search; there are also images, videos, and more. Some sites may have no interest in appearing in Google image search, and therefore lock out the Googlebot-Image crawler.

Other purpose crawlers

The web archive is a nice thing. Some webmasters choose not to have a public history of their site, and block the bot.

Common Crawl is another one that some don't mind, and others forbid, depending on their site content.

Blocking bloodsuckers, including the ahrefsbot, a6, ADmantX and more, has a larger fan base.

And finally blocking email harvesters should be on every webmaster's menu.

Custom crawlers

This is off-the-shelf or even custom-tailored software used to grab parts of, or your entire, website. These bots are not run from a specific website or service provider. Any script kiddie can fire them up. If you operate a blog on shared hosting with unlimited traffic, you probably don't mind. In the best case you end up being listed in some blogging directory. In a worse case, parts of your content appear remixed on an SEO spammer's fake portal, along with links to sites of bad reputation.

The problem: impersonators

As long as all visitors identify with the correct and honest user-agent, the task is simple. You define in your /robots.txt which crawlers are allowed. Task accomplished. But as we all know (that's where you nod), the user-agent is just an HTTP header field that anyone can modify. Even and especially from PHP. Pun intended.

Some masquerade as a common web browser, e.g. Chrome or Firefox. It's a bit more tricky to identify these, but there are ways.

The well-known spider trap trick can help: a link in the HTML to /dont-go-here that humans can't see thanks to CSS, and that nice bots won't follow because you forbid the area in robots.txt. Then, if someone hits that page, he's either a vampire bot, a curious web developer who views the source, or a blind person agnostic to CSS.

Other ideas are limiting the amount of web traffic per user (IP/network, session). Some sites present a Captcha after too many requests in a given time span.
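
Such a limit can be as simple as a counter per client IP in a fixed time window. The sketch below is only an illustration (the threshold and the one-minute window are arbitrary choices):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Counts requests per client IP in a fixed one-minute window.
public class RequestLimiter {

    private static final int MAX_REQUESTS_PER_MINUTE = 120;

    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();
    private long windowStart = System.currentTimeMillis();

    // Returns true if the request from this IP should still be served.
    public synchronized boolean allow(String ip) {
        long now = System.currentTimeMillis();
        if (now - windowStart > 60_000) { // start a fresh window
            counters.clear();
            windowStart = now;
        }
        int count = counters.computeIfAbsent(ip, k -> new AtomicInteger()).incrementAndGet();
        return count <= MAX_REQUESTS_PER_MINUTE; // over the limit: block, throttle, or show a Captcha
    }
}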

Other custom crawlers impersonate one of the well-known spiders, usually Googlebot. Chances are there are no request limits in place for those. After all, who wants to risk being removed from the Google index?

The solution: spider identification

That's where this software library comes in. Clients that identify as one of the top search engines need to be verified.

For a long time, identification meant maintaining IP lists. Good news! For many years now, the major players have supported and advocated a zero-maintenance system. It works by validating the IP with a reverse and then a forward DNS lookup.
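
For Googlebot the check looks roughly like the sketch below, using only the JDK; the webcrawler-verifier library packages this kind of check for the various crawlers so you don't have to write it yourself. The sample IP is just an example of a client claiming to be Googlebot.

import java.net.InetAddress;
import java.net.UnknownHostException;

// Verifies a claimed Googlebot by reverse DNS (IP -> host) and forward DNS (host -> IPs).
public class GooglebotCheck {

    static boolean isRealGooglebot(String ip) {
        try {
            InetAddress address = InetAddress.getByName(ip);
            // Reverse lookup; returns the IP string itself if there is no PTR record.
            String host = address.getCanonicalHostName();
            if (!host.endsWith(".googlebot.com") && !host.endsWith(".google.com")) {
                return false;
            }
            // Forward lookup: the host name must resolve back to the original IP.
            for (InetAddress resolved : InetAddress.getAllByName(host)) {
                if (resolved.getHostAddress().equals(address.getHostAddress())) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException e) {
            return false; // unresolvable: treat as not verified
        }
    }

    public static void main(String[] args) {
        System.out.println(isRealGooglebot("66.249.66.1")); // an IP that claims to be Googlebot
    }
}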

For those search engines and sites that don't support this method, identification still needs IP lists.
History has shown that this approach is problematic. The IP ranges change, and the lists always lag behind. Search for such lists on Google and you'll find totally outdated ones at the top: lists that still mention msnbot (now Bingbot) and Yahoo! Slurp (now defunct).

A library, used as a dependency by many developers and hosted on GitHub where it's easy to contribute, hopefully solves that problem.

Actions taken on spoofers

What action you take if verification returns negative is up to you. Block the user, throttle the connection, display a Captcha, alert the webmaster.

You can run the verification check live as the request comes in. There's an initial price you pay on the first request for the DNS lookups. Or you can run a watchdog on your webserver's logfiles and take action a bit delayed.

Call for action

Head over to https://github.com/optimaize/webcrawler-verifier and replace your custom Googlebot verification hack based on outdated IP lists with this dependency.

Thursday, January 29, 2015

Manual PHP Installation on Windows: for Masochists only?


Yesterday I needed to quickly apply some changes to a PHP library hosted on GitHub. I haven't used PHP since I switched to a new PC, so there's some setup to do. But for me as a former PHP developer it won't take too long, will it?

This article won't teach you how to install PHP on Windows. It illustrates my experience, the troubles I went through, and suggests improvements to the process.


The requirements: PHP and PHPUnit


Because I should test my code before committing, I need a runtime environment. IntelliJ IDEA (the big brother of PhpStorm) is already here. What's left is PHP and PHPUnit. The library I want to lay my hands on doesn't require the web: no web server, just scripting.


Step 1:  Install PHP

 

Choosing the right version

Visiting php.net, I learn that the latest releases are 5.4.37, 5.6.5 and 5.5.21. Hrm. What to choose? The downloads page http://php.net/downloads.php tells me that the current stable is 5.6.5, so probably that. Just to be sure, I'd like to know a bit more about the compatibility and feature sets of these versions. On the PHP page, including the documentation, I don't find that kind of data. Wikipedia is my friend: http://en.wikipedia.org/wiki/PHP#Release_history so the 5.6 branch it is.

There are more choices to be made at http://windows.php.net/download#php-5.6


32 or 64 bit? Thread safe or not? Zip or Debug Pack? Or download source code?
Ah, there's a helping hand on the left "Which version do I choose?"
Since there isn't an easy answer, I go with the rule-out logic to limit the choices.
  1. IIS? no.
  2. Apache? no.
  3. VC9 and VC11? There's only VC11 for PHP 5.6.5, so I'll need to install that.
  4. Regarding TS vs NTS: since I want the CLI, I'll go with NTS.
  5. x86 vs x64: I don't want experimental, so x86.
I guess we have a winner: php-5.6.5-nts-Win32-VC11-x86.zip

Installing it

After extracting the archive, the novice user might be looking for a setup or install exe. I also checked quickly, because years have passed since I last installed this thing. But no luck. So the good old install.txt file needs to be consulted. And by old I mean old. The file is an accumulated pile of information that has grown in all directions. It's almost 2000 lines long, and it explains how to handle PHP 4 in 26 places throughout. PHP 4 support ended in 2008, seven years ago. It's a lot to read.

Now at this point I wish there was an installer. There are installers. I've used them in the past. But I don't want web server integration. Just a simple scripting environment. That's what PHP is, according to the first sentence of the front page:

PHP is a popular general-purpose scripting language that is especially suited to web development.

And the install instructions clearly advise me not to use an installer:

Warning

There are several all-in-one installers over the Internet, but none of
those are endorsed by PHP.net, as we believe that the manual
installation is the best choice to have your system secure and
optimised.

Secure and optimized... I need a development environment.  

After some ini file renaming, environment variable modification, and installing the VC11 redistributable (twice, I took the wrong version on the first attempt), the php command would finally respond.


Step 2: Installing PHPUnit


There are instructions at https://www.jetbrains.com/idea/help/enabling-phpunit-support.html and at https://phpunit.de/manual/current/en/installation.html and, as always, there is more than one road to Rome.

I was successful with the PEAR way in the past, and JetBrains still lists that as the first option. But as of December 2014, PHPUnit stopped supporting it. Dead end.

After some trial and error with Composer, I downloaded the latest phpunit.phar and told IntelliJ to use it. But the tests wouldn't run:
PHP Fatal error: Class IDE_PHPUnit_Framework_TestListener contains 1 abstract method and must therefore be declared abstract or implement the remaining methods
According to http://stackoverflow.com/questions/22799619/intellij-idea-wont-run-phpunit-4-0-tests/22799620 I could patch my IntelliJ (ugh!); instead I'd rather try an older PHPUnit, as described here: http://stackoverflow.com/questions/22531584/how-to-setup-phpunit-for-intellij-idea-13-1-on-ubuntu-13-10

Finding that was a bit of a challenge too, since https://phpunit.de/ has no link to older downloads. Here they are: https://phar.phpunit.de/
 



And finally the library's unit tests would execute in my IDE.

 

Conclusions



Such installation and configuration procedures are the reason why I hesitate each time before finally replacing my PC, or installing a newer OS. 

All in all it took me, a former PHP developer, 3 hours. Too many brain cycles were burned, too much frustration hit the desk. That was more than a flow stopper - it marked the end of the day.

I'm sure that experienced PHP people can't relate. They know exactly how it works, and are done setting up a new machine in no time. And that's the reason for my writing... try to put yourself into the shoes of a noob.

Does it have to be that complicated? Other scripting languages think not.

PHP used to be the easy alternative. Looking at this installation mess, and comparing it to what one expects these days and what the alternatives offer, I have serious doubts that some beginners even manage to get started.

But then again maybe it's good the way it is. (If you know what I mean. That's a sarcastic comment in case you don't.)

 

Suggestions



For manual installation, I believe that the install instructions should be cleaned up completely. A new, shortish, simple file, without the burden of all historical facts of how things once were. But that's just applying patches and fixes.

It's time for the PHP Group to provide or promote an official Windows installer. That's what would really fix the situation.