Tuesday, May 5, 2015

Interpretations of end-of-file, and Linux ad hoc open heart surgery

Today's occasional Linux sysadmin task for a Java developer (me) is to loop over all lines in a text file and run a task based on each. Can't be that hard, can it?

Off we go the usual route:

1) Google the task. Aha, looks like a common requirement.

2) Find the usual stackoverflow, askubuntu, superuser and some other forums and blog posts.

And the offered snippets look simple and clear:

while read line; do
    echo "line is: $line"
done </my/file.txt

or alternatively

cat /my/file.txt | while read line; do
    echo "line is: $line"
done

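And it works! Mostly. Occasionally the last line is missing. This happens when the last line of the file is not terminated by a newline character: read then hits end-of-file, returns a non-zero status, and the loop ends before the body runs for that line. A detailed explanation of the problem is here: http://stackoverflow.com/questions/12916352/shell-script-read-missing-last-line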
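
The effect is easy to reproduce. What follows is a minimal sketch, not taken from the linked answer; the function name count_lines_naive and the temporary file are made up for illustration:

```shell
# Create a two-line file whose last line has no terminating newline.
demo_file=$(mktemp)
printf 'first\nsecond' > "$demo_file"

# Count lines with the plain "while read" loop.
count_lines_naive() {
    n=0
    while read -r line; do
        n=$((n + 1))
    done < "$1"
    echo "$n"
}

count_lines_naive "$demo_file"   # prints 1
```

The count comes out as 1 although the file holds two lines of text. On the last call, read does store "second" in the variable, but it reports failure because it reached end-of-file before a newline, so the loop body is skipped.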

So finally here's a snippet that works for all files:

cat /tmp/testfile | while read line || [ -n "$line" ]; do
    echo "line is: $line"
done
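
If the lines have to be passed on exactly as they are in the file, two more details matter: a plain "read line" strips leading and trailing whitespace and treats backslashes as escape characters. A slightly more defensive sketch, with the hypothetical helper names for_each_line and show, could look like this:

```shell
# for_each_line FILE COMMAND
# Runs COMMAND once per line of FILE, passing the line as its only argument.
for_each_line() {
    # IFS= keeps leading/trailing whitespace, -r keeps backslashes literal,
    # and the [ -n ... ] test catches a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        "$2" "$line"
    done < "$1"
}

# Example command: print the line the same way as the snippets above.
show() {
    printf 'line is: %s\n' "$1"
}

# usage: for_each_line /my/file.txt show
```

One assumption to keep in mind: the command inherits the file as its standard input, so it should not read from stdin itself.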

 

Fragile solutions


This most recent experience is very much like so many others I've had with Linux system administration tasks. Faced with a seemingly simple problem, I find many solutions, but there are hidden pitfalls. An implemented solution must be stable, and not depend on assumptions and best practices and current conditions. That's what we practice in programming. Why not in system administration?

In this case one can argue that a text file must end with a newline character. Some pro arguments:
  • Apparently the C standard requires a non-empty source file to end with a newline, and POSIX defines a line as a sequence of characters terminated by one (both from a time before you were born).
  • When you save a text file in the Linux editor vi, the newline is appended automatically. (vi also dates from the 70s.)
  • When writing a file by code there's usually a println(string) method. This also results in the trailing newline.
But there's also contra:
  • Files can come from other sources, for example from Notepad++ on Windows, where the trailing newline can be missing.
  • Or if the file is generated by a program, it's easy for a developer to change some logic and remove the trailing newline, without being aware of the consequences.
  • In PHP, if you include a script file that ends with a newline after the closing "?>" end tag, it sends white space to the client, and prevents you from adding header()s. You must break that C standard. (PHP is written in C you see ...)
  • It just sounds like a silly specification, way too easy to break, and not necessary.
Ask yourself the question: if you were writing the specification for where a file ends, would you choose "the last newline character" or "where the last bit ends"?


Stackoverflow helps but pay attention to the comments


Another case I recently ran into was with a Linux cleanup process. The task definition was simple again: "delete all but the most recent 5 files from a folder matching a file name pattern". An accepted solution is here http://stackoverflow.com/questions/25785/delete-all-but-the-most-recent-x-files-in-bash with plenty of upvotes. The problems, which are not mentioned in the answer:
  • If there are only 5 or fewer files, it deletes them all
  • It doesn't work if there are folders present
  • It won't work with file names with spaces
In time I've learned to pay attention to the users' comments. And one saying "This one fails if there are no files to delete." made me suspicious. Again, it doesn't fail: the command goes through like a knife cuts through soft butter... no complaint, it just removes them all.

Imagine this scenario (mine): It's the backup folder, and the last 5 backups should be kept. For some reason the backup process fails, and no new files are created. The separate cleanup script still runs nightly and removes all but the last 5 files. With this little glitch in the script you'll end up having no backup at all.
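
For comparison, here is a sketch of a cleanup that avoids the three problems listed above. It is not the Stack Overflow answer, and it rests on assumptions: GNU find (for -maxdepth and -printf), file names that contain no newline characters, and a function name keep_newest that I made up for illustration.

```shell
# keep_newest DIR PATTERN KEEP
# Deletes regular files in DIR matching PATTERN, except the KEEP newest ones.
keep_newest() {
    dir=$1
    pattern=$2
    keep=$3

    # Refuse anything that is not a plain non-negative number.
    case $keep in
        ''|*[!0-9]*) echo "keep_newest: KEEP must be a number" >&2; return 2 ;;
    esac

    # -type f skips folders; one "mtime path" record per line, newest first.
    # tail starts printing at record KEEP+1, so with KEEP or fewer files
    # it prints nothing and nothing is deleted.
    find "$dir" -maxdepth 1 -type f -name "$pattern" -printf '%T@ %p\n' |
        sort -rn |
        tail -n +"$((keep + 1))" |
        cut -d' ' -f2- |
        while IFS= read -r path; do
            rm -- "$path"    # quoted, so spaces in names are safe
        done
}

# usage: keep_newest /var/backups 'backup*.tar' 5
```

In the backup scenario, a second run without any new files finds exactly 5 files, tail prints nothing, and the backups stay where they are.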


All ad hoc, open heart surgery


Linux sysadmin is not in my job description. I see the situation from a bit of distance. What I realize is that programming has come a long way in the last 20 years. Open source libraries are used instead of quickly hacked untested functions. Everything is version controlled. Nothing gets deployed without thorough testing. Pair programming. Code reviews.

System administration? Still the same. Copy pasting some commands found on the internet. On production machines. Works? No? Try another one. Works? Fine. Document changes? Nah.

Any sysadmin of 5+ years who hasn't locked himself out of an ssh server by misconfiguring sshd or iptables in da house, please stand up.


Conclusions


It's easier these days to find answers and solutions on the internet thanks to the Q&A format of Stack Exchange. But even more than in programming, in sysadmin it is important to read the comments and secondary answers.

At the company we stick to some simple rules
  • Changes made to machines must be documented. Redmine for tasks and especially the Wiki work well.
  • Mission critical machines have a sibling.
  • System changes on production machines must be applied one at a time, with a few days in between.
  • Most apps run on VMs. It's not like git but it's as good as it gets for the time being.


End-of-file Marker


Back to the initial task of looping over a file to the end: when reading mission critical, changeable files in Java we use an end-of-file marker "eof". If the file ends with that line then we can be sure the file was read completely. If not, then it could be broken, and the program throws an exception.
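
Our implementation is in Java, but the same idea can be sketched in shell, to stay with the language of the snippets above. The function name read_with_eof_marker is hypothetical, and the sketch assumes the marker is a last line consisting of exactly "eof":

```shell
# read_with_eof_marker FILE
# Prints all lines before the final "eof" marker.
# Returns 1 without printing any content if the marker is missing.
read_with_eof_marker() {
    file=$1
    last=$(tail -n 1 "$file")
    if [ "$last" != "eof" ]; then
        echo "error: $file is incomplete (no eof marker)" >&2
        return 1
    fi
    # Marker found: print everything except the last line.
    sed '$d' "$file"
}

# usage: read_with_eof_marker /my/file.txt || exit 1
```

A non-zero return status plays the role of the exception here, so the caller has to check it.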





2 comments:

  1. It might also be worth mentioning that what constitutes a newline can differ! There are most commonly CR, LF, CR+LF, as well as the unicode line separator (U+2028) which almost nothing uses by default even in modern systems.

  2. I have not locked myself out of box by misconfiguring SSH and/or iptables.

    What you are talking about is either:

    * understaffed IT - admins don't have time to "do it right" and just put patches on patches
    * bad SA - there are tools to fix it and languages that are not shit.

    or combination of both

    We have automation. We have Puppet. We have MCollective. We know how to code in something other than that awful shitshow of language that is called shell.

    I can say "apply my puppet manifest on 5% of machines running jetty and belonging to project XYZ"

    Every change is documented in git log, together with ticket #id that caused it. Those that dont (...yet... work in progress) like switches are having their configs pulled to other repo.

    Adding a new machine is literally just a starting installer and going for a break. And configuration management ensures it will be same as other ones.

    And after that it will be automatically added to backup and monitoring.

    We actually need to explicitly specify which machine is **NOT** backed up/monitored, so not having a machine in nagios is HARDER to do than having it.
