Since I set up my first server at home over a decade ago, I’ve performed numerous operating system upgrades. Each upgrade used to take me several hours, if not days, to complete and to verify that everything worked as expected. Over the years, whenever time permitted, I’ve worked hard to make several pieces of software work flawlessly together with as little manual maintenance as possible. Although the deployment of my services has reached a high level of automation, I recently spent almost a whole day upgrading CentOS on one of my remote boxes.
According to my initial plan, this procedure shouldn’t have taken longer than 2-3 hours. I had simulated it in VirtualBox at home and I knew exactly what to expect. Unfortunately, I didn’t follow the plan strictly; I deviated from it twice, and this cost me almost the whole day.
The first thing that went wrong had to do with testing my backup, a step that was not in my original plan. I keep my server data in encrypted containers on Amazon S3 using duplicity. Although I had restored data from the backup numerous times and was certain it worked, I had the strange idea of testing a restore to a virtual machine at home, just to make sure. For that purpose I happened to use a VM whose state had been saved several days earlier, meaning that its clock was way out of sync. That was a detail I hadn’t taken into account. So, when I tried to restore the data on that box, I got a glorious exception from duplicity informing me that it could not find any signatures in the S3 bucket. That message was really unhelpful, and I wasted many hours trying to figure out what was wrong with my backup or with duplicity, until I finally realized that the box’s wrong time had caused the exception. Once the time was corrected, duplicity worked like a charm.
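A clock check of the kind that would have caught this early can be sketched in Python. It is only an illustration under stated assumptions: the function names and the 15-minute tolerance are hypothetical and are not part of duplicity. The idea is to compare the local time against the `Date` header of any HTTP response from a trusted server before starting a restore.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumed tolerance for this sketch; pick whatever your setup requires.
MAX_SKEW = timedelta(minutes=15)


def skew_from_http_date(date_header: str, local_now: datetime) -> timedelta:
    """Return the absolute difference between the local clock and the
    time reported in an HTTP Date header (for example
    'Wed, 21 Oct 2015 07:28:00 GMT')."""
    remote = parsedate_to_datetime(date_header)  # timezone-aware for 'GMT'
    return abs(local_now - remote)


def clock_is_sane(date_header: str, local_now: datetime,
                  max_skew: timedelta = MAX_SKEW) -> bool:
    """True if the local clock is within max_skew of the remote time."""
    return skew_from_http_date(date_header, local_now) <= max_skew


# Example with fixed values: a VM restored from a week-old saved state.
header = "Wed, 21 Oct 2015 07:28:00 GMT"
stale_clock = datetime(2015, 10, 14, 7, 28, tzinfo=timezone.utc)
if not clock_is_sane(header, stale_clock):
    print("Local clock is off; fix the time before restoring.")
```

In real use, `local_now` would be `datetime.now(timezone.utc)` and the header would come from a live response; fixed values are used here so that the sketch runs without network access.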
The second thing that went wrong had to do with pvGRUB, which is based on the GRUB 0.97 code and is used to boot Xen DomUs (guests). Due to some limitations of the VPS provider regarding pvGRUB, I have to use a very small partition that contains a GRUB configuration file, which eventually boots CentOS (root on LVM). This small partition was initially formatted as ext3. Again, I had a strange hunch to reformat that small partition as ext4! This would have had absolutely no benefit, but at that moment I just thought “why not?”. I was completely unaware that GRUB 0.97, and therefore pvGRUB, did not support the ext4 filesystem. To make things even worse, pvGRUB deceptively reported that it had recognized the partition as ext2, but could not locate the file I had configured it to load. Disaster. It was only a few hours later, after having gone through several bug trackers and mailing lists, that I realized that pvGRUB could not actually read from ext4. I reformatted the small partition as ext3 and everything went smoothly.
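A pre-flight check for this kind of mistake can also be sketched in Python. The function names and the mount point `/boot` are hypothetical, and the set of accepted filesystems is deliberately conservative: it lists only ext3, which worked in the setup described above, and ext2. The sketch parses text in the format of `/proc/mounts` and refuses any other filesystem for the partition the boot loader has to read.

```python
# Filesystems assumed safe for a GRUB 0.97 based loader in this sketch.
# ext4 is intentionally absent.
SUPPORTED_BY_LEGACY_GRUB = {"ext2", "ext3"}


def fstype_of(mount_point, mounts_text):
    """Return the filesystem type of mount_point, given text in the
    format of /proc/mounts (device, mount point, type, options, ...).
    Returns None if the mount point is not listed."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            found = fields[2]  # a later entry overrides an earlier one
    return found


def boot_partition_ok(mount_point, mounts_text):
    """True if the partition uses a filesystem the loader can read."""
    return fstype_of(mount_point, mounts_text) in SUPPORTED_BY_LEGACY_GRUB


# Example with a hard-coded sample instead of the real /proc/mounts.
sample = (
    "/dev/mapper/vg-root / ext4 rw,relatime 0 0\n"
    "/dev/xvda1 /boot ext4 rw,relatime 0 0\n"
)
if not boot_partition_ok("/boot", sample):
    print("The boot partition uses a filesystem the loader may not read.")
```

On a live system the text would come from reading `/proc/mounts`; the check is only useful if it runs before the reboot.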
If I had stuck to the original plan, none of the incidents above would have taken place. No matter how much I trust free software, deciding to experiment with it when I should be doing a specific job is admittedly one of the worst decisions possible. Regardless of how popular a piece of free software might be, it can still have serious bugs and limitations hidden in the last place you’d ever look. Lesson learned: stay on your path and strictly follow the plan.
Lessons learned from a recent OS upgrade by George Notaras is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright © 2012 - Some Rights Reserved
I found my way to your blog by looking for info about curl/wget cookie management. It was a great article, thank you.
This article was very comical, though I am sorry to hear of anyone wasting a full day on such trivial mishaps.
That said, your comment about “bugs and limitations” should (in this case) not be directed at the free software, but rather at the free software user. Both of these issues arose from human error. Time is something you don’t mess with in *nix without consequence, and with pvGRUB, you just forgot to RRTFM (re-read the f*cking manual). :)
Hi Curtis.
My comments about “bugs and limitations” are generally directed at the way popular Free Software is managed and not at the software itself or its developers. You see, I find it a paradox that some pieces of FOSS are installed on millions of production boxes while the progress of their development is still limited by the amount of free time of their original developers. IMHO, this is what is really comical.
For instance, the development of GRUB 1 stopped because the original developers decided to invest their free time in the development of GRUB 2, while GRUB 1 was still the main boot loader in some major Linux distributions. A user who encountered a problem with the GRUB 1 code had to go through the bug trackers of several major distributions to find a resolution. This is a waste of time caused by bad management. No one can blame the developers; it’s their free time after all, and it is they who decide how to spend it. Still, the argument of bad management stands. Personally, I blame the FOSS community and its mentality, but I’ll try to explain my views more extensively in a new post.
Thanks for your feedback. I hope I managed to give you an idea of the angle from which I’m trying to examine such issues.
Yes, I agree with your response in its entirety. It is rather amazing that some geek’s hobby becomes a core function of a server that is relied on by major corporations. But you have to admit that it is impressive that a movement without a truly central focus (other than making things that are cool) manages to produce such amazing software.
I think that your example of GRUB though is the fault of the distributions not being willing to follow upstream closely enough (or more likely not yet trusting a relatively young project when it plays such a vital function). But I know that at least Fedora continued to patch GRUB1 as necessary, even adding GPT functionality. But I guess you are probably right that a company would have continued to at least maintain the old version while starting up on their new and improved version.
A bit OT, but I think GRUB2 is an atrociously monstrous piece of code. Linux has definitely moved away from the Unix philosophy of “do one thing well”, but I think that in the case of a bootloader, this should still stand. I think it just tries to do too much and has it all packed into a single package. I have long thought that it would make more sense to have GRUB2 simply boot the kernel, and then if you want other features (like the ability to read LVM, for instance), it could be extensible. But that is just my opinion.
In any case, I appreciated your response. May your blog continue as it will. Thanks again for teaching me how to use cookies with wget and curl!