Does Long Server Uptime Reveal Management Failure?

February 4, 2010

While perusing some sysadmin blogs, I caught an older post by Sam Pointer regarding server uptimes that struck a chord with me. Too often I find people touting high server uptimes as evidence of success when those numbers may actually mean they have failed as IT managers.

The Uptime Badge of Honor
Years ago, high server uptimes were common and even seen as signs of a good sysadmin with command line mojo. Rebooting a mainframe system could be a multi-day affair, and your confidence that the system would actually reboot properly was often lacking. So you kept systems online. Since they were not connected to the internet, most security threats came from within, and those could be managed more easily than the public threats of today.

When I was in grad school, I managed an old Sun SPARC server that was used for some internal communications and to run DNA analysis software. Though I rebooted my Windows desktop frequently, I took pride that the SPARC box, affectionately known as Beagle after Charles Darwin’s ship, had been running for over 1,000 days. When our building’s private network was bridged to the campus network, the box was hacked within two days.

A well-known flaw in the OS was quickly exploited. The patches had been applied, but since the system was never rebooted, the security updates never took effect.

Reboots are Required
If you are running a recent OS, such as RHEL 4 or RHEL 5, and have not rebooted your server in the last 30 days, you are likely running an outdated kernel. Depending on the situation, your system could be vulnerable to attack. You can install a new kernel package on a running server, but the server keeps running the old kernel until it is rebooted.

If you want to ensure the system is patched properly, a reboot is required.
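
As a rough illustration of the point above, here is a minimal Python sketch that compares the running kernel with the kernel packages installed on disk. It assumes an RPM-based system such as RHEL; the function names kernel_key, reboot_pending, and installed_kernels are my own, and the version comparison is deliberately simplified (it is not RPM's real version ordering), so treat it as a starting point rather than a finished tool.

```python
import re
import subprocess


def kernel_key(release):
    """Turn '2.6.18-194.3.1.el5' into (2, 6, 18, 194, 3, 1) for comparison.

    Simplified ordering for illustration only, not RPM's own algorithm.
    """
    version, _, build = release.partition("-")
    # Keep only the leading numeric run of the build string, dropping
    # suffixes such as 'el5' or 'x86_64'.
    leading = re.match(r"[\d.]*", build).group(0)
    numbers = re.findall(r"\d+", version) + re.findall(r"\d+", leading)
    return tuple(int(n) for n in numbers)


def reboot_pending(running, installed):
    """Return True if any installed kernel is newer than the running one."""
    return any(kernel_key(k) > kernel_key(running) for k in installed)


def installed_kernels():
    """List installed kernel packages (assumes the rpm command is available)."""
    result = subprocess.run(
        ["rpm", "-q", "kernel", "--qf", "%{VERSION}-%{RELEASE}\n"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()
```

On a live server you could call reboot_pending(platform.release(), installed_kernels()) after importing platform. A True result suggests a newer kernel is sitting on disk waiting for a reboot.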

Reboots are Good
Another thing happens during reboots that is very important and often overlooked: the dreaded fsck. (There’s no place like 127.0.0.1 has a nice post on fsck.) When the system restarts, Linux will run a file system check if required. Sometimes it can repair issues automatically, while other times input is required.
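
To get a feel for when a check is "required", the sketch below reads the text printed by tune2fs -l for an ext2/ext3 file system and guesses whether the next boot will trigger fsck. The function names are hypothetical, and it only looks at the file system state and the mount counters; it ignores the time-based check interval, so the real boot-time decision may differ.

```python
def parse_tune2fs(text):
    """Parse the 'Key: value' lines of `tune2fs -l` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as the version banner
            fields[key.strip()] = value.strip()
    return fields


def check_expected(fields):
    """Guess whether the next boot will run a file system check."""
    # Anything other than a plain 'clean' state is assumed to need a check.
    if fields.get("Filesystem state", "clean") != "clean":
        return True
    mounts = int(fields.get("Mount count", 0))
    max_mounts = int(fields.get("Maximum mount count", -1))
    # A maximum of 0 or -1 means mount-count checking is disabled.
    return max_mounts > 0 and mounts >= max_mounts
```

You would feed it the output of tune2fs -l run as root against the device in question, for example captured with subprocess.run, and pass the parsed dict to check_expected.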

With the increasing use of block-level backup systems, such as the R1Soft CDP system we use, file system errors can end up in your backups. These backups operate at a level below the file system, so if your file system is corrupt, your backups will contain the same corrupt file system.

Regular reboots help ensure your file system is healthy, which in turn keeps your backups healthy.

Reboot As Needed
So my prescription is to reboot as needed. While certain situations may allow you to maintain high levels of uptime, I find rebooting when new kernels are available is a wise thing to do, especially for clients running standalone, internet-connected servers. Doing so ensures patches are applied properly, the file system is checked, and the system runs smoothly.