Achieving high uptime is a noble goal. To that end, Transparen tends to purchase high-end server hardware that allows us to do things like insert new hot-plug ServeRAID (3L or 4L, these days) cards and initialize new 10-20 SCSI disk enclosures without shutting down the servers or re-initializing the operating systems. It is also why Transparen cares about ensuring a consistent Internet connection and high-availability hydro-electric power for its servers. But despite all of these things, if uptime is the only factor managed, it is difficult to maintain more than 99% uptime under normal circumstances, and when failures occur, they can take a long time to resolve (sometimes days, not just hours or minutes).
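To put the 99% figure in perspective, the sketch below converts an uptime percentage into the downtime it permits over a year. The function name and the 99.9% and 99.99% comparison targets are illustrative assumptions, not Transparen figures.

```python
# Convert an uptime percentage into the downtime it permits over a period.
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted by the given uptime percentage."""
    return period_hours * (1 - uptime_percent / 100)


# 99.9 and 99.99 are included only for comparison (hypothetical targets).
for pct in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime allows about {hours:.1f} hours of downtime per year")
```

At 99% the budget is about 87.6 hours a year, or roughly three and a half days, so a single failure that takes days to resolve can use up the whole year's allowance.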
Factors Resulting in High Uptime
Naturally, we believe in following industry best practices when it comes to maintaining high uptime, which include:
- Using good hardware
- Ensuring that electric power is highly available
- Using Uninterruptible Power Supplies (UPS) to prevent short power interruptions from limiting uptime
- Having redundant parts in the servers, including extra hard drives (RAID configurations), extra network cards, extra power supplies, etc.
- Practising a conservative approach to software changes - making high-risk changes only when absolutely necessary, while taking regular actions to ensure that stability improvements are implemented promptly.
Despite Best Practices, Good Hardware, and Ideal Environment, Server Uptime is Limited By Single Points of Failure
Despite all these practices, a solitary machine, even with redundant parts, may still fail, because not all of its parts are redundant, and there are still things that can happen that will limit uptime. For instance:
- More often than one might realize, a localized power outage may occur - one which may not affect a whole building, but which may affect the server. The most common examples are a tripped circuit breaker or an unplugged power cord.
- A network connection may be severed. This could happen in many ways - the simplest is that the ethernet cable can become unreliable and wiggle slightly free, either on the router or the server. But there could be other ways, including router failures, fried ethernet cards, internet provider problems, etc.
- The RAID array may collapse. Even though RAID provides hard drive redundancy, there are still parts of the RAID array that can fail and take the whole thing down. These include:
  - The RAID card (or SCSI/SATA/IDE card, if implementing a software RAID)
  - The backplane (i.e. what all the drives plug into)
  - The cabling between the RAID card and the backplane
  - Catastrophic hard drive failures (i.e. multiple hard drive failures beyond the redundancy provided by the RAID configuration)
- Memory might be defective. Server memory is usually provisioned with error-correcting code (ECC), but ECC cannot correct every fault, so a failing module may still bring the machine down.
- Power supplies might go out of commission and require replacement. If the power supply redundancy is not sufficient, then the machine may need to be shut down, although it may be possible to replace parts without necessitating shutdown.
In other words, there are too many points of failure, and therefore the odds are stacked against keeping a single server up for years and years.
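The way single points of failure compound can be sketched with a little arithmetic. Assuming the parts fail independently, a machine that needs every one of them is available only as often as the product of their individual availabilities. The 99.9% figures below are hypothetical, chosen only to illustrate the effect, and are not measurements of any Transparen hardware.

```python
from math import prod


def series_availability(component_availabilities):
    """Availability of a machine that is up only when every listed part is up.

    Assumes the parts fail independently of one another.
    """
    return prod(component_availabilities)


# Hypothetical per-part availabilities, for illustration only.
single_points = {
    "power feed": 0.999,
    "network link": 0.999,
    "RAID card": 0.999,
    "backplane": 0.999,
    "memory": 0.999,
}

combined = series_availability(single_points.values())
print(f"Availability of the whole server: {combined:.4f}")
```

Each part on its own looks very reliable, yet the server as a whole is less available than any one of them, and the figure only drops further as more non-redundant parts are counted.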
Availability is Not Limited By Uptime
But even though individual servers may need to be taken down from time to time, whether voluntarily for maintenance or involuntarily through failure, this does not mean that the 'system' cannot remain available. During such times, the goal is to let the system continue to operate, only perhaps not as powerfully as when all servers are up. In other words, ideally, if a server goes down, the system should operate a tiny bit slower than normal, but continue to operate. This way, services can be provided continuously, despite hardware problems that occasionally arise.
The benefit is that availability is compromised only when all nodes fail. If each node has a 1% chance of being down on a particular day, and the nodes fail independently of one another, then the chance that the whole system will go down on that day is (1%)^n + x, where n is the number of nodes, and x is the chance that the clustering solution is configured wrong or has some bug. With 2 nodes, the chance of a catastrophic failure on a given day is 0.01% + x; with 3 nodes, it is 0.0001% + x. Due to the nature of the software written for high availability, and the people who are interested in configuring it, x is a very, very small number.
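The arithmetic can be checked with a short sketch. It assumes that nodes fail independently, and the function name is made up for this example; x is left at zero because the text gives no value for it.

```python
def chance_all_nodes_down(p_node_down: float, nodes: int, x: float = 0.0) -> float:
    """Chance that every node is down on the same day, plus x.

    p_node_down: chance that a single node is down on a given day
    nodes:       number of redundant nodes (n in the text)
    x:           chance of a cluster misconfiguration or bug (not given in the text)
    Assumes node failures are independent of one another.
    """
    return p_node_down ** nodes + x


for n in (1, 2, 3):
    print(f"{n} node(s): {chance_all_nodes_down(0.01, n):.6%}")
```

With a 1% daily chance per node, each extra node divides the chance of a total outage by 100: 1% for one node, 0.01% for two, and 0.0001% for three, before x is added.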
Many Single Points of Failure Eliminated
By employing redundant servers configured for high availability, we can eliminate several points of failure:
- Multiple Internet providers can be used, so if one fails, the other may still work
- Multiple locations - if power goes down in one place, a server in another place is likely to still have power and an Internet connection, and be able to take over as a primary server.
- Multiple servers - if a server (or node) becomes dysfunctional, others stand ready to take its place
- Multiple DNS servers on different IP addresses - if one goes down, the others take over. Plain DNS (often called round-robin DNS) can be used to provide a kind of load balancing - each time a web browser looks up a web server address, it receives a list of IP addresses (in random order). The web browser tries the servers one by one until one works. In the event of a downed server, the user would experience a slowdown, but not a service disruption.
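The client-side behaviour described in the DNS item can be sketched as follows. This is a minimal illustration of the idea, not how any particular browser is implemented. The function names, the default port, and the timeout are assumptions made for the example, and the `connect` parameter exists only so the logic can be exercised without a network.

```python
import socket


def resolve_addresses(hostname: str, port: int = 80):
    """Return the distinct IP addresses DNS publishes for hostname, in the order received."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def connect_to_first_working(addresses, port: int = 80, timeout: float = 3.0,
                             connect=socket.create_connection):
    """Try each address in turn; return (address, connection) for the first that answers."""
    last_error = None
    for address in addresses:
        try:
            return address, connect((address, port), timeout=timeout)
        except OSError as exc:
            # This server is down or unreachable; fall through to the next one.
            last_error = exc
    raise ConnectionError(f"no server reachable among {list(addresses)}") from last_error


# Example use (needs network access; the hostname is a placeholder):
#   address, conn = connect_to_first_working(resolve_addresses("www.example.com"))
#   conn.close()
```

The slowdown mentioned above shows up here as the timeout: each dead address costs up to `timeout` seconds before the next one is tried, so the user waits longer but is still served.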
High Availability is Not An Excuse to Not Do Backups
Just because the system is engineered never to go down does not mean that the system administrators can rest assured that it never will. However unlikely it may be, a catastrophic failure is only a matter of when, not if. And due to the complexity of the system, when the failure occurs, some pretty damned good backups will be needed to effect a timely restoration.
