Part 5: High Availability (HA)
Let’s start by defining what high availability is. High availability (HA) refers to systems that are durable and likely to operate continuously, without failure, for extended periods of time. HA implies that parts of a system have been fully tested and, in many cases, that the system design incorporates redundant components to accommodate failures within the system. The availability of HA systems is measured as the percentage of time the system is available (its uptime) in a given year. The target availability percentages for systems are usually defined in a service level agreement (SLA), and are typically expressed as a number of 9s of availability, as shown in the following table:

Availability             Downtime per year
99% (two nines)          3.65 days
99.9% (three nines)      8.76 hours
99.99% (four nines)      52.6 minutes
99.999% (five nines)     5.26 minutes
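The arithmetic behind the table is simple enough to sketch in a few lines. This is an illustrative helper of my own (the function name is not from the original text); the figures it prints match the standard "nines" table.

```python
# Illustrative helper: convert an SLA availability percentage into the
# maximum downtime it permits over a given period.

def allowed_downtime_seconds(availability_pct: float, period_seconds: float) -> float:
    """Maximum seconds of downtime permitted in the given period."""
    return period_seconds * (1.0 - availability_pct / 100.0)

SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600

for label, pct in [("two nines", 99.0), ("three nines", 99.9),
                   ("four nines", 99.99), ("five nines", 99.999)]:
    minutes = allowed_downtime_seconds(pct, SECONDS_PER_YEAR) / 60
    print(f"{label} ({pct}%): {minutes:,.1f} minutes of downtime per year")

# Five nines allows well under a second of downtime per day:
ms_per_day = allowed_downtime_seconds(99.999, SECONDS_PER_DAY) * 1000
print(f"five nines per day: {ms_per_day:.0f} ms")
```

Note that each added nine cuts the allowed downtime by a factor of ten, which is why the cost of HA rises so steeply with each extra nine.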
Most organizations target five nines of availability for mission critical systems, which equates to roughly 5.26 minutes of downtime per year. To achieve this level of availability, systems have redundant power supplies, RAID (redundant array of inexpensive disks) storage, and error-correcting code (ECC) RAM. In addition, they will have multiple servers with some form of software clustering that allows an entire server to fail without causing interruptions to availability. In this traditional HA implementation, both the hardware and the software are made as resilient as possible.
There are several methods of adding high availability to critical systems beyond the hardware and software method just covered. With the advent of server virtualization in the x86 space, hypervisors have begun adding functionality to address HA, such as VMware’s HA and FT features. The hypervisor-based HA feature simply reboots a virtual machine on another physical server in the event of a server failure. This is not a technique that would be used to achieve five nines of availability for mission critical applications, but it is frequently used to provide higher levels of availability to less critical applications, particularly since it can be enabled on a virtual server with a single check box in its configuration. VMware’s FT (Fault Tolerance) feature takes this a step further. FT creates a second instance of a running virtual server on a separate hypervisor host and keeps the memory and CPU states in lockstep, so the primary and secondary virtual servers execute the same instructions at the same time. VMware disconnects the input and output of the secondary virtual server to prevent conflicts but, in the event of a failure on the primary host, transfers all input and output to the secondary server almost instantly. In most cases, users never notice that the primary server went down. This can achieve five nines or better for many applications, though FT has some limitations that prevent it from working with certain applications.
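Both the clustering and hypervisor-based approaches above depend on the same basic mechanism: detecting that a node has died and promoting a survivor. The following is a minimal sketch of that heartbeat logic, under my own assumptions; the node names, class, and timeout are hypothetical, not any vendor's actual implementation.

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # hypothetical: seconds of silence before failover

class ClusterNode:
    """Toy model of one node in an active/standby cluster pair."""

    def __init__(self, name: str, active: bool):
        self.name = name
        self.active = active
        self.last_heartbeat_from_peer = time.monotonic()

    def receive_heartbeat(self) -> None:
        """Record that the peer node is still alive."""
        self.last_heartbeat_from_peer = time.monotonic()

    def check_peer(self) -> bool:
        """Standby promotes itself if the peer's heartbeat has gone stale."""
        stale = time.monotonic() - self.last_heartbeat_from_peer > HEARTBEAT_TIMEOUT
        if not self.active and stale:
            self.active = True  # take over the service from the dead peer
        return self.active

standby = ClusterNode("node-b", active=False)
standby.last_heartbeat_from_peer -= 10  # simulate a peer that stopped responding
print(standby.check_peer())  # the standby has promoted itself
```

Real clustering software adds quorum, fencing, and state replication on top of this, precisely so that a stale heartbeat cannot produce two active nodes at once.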
All of the methods of implementing HA that I have touched on so far have one critical thing in common: they rely on shared storage or near-synchronous data replication. This limits these forms of HA to a single data center, or at best to multiple data centers close enough together to minimize latency. As a result, a disaster such as a fire or severe weather can take these HA systems down and break their SLAs.
A more recent form of HA is completely software based. To implement it, applications must be designed not just to tolerate failures, but to expect them. Most people refer to this type of application as cloud-ready or cloud-aware. These applications are designed from the ground up to run on large numbers of inexpensive servers, and they typically distribute their data across those servers, which allows greater distances between groups of servers and better resilience to disasters. This is a huge undertaking and beyond the capabilities of many companies; however, as more and more applications move to a cloud-based software-as-a-service (SaaS) model, they are increasingly built on this type of architecture.
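The "expect failures" design principle can be sketched as a client that treats every call as fallible and fails over across redundant replicas. This is a minimal illustration under my own assumptions: the replica names, the simulated outage, and the `fetch` stub are all hypothetical stand-ins for real services.

```python
import random

REPLICAS = ["node-a", "node-b", "node-c"]  # hypothetical server names
DOWN = {"node-a"}                          # simulate one failed replica

class ReplicaDown(Exception):
    pass

def fetch(replica: str, key: str) -> str:
    """Stand-in for a network call to one replica."""
    if replica in DOWN:
        raise ReplicaDown(replica)
    return f"{key}@{replica}"

def resilient_get(key: str) -> str:
    """Treat every call as fallible: try replicas until one answers."""
    for replica in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return fetch(replica, key)
        except ReplicaDown:
            continue  # an expected event, not an emergency
    raise RuntimeError("all replicas down")

print(resilient_get("user:42"))  # served by a surviving replica
```

The key shift is architectural: the failure handling lives in the application itself rather than in shared storage or hypervisor features, which is what lets these designs span widely separated data centers.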
There are many ways to implement HA, and it can be implemented to achieve different levels of availability. The business challenge is determining which systems need HA, what level of HA they need, and how to achieve that level most cost-effectively. The best way to start is not to look at things system by system, but to look at business processes: determine which processes are most sensitive to disruption and what those disruptions cost. From there you can examine the systems that support those processes and develop your HA policies.
Two Final Points on High Availability
First, HA provides absolutely no backup or archival functionality. Second, backup and archival policies and technologies may need to be adjusted to accommodate HA. Many backup and archive processes require temporary interruptions to servers, for example to quiesce a database. Remember that five nines of availability allows only 864 milliseconds per day for a system to be unavailable.
Proper planning and design of HA solutions is critical. Knowing how HA can facilitate other needs such as disaster recovery, as well as the impact of backup and archive on HA, will lead to more successful implementations. HA is like an insurance policy: you implement it hoping you never need it, and you justify it by quantifying the cost of not having it.
Next up, Part 6: Disaster Recovery Planning
This is part 5 of 10 in the From High Availability to Archive: Enhancing Disaster Recovery, Backup and Archive with the Cloud series. To read them all right now download our free eBook.