Part 10: Applying Tools to Your Disaster Recovery Plans
Let’s review RPO and RTO before we get into applying DR tools to your disaster recovery plans. RPO stands for recovery point objective, or how much data an organization can afford to lose in the event of a disaster. RPOs typically range from near zero to 24 hours. This maps directly to how often data is copied from production storage to offsite media that will be used in the disaster recovery process. Keeping that media away from the production location is key: presumably it is the production location that suffered the disaster, and depending on the nature of the disaster, all media at that location may be lost. RTO stands for recovery time objective, or how long a system can be down in the event of a disaster. RTOs typically range from zero to several weeks or months. Some systems may be assigned RTOs stating that they do not need to come back up during the disaster, only once the disaster is over and operations have returned to a normal production facility.
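The link between copy frequency and RPO can be checked with simple arithmetic. The following is a minimal sketch, assuming offsite copies start at a fixed interval and take a known time to land at the offsite location; the function and parameter names are illustrative and not taken from any particular DR tool.

```python
from datetime import timedelta


def worst_case_data_loss(copy_interval: timedelta,
                         transfer_time: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on lost data if the disaster hits just before a copy lands offsite.

    Assumption: copies start every `copy_interval` and need `transfer_time`
    to be fully stored offsite.
    """
    return copy_interval + transfer_time


def meets_rpo(copy_interval: timedelta, rpo: timedelta,
              transfer_time: timedelta = timedelta(0)) -> bool:
    """True if the copy schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(copy_interval, transfer_time) <= rpo


# A nightly copy satisfies a 24 hour RPO but not a 4 hour one.
print(meets_rpo(timedelta(hours=24), timedelta(hours=24)))  # True
print(meets_rpo(timedelta(hours=24), timedelta(hours=4)))   # False
```

Including the transfer time matters for tight RPOs: an hourly copy that takes ten minutes to arrive offsite can lose up to seventy minutes of data in the worst case.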
Servers and systems are typically arranged into a grid based on desired RPOs and RTOs. To facilitate this, different tiers for RPO and RTO are used. Here’s an example using four tiers for RTO and three tiers for RPO:
Tier 4 for RTO means that the systems do not need to be recovered until operations are migrated back to a production site after the disaster. This category generally includes things like test and development systems.
These RPO and RTO groups are just an example. Every organization should choose RPO and RTO times that meet their business needs.
The tiers used for this example give us a grid of 12 possible combinations.
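As a quick illustration, the twelve combinations can be enumerated by pairing each RTO tier with each RPO tier. This sketch uses only the tier labels from this article (RTO tiers 1 to 4, RPO tiers A to C); the variable and function names are made up for the example.

```python
from itertools import product

RTO_TIERS = ["1", "2", "3", "4"]  # recovery time tiers, 1 is the fastest
RPO_TIERS = ["A", "B", "C"]       # recovery point tiers, A tolerates the least data loss


def tier_grid() -> list[str]:
    """Return every RTO/RPO combination as a label such as '1A' or '4C'."""
    return [rto + rpo for rto, rpo in product(RTO_TIERS, RPO_TIERS)]


grid = tier_grid()
print(len(grid))  # 12
print(grid)
```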
I’ve color coded this grid to group certain tier combinations by the types of tools that can be used to meet requirements.
Many of the tools and techniques covered here can be used for several of the RPO/RTO tiers. The challenge is selecting the set of tools and techniques that gives you the right balance of cost, functionality, and ease of management. For example, storage replication could potentially be used to cover all twelve combinations of RPO and RTO outlined above. This solution would be easy to manage, but it would be one of the most expensive if implemented for every single application. At the other end of the spectrum, a tape or virtual tape backup solution could be implemented as the single DR solution for an organization. It could be configured to be relatively easy to manage and very cost effective; however, it would not have the functionality required for applications with a low RPO. By combining several tools and techniques, some of them covering multiple tiers, you can develop an optimal solution for your organization and include it in your disaster recovery plans.
Tier 1A is basically HA extended into the disaster recovery plans. This requires near real time replication to the DR site and either automated failover or the ability to simply power the system on at the DR site. This tier typically consists of business critical database servers. The best option for this tier is application based replication. Some applications have even shorter RTOs, but these venture into the realm of fault tolerant applications, which is beyond the scope of this article. Storage, OS, and hypervisor replication may also seem like viable options; however, they produce “crash consistent” copies of the data at the DR site. Crash consistent means that the data is in a condition similar to that of a server that has crashed or had its power turned off without a proper shutdown procedure. As a result, the replicated data may be slightly corrupted and might require additional work, such as rolling back database logs, to be brought back to a consistent state. That additional effort can easily push the recovery time beyond the <15 minute range of RTO Tier 1.
Application replication replicates data in a consistent manner, meaning it sends data and metadata in a format that allows the server at the DR site to process the information and recover to the most recent successfully completed transaction. This eliminates the potential need for additional data remediation steps and keeps the recovery time and recovery point within the standards set for Tier 1A. Some organizations will use storage, OS, or hypervisor replication for Tier 1A because modern journaling file systems make corrupted data at the DR site unlikely. There is still a risk, however, that corrupted data will add to the recovery time. Some storage vendors offer add-on features that reduce the time required to remediate crash consistent data. Storage, OS, or hypervisor based continuous replication may be required if an application does not support application replication.
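To make "recover to the most recent successfully completed transaction" concrete, here is a toy sketch of the idea. It is not how any specific database or replication product works: the log format and the `recover_committed` function are hypothetical, and real systems add many more safeguards.

```python
def recover_committed(log):
    """Replay a replicated log, applying only writes from committed transactions.

    Hypothetical record formats:
      ("write", txn_id, key, value)
      ("commit", txn_id)
    """
    pending = {}  # txn_id -> writes received but not yet committed
    state = {}    # data as recovered at the DR site
    for record in log:
        if record[0] == "write":
            _, txn_id, key, value = record
            pending.setdefault(txn_id, []).append((key, value))
        elif record[0] == "commit":
            # Only now do the transaction's writes become part of the state.
            for key, value in pending.pop(record[1], []):
                state[key] = value
    return state


replicated_log = [
    ("write", 1, "balance", 100),
    ("commit", 1),
    ("write", 2, "balance", 40),  # the disaster struck before transaction 2 committed
]
print(recover_committed(replicated_log))  # {'balance': 100}
```

Because the replica knows which transactions completed, it can discard the half-finished write without a separate repair step. A crash consistent copy carries no such guarantee on its own.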
Applications that fall into the 1B category need to be recovered quickly but can afford some data loss. These are typically utility applications that support mission critical applications and, in some cases, file servers. This tier can use any of the tools and techniques described for tier 1A, as well as tools that support asynchronous replication. Database servers typically don’t fall into this tier, so crash consistent replicas become more acceptable. For servers in this tier that are virtualized, hypervisor based replication tools provide a cost effective solution. For non-virtualized servers, OS replication tools and asynchronous SAN-based replication can be used. Because of the low RTO, non-virtualized servers will require dedicated physical servers to be recovered onto.
Tier 1C requires servers to be brought online quickly, but is not concerned with data loss. This seems like an odd combination. Why wouldn’t an organization be concerned about data loss on a server? The answer is surprisingly simple. Many application and web servers don’t actually save any data; they simply process or present information. The only practical concern with this type of server is that it gets restored with all of the most recent patches and updates applied. To meet these requirements, tools that replicate the servers on a daily basis, or even a process that replicates the servers only when they are patched or updated, are ideal. The challenge is that the servers need to be recovered quickly, which is easy for virtualized servers but difficult and costly for physical servers. This is a sweet spot for hypervisor replication. Some hypervisor replication tools also have agents that can be installed on physical servers so they can be recovered as virtual servers in the event of a disaster. If it is not possible to recover some physical servers as virtual servers, you may need to use some of the other techniques described for tiers 1A and 1B.
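The two replication triggers mentioned for these stateless servers, daily or only after patching, can be expressed as a small decision function. This is a sketch under assumed inputs: the timestamps would come from your own patching and replication records, and `needs_replication` is a made-up name, not part of any replication product.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_replication(last_replicated: datetime,
                      last_patched: datetime,
                      now: datetime,
                      max_age: Optional[timedelta] = timedelta(days=1)) -> bool:
    """Decide whether a stateless server should be replicated again.

    Replicate if the server was patched after the last replica was taken, or
    if the replica is older than `max_age`. Pass max_age=None to replicate
    only when the server has been patched.
    """
    if last_patched > last_replicated:
        return True
    if max_age is not None and now - last_replicated >= max_age:
        return True
    return False


now = datetime(2024, 1, 10, 12, 0)
replica_taken = datetime(2024, 1, 10, 2, 0)
print(needs_replication(replica_taken, datetime(2024, 1, 9), now))         # False
print(needs_replication(replica_taken, datetime(2024, 1, 10, 9, 0), now))  # True
```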
Tiers 2A & 3A
Tiers 2A and 3A have a low tolerance for data loss, but do not need to be immediately available after a disaster is declared. This means that nearly continuous data replication is needed, but an immediately bootable copy of the operating system and applications is not. Any of the tools and techniques described for tiers 1A, 1B, and 1C can be used to meet these criteria; however, they will probably not be the most cost effective choice for tier 2A and 3A systems. For most of these systems it is perfectly acceptable to replicate only the data associated with the applications in near real time, and not the operating system or the applications themselves. This can be done most easily with storage based replication for the data and more traditional backup solutions for the operating system and applications, with those backups replicated on a daily basis. This allows the servers and applications to be recovered from backups and then configured to use the replicated data.
Tiers 2B, 2C, 3B, 3C, 4B, and 4C
This group of tiers encompasses applications and servers ranging from those that can tolerate some data loss and have to be back online within a day of a disaster being declared, to those that may never need to be recovered. This group can be protected most cost effectively with traditional daily backups. At a minimum, replicate the backups for servers in the 2B and 2C tiers to the disaster recovery site. Backups for the remaining servers can be replicated to another location, such as inexpensive cloud storage, to reduce cost while still preserving the data.
Tier 4A
This is a unique tier: it tolerates little to no data loss, yet is not critical to restore. If it contains anything, it will typically contain data stores used for archiving data needed to meet compliance regulations. Protection is typically achieved by replicating the data to low cost cloud storage using a NAS cloud gateway. This type of solution provides a network share that automatically replicates anything stored on it to cloud storage. It is important to turn on versioning for the cloud storage, both to keep all versions of the data and to protect against data corruption by malware and crypto-locking attacks.
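The tier-to-tool pairings discussed in this article can be condensed into a simple lookup table. The sketch below paraphrases the recommendations given for each group, with 4A being the archive tier just described. The dictionary and the `recommend` function are illustrative names only, and your own organization's mapping may well differ.

```python
# Preferred approach per tier, paraphrased from the discussion in this article.
TOOLS_BY_TIER = {
    "1A": "application based replication",
    "1B": "asynchronous replication (hypervisor, OS, or SAN based)",
    "1C": "hypervisor replication, daily or after each patch",
    "2A": "storage replication for data, daily replicated backups for OS and applications",
    "3A": "storage replication for data, daily replicated backups for OS and applications",
    "2B": "daily backups replicated to the DR site",
    "2C": "daily backups replicated to the DR site",
    "3B": "daily backups replicated to low cost storage such as cloud",
    "3C": "daily backups replicated to low cost storage such as cloud",
    "4B": "daily backups replicated to low cost storage such as cloud",
    "4C": "daily backups replicated to low cost storage such as cloud",
    "4A": "NAS cloud gateway to versioned cloud storage",
}


def recommend(tier: str) -> str:
    """Return the suggested approach for a tier label such as '2B'."""
    key = tier.strip().upper()
    if key not in TOOLS_BY_TIER:
        raise KeyError(f"Unknown tier: {tier!r}")
    return TOOLS_BY_TIER[key]


print(recommend("1a"))  # application based replication
print(recommend("4A"))  # NAS cloud gateway to versioned cloud storage
```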
This article contains a high level example of mapping tools and techniques to various RPO/RTO requirements. There are many more things to consider when building actual disaster recovery plans, such as consistency groups and the physical limitations of recovery systems and sites, along with many others that are beyond the scope of this series. When it comes to disaster recovery, a general rule of thumb is that the lower the RTO and/or RPO, the more the solution will cost.
This series of articles has provided a high level overview of many topics, and is by no means an exhaustive reference on any of them. I hope you found the series informative, and that it gave you a greater appreciation of the differences between backup, archiving, and disaster recovery, as well as insight into how properly designed and implemented solutions can complement each other to reduce costs and provide better overall services to meet your business needs.
This is part 10 of 10 in the From High Availability to Archive: Enhancing Disaster Recovery, Backup and Archive with the Cloud series.