Part 8: Disaster Recovery Tools
In the last two posts I covered why DR is needed, what goes into creating a DR plan, and various locations that can be used as recovery sites. Now it’s time to dive into the disaster recovery tools and methods that can be used to meet your RPOs and RTOs. There are many disaster recovery tools and techniques, so I will cover a few of the main ones, including:
- Consistency groups
- Tape backup and related variants
- Storage replication
- OS level replication
- Application replication
- Hypervisor based replication
- The use of SaaS
You will recall from a previous post that DR is all about getting the technology systems that support business processes back up and running. This requires that the servers and systems are recovered in a manner that ensures data is not corrupted and will not cause issues with the associated business processes. Sounds simple, right? Well, some applications require multiple servers and multiple data sources to function properly. If the data is out of sync between these systems, it can, and will, cause issues. To help with this challenge the concept of consistency groups was developed. A consistency group allows you to group all of the servers related to an application together so they are all backed up or replicated at the same point in time, ensuring that their data is in sync, or consistent. Note that consistency groups are not supported by all disaster recovery tools.
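To make the idea concrete, here is a minimal Python sketch of a consistency group. This is not any vendor's API; the `ConsistencyGroup` class and server names are made up for illustration. The key point is that one point-in-time marker is captured once and applied to every member, so no server's copy is newer or older than another's.

```python
from datetime import datetime, timezone

class ConsistencyGroup:
    """Group servers so their snapshots share one point in time."""

    def __init__(self, name, servers):
        self.name = name
        self.servers = list(servers)

    def snapshot(self):
        # Capture a single timestamp once, then apply it to every member.
        # Because all members share the same marker, their data is
        # mutually consistent -- no server is ahead of another.
        point_in_time = datetime.now(timezone.utc)
        return {server: point_in_time for server in self.servers}

# Hypothetical three-tier application protected as one group
group = ConsistencyGroup("erp-app", ["app-server", "db-server", "file-server"])
snaps = group.snapshot()
assert len(set(snaps.values())) == 1  # every member shares one point in time
```

Snapshotting each server independently instead would produce three slightly different timestamps, which is exactly the out-of-sync problem consistency groups exist to prevent.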
Tape backup is one of the oldest disaster recovery tools. In the event of a disaster, you grab all of your tapes, get them to the DR site, and start restoring. I call this PTAM, or the pickup truck access method, since you may need to transport quite a few tapes, and I have seen people load them into a truck for transport. Some more modern variants, which eliminate the need to physically transport media, are disk-to-disk-to-disk backups, where one of the disk pools is at the recovery site, and disk-to-VTL-to-VTL, where one of the virtual tape libraries is at the recovery site. Many, but not all, tape backup systems can be configured to use consistency groups. However, due to the nature of tape backup systems, they are typically only viable for medium to high RTOs, typically greater than 4 hours. Tape is also usually limited to larger RPOs, since backups are run once per day.
With the advent of storage area networks (SANs), the ability to use storage-based replication became a popular tool for DR. With storage based replication, SAN-based storage appliances are configured to replicate data from one storage unit to another automatically. There are two types of storage-based replication, synchronous and asynchronous.
With synchronous replication, when a server writes data to its local storage device, the storage device immediately sends the data to the secondary storage device. Once the data has been written to disk on the secondary device, it sends back an acknowledgment that the data has been written. When the primary device receives this acknowledgment, it sends an acknowledgment to the server that the data has been written. This ensures that the data is safely stored on both devices before the server/application tries to write more data. The problem with synchronous replication is physics. Data can only travel so fast between the two disk systems. According to Einstein, it can travel no faster than light; in practice data moves somewhat slower than light, but I'll use the speed of light in fiber for my example. Latency is the time it takes data to travel a given distance. The following diagram shows where latency is introduced in a synchronous replication system.
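The write/acknowledge sequence above can be sketched in a few lines of Python. This is a toy model, not a real SAN protocol; the `Disk` and `SyncReplicatedSAN` names are invented for illustration. What matters is the ordering: the server's write call does not return until the secondary has acknowledged.

```python
class Disk:
    """Toy block device backed by a dict."""

    def __init__(self):
        self.blocks = {}

    def write(self, addr, data):
        self.blocks[addr] = data
        return True  # acknowledgment that the data is on disk

class SyncReplicatedSAN:
    """Primary acks the server only after the secondary has acked."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, addr, data):
        self.primary.write(addr, data)
        # In a real array this call blocks for the full round trip to
        # the secondary site -- the source of the latency discussed below.
        ack = self.secondary.write(addr, data)
        return ack  # only now may the server issue its next write
```

Because `write` returns only after both copies exist, losing the primary site at any instant leaves the secondary with every acknowledged write, which is why synchronous replication delivers an RPO of zero.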
I won’t go into the gory math, but for this example let’s assume that the distance between the server and primary SAN is 100 feet, and the distance between the primary and secondary SANs is 10 miles. I am going to ignore the 100 feet between the server and primary SAN, since it is insignificant in these calculations and it will be there whether or not we are doing replication. By waving a magic wand and skipping the math, I can tell you that distance introduces about 8.2 microseconds of latency per mile. So replication has introduced another 10 miles for information to travel before the server can write more data. Actually, we have twice that, since the data needs to travel the 10 miles to get to the secondary SAN and then the acknowledgment needs to travel 10 miles back. This means our replication has introduced 164 microseconds of latency to every disk write. This isn’t a lot of latency, but it might be noticeable in a high-performance database application. If we increase the distance between the SAN devices to 100 miles, we are now introducing almost 2 milliseconds of latency every time data is written to disk. That will be very noticeable. The point of this is that synchronous replication is only viable if your disaster recovery site is close to your production site – possibly dangerously close, since a nearby site can be caught in the same regional disaster.
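For those who want the math made explicit, the round-trip numbers above work out like this (using the same ~8.2 microseconds of one-way latency per mile):

```python
# One-way latency per mile for light in fiber, the figure used above.
US_PER_MILE = 8.2

def replication_latency_us(miles):
    """Round-trip latency added per write: data out, acknowledgment back."""
    return 2 * miles * US_PER_MILE

print(replication_latency_us(10))   # 164.0 microseconds
print(replication_latency_us(100))  # 1640.0 microseconds, i.e. ~1.6 ms
```

Every synchronous write pays this round trip before the server can proceed, so doubling the distance doubles the per-write penalty.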
With asynchronous replication, the primary SAN device acknowledges the write to the server as soon as the data is written to its own local disk. It does not wait for acknowledgment from the secondary SAN device. The primary SAN keeps track of acknowledgments from the secondary SAN so it knows the data is eventually written to the secondary device. This does not introduce any latency to the server as it writes data, which means distance isn’t a big factor when it comes to using asynchronous replication. It also means the primary and secondary SANs could be seconds to minutes out of synchronization, depending on how your asynchronous replication is configured. To summarize: synchronous replication provides an RPO of zero but has severe distance limitations, while asynchronous replication has no distance limitation but has an RPO of seconds to minutes.
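The asynchronous flow can be sketched the same way as the synchronous one. Again, this is a toy model with invented names, not a vendor API; the `pending` queue stands in for the replication backlog a real array maintains.

```python
from collections import deque

class Disk:
    """Toy block device backed by a dict."""

    def __init__(self):
        self.blocks = {}

    def write(self, addr, data):
        self.blocks[addr] = data
        return True

class AsyncReplicatedSAN:
    """Primary acks immediately; replication to the secondary lags behind."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.pending = deque()  # writes not yet on the secondary

    def write(self, addr, data):
        self.primary.write(addr, data)
        self.pending.append((addr, data))
        return True  # ack the server right away; no distance penalty

    def drain(self):
        # A real array replicates in the background; whatever sits in
        # this backlog when disaster strikes is your RPO exposure.
        while self.pending:
            addr, data = self.pending.popleft()
            self.secondary.write(addr, data)
```

If the primary site is lost while `pending` is non-empty, those writes never reach the secondary, which is exactly why asynchronous replication carries an RPO of seconds to minutes rather than zero.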
Next up, Part 9: Disaster Recovery Tools – Replication and Virtualization. Until next time, keep your data protected.
Get the FREE eBook
This is part 8 of 10 in the From High Availability to Archive: Enhancing Disaster Recovery, Backup and Archive with the Cloud series. To read them all right now download our free eBook.