DR 101: Recovery Time Objective (RTO)
In disaster recovery the term Recovery Time Objective (RTO) can simply be defined as:
“The time that it takes to recover data and applications”.
This means that in the event of a disaster, such as a system-wide virus, a user error deleting production data, or a key hardware failure, the RTO is the time it will take to recover from this disaster and have the data and applications back online and running at your recovery site.
The cost of the downtime associated with waiting for applications and data to be recovered can result in a significant loss of revenue and productivity, as your business may have no applications available with which to continue generating revenue. Even in the smallest of organizations this can be a significant figure. Take a company with an annual turnover of $100m as an example: dividing the annual revenue by the days in the year and the hours in the day ($100m / 365 / 24) gives roughly $11,400 of lost revenue for every hour of downtime.
That is a significant amount, and it comes from the most basic of calculations. The actual figure can be significantly worse if the disaster occurs during working hours, when most revenue is typically generated. Additionally, this calculation doesn’t even attempt to quantify the impact on customers, brand identity, market perception, suppliers and share price, any of which can increase the total impact considerably.
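The basic calculation described above (annual revenue divided by days in the year and hours in the day) can be sketched in a few lines of Python. The function names are illustrative, and the working-hours variant assumes a trading year of 260 days of 8 hours each, which is an assumption made here and not a figure from this article.

```python
DAYS_PER_YEAR = 365
HOURS_PER_DAY = 24


def hourly_revenue(annual_revenue, days=DAYS_PER_YEAR, hours_per_day=HOURS_PER_DAY):
    """Average revenue earned per hour over the given trading period."""
    return annual_revenue / (days * hours_per_day)


def downtime_cost(annual_revenue, outage_hours, **trading_period):
    """Revenue lost during an outage of the given length."""
    return hourly_revenue(annual_revenue, **trading_period) * outage_hours


annual = 100_000_000  # the $100m turnover used in the example

# Revenue spread evenly across every hour of the year.
print(f"${hourly_revenue(annual):,.0f} per hour")            # $11,416 per hour
print(f"${downtime_cost(annual, 4):,.0f} for a 4 hour RTO")  # $45,662 for a 4 hour RTO

# Assumed working-hours view: 260 trading days of 8 hours each.
print(f"${hourly_revenue(annual, days=260, hours_per_day=8):,.0f} per working hour")  # $48,077 per working hour
```

Under that working-hours assumption the same outage costs roughly four times as much per hour, which is why the simple 24x7 figure is best treated as a lower bound.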
It is therefore important for any organization to attain the lowest possible RTO in order to minimize the impact of a disaster when one occurs. Your RTO should be defined on a per-application basis so that the recovery of certain applications is prioritized ahead of others, depending on their level of criticality. This has the added benefit of ensuring that revenue-generating applications are recovered first and that the IT staff focus on these before anything else. An example RTO SLA could be:
- CRM System — 4 hour RTO
- Finance System — 4 hour RTO
- Email — 4 hour RTO
- File Servers — 4 hour RTO
- Directory Service — 2 hour RTO
- Print Servers — 24 hour RTO
- Dev Servers — 24 hour RTO
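One way to make a per-application SLA like this actionable is to keep it as data and derive the recovery order from it. The sketch below copies the example values from the list; treating "tightest RTO first" as the priority rule is a simplifying assumption, and real plans may also need to account for dependencies between applications.

```python
# RTO targets in hours, copied from the example SLA above.
RTO_SLA_HOURS = {
    "CRM System": 4,
    "Finance System": 4,
    "Email": 4,
    "File Servers": 4,
    "Directory Service": 2,
    "Print Servers": 24,
    "Dev Servers": 24,
}


def recovery_order(sla):
    """Return application names sorted by tightest RTO first.

    sorted() is stable, so applications with equal RTOs keep the
    order in which they are listed in the SLA.
    """
    return sorted(sla, key=sla.get)


for position, app in enumerate(recovery_order(RTO_SLA_HOURS), start=1):
    print(f"{position}. {app} ({RTO_SLA_HOURS[app]} hour RTO)")
```

With these values the Directory Service is recovered first, the four 4 hour applications follow in their listed order, and the print and dev servers come last.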
Achieving the above RTOs with any BC/DR technology is not as easy as it seems. Just registering and powering on Virtual Machines (VMs) does not give you your true RTO, nor should it be the RTO that you communicate to the business as an achievable SLA.
Registering and powering on a VM is the simplest part of any recovery operation. The most complex and time-consuming parts are:
- Reconfiguring the VMs to run in the recovery site (such as MAC and IP address changes).
- Restoring from a working point in time where data is consistent.
- Finally ensuring all of the applications can communicate with each other and that they are up and running.
All of this should be done before communicating to the business that the application is back online and ready to use. The time that this whole process takes is your actual RTO and is the one that should be defined in your SLA.
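To illustrate why the power-on time alone understates the RTO, the sketch below adds up every step of the process described above. The step durations are hypothetical placeholders chosen for illustration, not measurements; substitute the timings from your own recovery tests.

```python
from datetime import timedelta

# Hypothetical step durations for one application (illustrative only).
recovery_steps = {
    "register and power on VMs": timedelta(minutes=10),
    "reconfigure MAC and IP addresses": timedelta(minutes=45),
    "restore to a consistent point in time": timedelta(minutes=90),
    "verify applications are up and communicating": timedelta(minutes=60),
}


def actual_rto(steps):
    """The actual RTO is the total time of every step, not just power-on."""
    return sum(steps.values(), timedelta())


power_on_only = recovery_steps["register and power on VMs"]
total = actual_rto(recovery_steps)

print(f"Power-on only: {power_on_only}")  # Power-on only: 0:10:00
print(f"Actual RTO:    {total}")          # Actual RTO:    3:25:00
print(f"Within a 4 hour SLA: {total <= timedelta(hours=4)}")  # Within a 4 hour SLA: True
```

In this made-up case, quoting the 10 minute power-on time as the RTO would understate the real recovery time by more than three hours.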
By utilizing a BC/DR technology that can automate the process of registering VMs, powering them on in the correct order and reconfiguring their IP and MAC addresses, you give yourself the best shot at maintaining a low RTO. If this technology also allows you to try a specific point in time and then roll back to an earlier one if the first recovery does not work, recovery is no longer a “one shot” thing but rather a repeatable process.
In order to benchmark your RTO and tweak your BC/DR plan to minimize the time, testing is a must. By testing your plan with a BC/DR technology that allows for testing with no downtime in production and no break in replication, you can perform a test during working hours. First of all this confirms that you are able to recover; you can then run through the recovery operation multiple times to get your RTO as low as possible.
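Repeated test runs are easy to benchmark with a small timing harness. The sketch below is generic Python and does not call any particular BC/DR product; `fake_recovery_test` is a hypothetical stand-in for whatever triggers your real, non-disruptive recovery test.

```python
import time


def timed_recovery_test(run_test, runs=3):
    """Run a recovery test several times and return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_test()
        durations.append(time.perf_counter() - start)
    return durations


def meets_sla(durations, rto_seconds):
    """Judge the SLA against the slowest run, not the average."""
    return max(durations) <= rto_seconds


def fake_recovery_test():
    # Hypothetical placeholder: replace with a call that starts a real
    # test failover and returns once the applications are verified.
    time.sleep(0.01)


durations = timed_recovery_test(fake_recovery_test)
print(f"Slowest run: {max(durations):.2f}s")
print(f"Meets 4 hour RTO: {meets_sla(durations, rto_seconds=4 * 3600)}")
```

Comparing the slowest run against the SLA is a deliberately conservative choice: an RTO that is only met on a good run is not one you can safely promise to the business.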
I hope this has given you a good insight into RTO and the things you should take into consideration when applying RTO SLAs to your applications, and how you prove they are achievable.
Recovery Point Objective (RPO) is another widely used disaster recovery term, which means: “The point in time you can recover to in the event of a disaster”. So, if you have a disaster (data corruption, ransomware, power outage, user error, etc.) then you will lose all of the data written since your last recovery point, up to your set RPO. If you have an RPO of 4 hours on your critical applications then you could lose up to 4 hours of data, as 4 hours ago may be the last point in time to which you can recover.
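As a small illustration of the RPO arithmetic, the sketch below finds the latest recovery point before a disaster and the amount of data lost. The dates, the disaster time and the assumption that recovery points are taken exactly every 4 hours are all hypothetical.

```python
from datetime import datetime, timedelta


def latest_recovery_point(points, disaster_time):
    """Return the most recent recovery point taken at or before the disaster."""
    earlier = [p for p in points if p <= disaster_time]
    if not earlier:
        raise ValueError("no recovery point exists before the disaster")
    return max(earlier)


rpo = timedelta(hours=4)
day = datetime(2024, 1, 15)  # hypothetical date

# Recovery points at 00:00, 04:00, ..., 20:00 (one per RPO interval).
points = [day + i * rpo for i in range(6)]

disaster = datetime(2024, 1, 15, 14, 30)  # hypothetical disaster time
point = latest_recovery_point(points, disaster)
lost = disaster - point

print(f"Recover to: {point}")  # Recover to: 2024-01-15 12:00:00
print(f"Data lost:  {lost}")   # Data lost:  2:30:00
print(f"Within RPO: {lost <= rpo}")  # Within RPO: True
```

The data lost in any one incident depends on when the disaster strikes relative to the last recovery point; the RPO is the upper bound on that loss.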