DR 101: Recovery Point Objective (RPO)
Recovery Point Objective (RPO) is a widely used disaster recovery term. However, it’s often difficult to give a simple definition and explain its importance in the recovery process. But here it goes…
Recovery Point Objective is the point in time you can recover to in the event of a disaster.
So, if you have a disaster (data corruption, ransomware, power outage, user error, etc.), you will lose the data written since that recovery point, up to your set RPO. If you have an RPO of 4 hours on your critical applications, you could lose up to 4 hours of data, because 4 hours ago is the last point in time to which you can recover.
The cost of just 1 hour of lost data can be significant for a business of any size, and the impact grows as the organization scales. Consider the potential impact on a sample organization with a turnover of $100m.
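As a rough illustration of that impact, the sketch below estimates the turnover generated during a single RPO window. It assumes turnover accrues evenly across every hour of the year, which is a simplification and not a figure from this article; `revenue_at_risk` is a hypothetical helper name.

```python
# Assumption: turnover is spread evenly over a 24x7 year.
# Real businesses have peaks and quiet periods, so treat this as an order-of-magnitude guide.
HOURS_PER_YEAR = 365 * 24


def revenue_at_risk(annual_turnover: float, rpo_hours: float) -> float:
    """Estimate the turnover generated during one RPO window."""
    return annual_turnover / HOURS_PER_YEAR * rpo_hours


if __name__ == "__main__":
    for rpo in (1, 4, 8, 24):
        print(f"{rpo:>2}h RPO: ~${revenue_at_risk(100_000_000, rpo):,.0f} of turnover")
```

Under this even-spread assumption, a $100m organization generates roughly $11,400 of turnover per hour, so every extra hour of RPO puts about that much activity at risk.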
It’s impossible to know when a disaster will strike and how much data loss will occur. You could be lucky and have a disaster during off hours and lose no data, but this assumes you even have the concept of “off hours” in your organization. Or you could be really unlucky and have a disaster strike at your busiest period. Either way, expect the unexpected!
Because RPO directly determines how much data you lose, it is recommended to agree on an acceptable and achievable RPO on a per-application basis, with basic SLAs such as the following:
- CRM System — 1 hour RPO
- Finance System — 1 hour RPO
- Email — 2 hour RPO
- File Servers — 4 hour RPO
- Directory Service — 8 hour RPO
- Print Servers — 24 hour RPO
- Development Servers — 24 hour RPO
If you have a BC/DR plan to deliver the above RPOs, you may think you are covered, but you could be wrong.
The reason is that you would always be “red lining” your achieved RPO against your SLA. By replicating on an hourly basis, perhaps with a SAN-based snapshot, the best you will ever do is meet the SLA. If there is a large amount of data change, you might start to miss that SLA and not be able to recover to a point acceptable to the business.
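To see why interval-based replication leaves no headroom, here is a minimal sketch of the worst-case achieved RPO. It assumes a simple model in which a snapshot only becomes recoverable once it has finished transferring to the recovery site, so the worst case is the snapshot interval plus the transfer time. The function name, the decimal units, and the example figures are all illustrative assumptions.

```python
def worst_case_rpo_minutes(interval_min: float, changed_gb: float, link_mbps: float) -> float:
    """Worst-case data loss for periodic snapshots, in minutes.

    Simplified model: just before snapshot N+1 finishes transferring,
    the newest recoverable point is still snapshot N.
    """
    # Decimal units assumed: 1 GB = 8000 megabits.
    transfer_seconds = changed_gb * 8000 / link_mbps
    return interval_min + transfer_seconds / 60


if __name__ == "__main__":
    # Hourly snapshots over a 100 Mbps link with different amounts of changed data.
    for changed in (0, 20, 45):
        rpo = worst_case_rpo_minutes(60, changed, 100)
        print(f"{changed:>2} GB changed -> worst-case RPO of about {rpo:.0f} minutes")
```

In this model an hourly schedule only meets a 1 hour SLA when almost nothing has changed; any meaningful change rate pushes the achieved RPO past it.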
You should always aim to achieve the lowest RPO possible, then configure alerts to warn you when the achieved RPO is getting close to your defined SLA. To ensure low-priority applications don’t impact the RPO of high-priority applications, a priority and Quality of Service (QoS) setting should be applied to individual replication streams. This ensures they are prioritized accordingly during periods of high I/O and/or low bandwidth.
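The alerting idea can be sketched as a simple threshold check. The SLA values below come from the list earlier in this article; the 80% warning threshold and the names `RPO_SLA`, `WARN_FRACTION`, and `rpo_status` are assumptions chosen for illustration, not settings from any particular product.

```python
from datetime import timedelta

# Per-application RPO SLAs, taken from the example list in this article.
RPO_SLA = {
    "CRM System": timedelta(hours=1),
    "Finance System": timedelta(hours=1),
    "Email": timedelta(hours=2),
    "File Servers": timedelta(hours=4),
    "Directory Service": timedelta(hours=8),
    "Print Servers": timedelta(hours=24),
    "Development Servers": timedelta(hours=24),
}

# Hypothetical threshold: warn once the achieved RPO reaches 80% of the SLA.
WARN_FRACTION = 0.8


def rpo_status(app: str, achieved: timedelta, warn_fraction: float = WARN_FRACTION) -> str:
    """Classify an achieved RPO as 'ok', 'warning', or 'breach' against its SLA."""
    sla = RPO_SLA[app]
    if achieved > sla:
        return "breach"
    if achieved >= sla * warn_fraction:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(rpo_status("CRM System", timedelta(minutes=50)))
```

How early to warn is a judgment call: the more headroom you leave, the more time you have to react before the SLA is actually missed.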
By applying QoS you can ensure that any available bandwidth is used to maintain a consistently low RPO across all of your applications. If bandwidth becomes constrained, only the high-priority applications continue to maintain their low RPO.
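One simple way to picture this behaviour is strict-priority allocation, sketched below. This is an illustrative policy only; real replication products may use weighted or more sophisticated schemes. The `Stream` type, the `allocate` function, and the bandwidth figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    name: str
    priority: int        # 1 = highest priority
    demand_mbps: float   # bandwidth needed to keep this stream's RPO low


def allocate(streams: list[Stream], available_mbps: float) -> dict[str, float]:
    """Hand out bandwidth in priority order until none is left."""
    remaining = available_mbps
    allocation: dict[str, float] = {}
    for stream in sorted(streams, key=lambda s: s.priority):
        share = min(stream.demand_mbps, remaining)
        allocation[stream.name] = share
        remaining -= share
    return allocation


if __name__ == "__main__":
    streams = [
        Stream("Development Servers", priority=3, demand_mbps=30),
        Stream("CRM System", priority=1, demand_mbps=40),
        Stream("File Servers", priority=2, demand_mbps=50),
    ]
    print(allocate(streams, 200))  # plenty of bandwidth: every stream is satisfied
    print(allocate(streams, 60))   # constrained: lower priorities are squeezed first
```

With ample bandwidth every stream gets what it asks for; when the link is constrained, the highest-priority stream is served first and the lowest-priority stream absorbs the shortfall.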
Another term you should know is the Recovery Time Objective (RTO), which can be defined as: “The time that it takes to recover data and applications”. In the event of a disaster, such as a system-wide virus, a user error deleting production data, or a key hardware failure, the RTO is the time it will take to recover and have the data and applications back online and running in your recovery site.