Top 5 Disaster Recovery Planning Mistakes
This post was contributed by Joshua Stenhouse, Zerto’s UK-based Solutions Engineer.
I spend every working day discussing disaster recovery and assisting with trials of Zerto Virtual Replication. I also implemented Zerto as an end user myself and so based on all of this experience I’d like to share the top 5 most common disaster recovery mistakes I see people making in virtual infrastructures:
1. Not performing end user acceptance-based testing
In many virtual environments, with complex enterprise applications, simply checking that the services have started and the Virtual Machines (VMs) can communicate isn’t a true disaster recovery test. Yes, it’s important that these are checked, but a real disaster recovery test is an end user (commonly application owner) actually checking that the application works in a test failover. In many applications I have seen all of the services start, and yet vital functionality was broken due to interdependent VMs missing from the failover test. When I implemented Zerto I used a combination of Zerto vCenter permissions to delegate the ability to start failover tests (and not failover!), selected a vCenter folder to bring the VMs online (the application owner also had access to the VM console) and then configured alerts if the application was not tested every 3 months. This then allowed me to simply provide the ability to do failover testing with no impact on production and no break in the replication. Also important: I did this without having to figure out how to actually use any of the 50 applications I was responsible for protecting!
2. Not replicating required VMs
This might seem obvious but until you have completed a successful disaster recovery test (including end user acceptance) then you really don’t always know exactly which VMs are required for disaster recovery. There are many commercial tools for trying to assess connections between servers, but these can be complicated and I often found many connections were in fact redundant. It was only when we performed user acceptance testing that I truly realised exactly which VMs needed to be protected.
3. Presuming a 24 hour Recovery Point Objective is actually ok
I ask many people what the business expects as a reasonable recovery point objective (RPO; the highest amount of data you are willing to lose) and I often get answers like “24 hours is good for us”, “1 hour is ok for my key applications” or “15 minutes is fine”. From my own personal experience while these can be reasonable SLAs, just because the business is ok with this doesn’t mean it doesn’t actually want significantly better! If the business did actually lose this amount of work/data the impact could be many times the cost of actually implementing any disaster recovery solution. Even at the small company with 100 users I worked at when I first started in IT in 2004, we had the potential to lose £10,000+ per hour if data was lost and productivity ceased. The SLA from the business was 24 hours, but I can guarantee if we had ever had a disaster and actually lost a full 24 hours I either would be out of a job the next day or certainly would not be up for a promotion next time around! We never had a disaster but if we did I would have wished for Zerto — providing me with continuous data protection and replication with a RPO of just seconds.
4. Not having a plan to actually use or allow access to the failover VMs
I find this topic is rarely discussed and therefore often the most over looked aspect of disaster recovery. If you have lost your primary site, you’re now running in your recovery site (in minutes and not hours thanks to Zerto!) then how would you give users access to their data and applications? The fact that with Zerto you have only lost seconds of data is great, but the impact on business productivity and revenue will start to increase the longer nobody can actually do any work! I know there are many presumptions to this scenario but I recommend to apply this question to your own environment and ask yourself how could you give all of your users access to their applications and data? VMWare’s Duncan Epping alludes to this in his post, “Prepare for the Worst” where he explains the need to think “more about the strategy, the processes that will need to be triggered in a particular scenario” not just on the IT. My preferred solution was a recovery site VPN and some replicated terminal servers with the instructions for access written into the disaster recovery plan.
5. No written disaster recovery plan
This one builds on #4 above. With the complexity of the old way of doing replication and disaster recovery it is very easy to forget the most important aspect of disaster recovery, actually writing down a plan. Companies focus solely on just trying to get everything replicated between the storage arrays then mapped to the virtual infrastructure in completely different interfaces. In Zerto everything from the replication, management, protection groups, failover and failover testing is managed from our single interface. When I installed Zerto I simply specified my SLAs for replication, created my virtual protection groups, selected the VMs to protect and then Zerto took care of all the replication in the background. How simple is that? This made protecting VMs so easy that it gave me the time to actually write everything down and form a plan!
Hopefully you’ve found my blog on the top 5 most common disaster recovery planning mistakes in virtual infrastructures of interest. Please feel free to add a comment, ask a question or share any of your own commonly-seen disaster recovery mistakes.