Some of us will get around to it tomorrow. Some of us did it a while back and think it’s probably still up-to-date. A small number of us checked it today and know it’s up-to-date.
The disaster recovery plan so often ends up being the poor relation in the family. We all know it is something we should be reviewing regularly but that’s not always what happens in practice.
Some events obviously require you to review the DR plan, such as a building or data centre move, application and system migrations, or changes to key third-party service providers. But what about some of the less obvious events? Even something apparently quite trivial may need to be reviewed to determine whether any changes are required to the DR plan.
Example 1 – Replacing Windows Server 2003
That ageing Windows Server 2003 machine has finally been replaced with a shiny new Windows Server 2012. The critical data was all migrated and the application still runs, although a new version of the software was required for compatibility with Server 2012. So, what are the implications for DR and what questions might you need to answer?
Is the IP address still the same? Is the backup method still the same? Did you have some kit earmarked for DR, and will it still be able to cope with the much larger operating system footprint and the higher demands on system resources such as CPU and RAM? Hopefully this was considered and factored in when the kit was commissioned, but even if you are confident everything is OK, have you actually found time to test that theory since the new kit was brought online?
Other environmental issues seem insignificant at first, but when did you last check out the emergency contact numbers for the Water, Gas and Electricity supplies to the buildings? And that phone number for the company who provides mobile generators, is it still correct?
Do you have a copy of the staff home and mobile phone numbers, and is that still up-to-date?
Regarding walk-through DR tests, it is probably fairly easy to think of scenarios such as the domain controller dying or the critical application becoming corrupted, and then to play through the steps needed to recover those systems. But what about non-technical scenarios that impair your ability to support the business?
Example 2 – Remote working
The local fire department discovers a gas leak near your office and, as a precaution, turns off all power to the area and closes the roads until the environment is safe. For an hour or two that probably isn’t going to be a problem, but what if it lasted 3 days? Some of your systems might be accessible from home via secure links, or maybe your email system is already in the cloud. But what about the hardware that your staff use to do their job? Do you have staff who deal with payments and use secure handheld devices to confirm online banking transactions? How might your cash flow be affected if those devices are unavailable, and have you got a backup plan in place with the bank?
Do any of your staff rely on specific software installed on their work machines that they cannot function without?
Can you remotely redirect the phone system to other numbers?
And what about notifying your customers that there is a problem? Can you get to the website from an external connection? Or do you even want it to be public knowledge that you have a problem?
Example 3 – Restoring all servers at the same time
You perform a full backup of every critical server at least once per week, and have incremental backups usually once per day. You are confident that each individual server can be restored from the last backup, but you don’t have a full DR site in place and have never tested restoring every server at the same time.
This is not an unusual scenario, but potentially unanswered questions relate to RPO and RTO.
RPO, your Recovery Point Objective, may require that you restore separate but dependent systems and servers to the same point in time. So is it OK that your critical application server is restored to Tuesday night at 3am, the web server front end is restored to 12 noon, and the domain controller is restored from the 6pm backup? Maybe that’s OK, but has anything changed in those systems since they were implemented five years ago?
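One way to make that question concrete is to measure the gap between the oldest and newest restore points across a group of dependent servers and compare it with whatever tolerance the business has agreed. The following is only a rough sketch in Python: the server names, the dates and the tolerance are all invented for illustration, and the three times simply echo the example above.

```python
from datetime import datetime, timedelta

# Hypothetical restore points for three dependent servers.
# Names and dates are made up; only the times of day echo the example.
RESTORE_POINTS = {
    "app01": datetime(2024, 1, 9, 3, 0),   # application server, 3am
    "web01": datetime(2024, 1, 9, 12, 0),  # web front end, 12 noon
    "dc01": datetime(2024, 1, 9, 18, 0),   # domain controller, 6pm
}


def restore_point_spread(points):
    """Return the gap between the oldest and newest restore point."""
    return max(points.values()) - min(points.values())


def is_consistent(points, tolerance):
    """True if all restore points fall within the agreed tolerance."""
    return restore_point_spread(points) <= tolerance


if __name__ == "__main__":
    # Assumed tolerance of one hour; yours will depend on the systems involved.
    tolerance = timedelta(hours=1)
    print("Spread:", restore_point_spread(RESTORE_POINTS))
    print("Within tolerance:", is_consistent(RESTORE_POINTS, tolerance))
```

The useful part is not the script itself but being forced to write down an acceptable tolerance for each group of dependent systems.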
Have you reviewed your RTO (Recovery Time Objective) recently? How quickly does the business need to get the systems back up and running? So maybe you know you can restore each server in a couple of hours, but what if you need to restore all 15 servers? And have you reviewed the priority of each restore and the dependencies between servers?
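To put rough numbers on it: if each server takes about two hours and they can only be restored one at a time, 15 servers is in the region of 30 hours. The sketch below, which assumes Python 3.9 or later for the standard-library graphlib module, works out a restore order that respects dependencies and compares the total sequential restore time with an RTO. The server names, durations, dependencies and the 8-hour RTO are all hypothetical, and it assumes restores cannot run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: estimated restore time in hours per server.
RESTORE_HOURS = {"dc01": 2.0, "db01": 3.0, "app01": 2.0, "web01": 1.5}

# Hypothetical dependencies: each server maps to the servers that must
# be restored before it.
DEPENDS_ON = {
    "dc01": set(),
    "db01": {"dc01"},
    "app01": {"dc01", "db01"},
    "web01": {"app01"},
}


def restore_order(depends_on):
    """Return an order in which each server follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def sequential_restore_hours(restore_hours):
    """Total time if servers are restored one after another."""
    return sum(restore_hours.values())


def meets_rto(restore_hours, rto_hours):
    """True if a one-at-a-time restore fits inside the RTO."""
    return sequential_restore_hours(restore_hours) <= rto_hours


if __name__ == "__main__":
    print("Restore order:", restore_order(DEPENDS_ON))
    print("Total hours:", sequential_restore_hours(RESTORE_HOURS))
    print("Meets 8-hour RTO:", meets_rto(RESTORE_HOURS, 8.0))
```

If the total does not fit, the options are the familiar ones: restore in parallel, restore only the highest-priority servers first, or revisit the RTO with the business.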
The message really is that if you don’t have a dedicated person looking after disaster recovery and business continuity planning, then try to set some time aside and check the details. And if you perform tests on a subset of your environment, think about some non-technical issues too.
Steve Harcourt, Senior Information Security Consultant at Redstor