Disaster Recovery in Cloud Computing: Writing and testing a disaster recovery plan

Writing and testing a disaster recovery plan is one of the key elements of business continuity management. Traditionally, business continuity and disaster recovery (DR) planning have been split between the business and the information technology (IT) department.

It has long been recognised that this “divide” creates more problems than it solves. After all, most businesses could not continue to operate successfully if their IT services were unavailable for a period of time; depending on the nature of your business, that period may range from a few hours to several days.

The launch of BS 25999 has established a Business Continuity Management (BCM) standard which intrinsically links BCM, Incident Management, and IT DR. Essentially, the key message is that to have true business continuity you must also have a strong IT DR capability.

A disaster recovery plan should interface with the overall business continuity management plan, be clear and concise, focus on the key activities required to recover the critical IT services, be tested, reviewed, and updated on a regular basis, have an owner, and enable the recovery objectives to be met.

Recovery Objectives

The two key recovery objectives which many people are familiar with are:

* The recovery time objective (RTO): how long can my business continue to function without the critical IT services (how quickly must I recover the service from the “decision to invoke”)?
* The recovery point objective (RPO): from what time in my processing cycle am I going to recover my data (how much data am I prepared to lose or have to re-enter from an alternate source)?

There are several options for the recovery point:

* Zero data loss, recovery to the point of failure
* Start of the current business day (SoD)
* End of the previous business day (EoD)
* Intraday

Intraday is a point between the last available backup (either SoD or EoD) and the failure, for argument’s sake midday. A further option is period end, that is, the weekly or monthly backup.
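As a rough illustration of how the choice of recovery point translates into data loss, the sketch below picks the most recent recovery point before a failure and reports the gap. The function names, timestamps, and backup schedule are all hypothetical and are not taken from any particular backup product.

```python
from datetime import datetime, timedelta


def latest_recovery_point(recovery_points, failure_time):
    """Return the most recent recovery point at or before the failure."""
    usable = [point for point in recovery_points if point <= failure_time]
    if not usable:
        raise ValueError("no recovery point exists before the failure")
    return max(usable)


def data_loss_window(recovery_points, failure_time):
    """Return how much processing would be lost or need re-entering."""
    return failure_time - latest_recovery_point(recovery_points, failure_time)


# Hypothetical recovery points for one business day.
points = [
    datetime(2010, 7, 20, 18, 0),  # end of the previous business day (EoD)
    datetime(2010, 7, 21, 8, 0),   # start of the current business day (SoD)
    datetime(2010, 7, 21, 12, 0),  # intraday (midday)
]
failure = datetime(2010, 7, 21, 15, 30)

loss = data_loss_window(points, failure)
print(loss)  # 3:30:00
```

If the midday recovery point were not available, the same failure would mean recovering to SoD and losing seven and a half hours of processing, which is the kind of trade-off the BIA needs to settle.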

There is one further measure, the Maximum Tolerable Outage (MTO): the maximum time that my business can survive from the initial service interruption.

The recovery objectives must be based upon solid business requirements identified by the Business Impact Analysis (BIA) process.

The figure above demonstrates the correlation between the incident starting, the reporting process, the investigation process, the decision making process, and the recovery process. If the MTO is 12 hours and the IT DR process takes 8 hours to perform from the invocation point, then the decision to invoke has to be made within 4 hours of the initial incident.
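The arithmetic behind that example can be written down as a small sketch. The function name `decision_window` is made up for illustration; the 12 hour and 8 hour figures are the ones used in the paragraph above.

```python
from datetime import timedelta


def decision_window(mto, recovery_duration):
    """Time available between the incident starting and the decision to invoke."""
    window = mto - recovery_duration
    if window < timedelta(0):
        # The DR process alone already exceeds the MTO.
        raise ValueError("recovery takes longer than the MTO allows")
    return window


window = decision_window(timedelta(hours=12), timedelta(hours=8))
print(window)  # 4:00:00
```

A negative result is treated as an error here because it means the recovery objectives cannot be met however quickly the decision is taken.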

Knowing this “lead time” is crucial to implementing an effective incident management and escalation process. The recovery time objective is where most misunderstandings occur between the Business and the IT Department.

The message from IT to the Business is “of course we can recover services within your required recovery time”.

The hidden message is “assuming we start the recovery as soon as the incident is detected”. Generally speaking, many organisations recover from minor incidents or service interruptions well within their MTO.

The following diagram gives a high level incident management and DR invocation flow:

Disaster Recovery Plan Objectives

The key objective of a disaster recovery plan is to detail the key activities required to reinstate the critical IT services within the agreed recovery objectives. The most effective start point for any DR plan is the “Declaration of a Disaster”, made once an incident has been deemed serious enough that “forward fixing” at the primary location is impractical or is likely to result in an outage extending beyond the Maximum Tolerable Outage.

There are a number of common mistakes which organisations make when creating a DR plan. These relate to the level of detail the plans contain and the “standalone” nature of their construction.

So what level of detail should the plan contain?

The answer will depend on who you ask; the more people you ask, the more varied the replies you will receive. It is advisable to keep the IT DR plan as concise as possible and focus only on the key information required at the time of a disaster.

So what information should the DR plan contain?

As a minimum the plan should contain the following information:

A statement detailing the scope and capability of the DR Plan: exactly when this plan should be used and what “consequences” are covered. It is advisable to focus on the consequences of an incident rather than the cause.

Why focus on consequences rather than the cause?

Is it really important why the data centre is destroyed? As far as the DR Plan is concerned, the answer is no. The same process and recovery stages will be followed regardless of the cause: fire, flood, terrorist incident, or the proverbial aircraft impact will all result in the partial or total destruction of the data centre.

The only relevant question is: what is the impact, and can I realistically continue to host services from my primary site, or should I invoke and recover/resume the critical services at my secondary site?

A description of the key roles and responsibilities, so that anyone assigned to a particular role in the recovery team understands what is required of them. Should you name individuals in the plan? Ideally, individuals who are expected to perform a particular role should already be aware that they are likely to be called upon and should have received the relevant training. It is advisable to record the names and contact details of individuals in the relevant section of the overall BCM plan rather than in the DR Plan. There is no reason why the names of the individuals involved can’t be entered into the recovery log at the time as the “designated recovery manager” or other predefined role.

A summary of the critical services, their recovery objectives, and their recovery priorities. This information may be lifted from the Business Impact Analysis (BIA) performed as part of the overall BCM process. Summarising them in the invocation plan will remove the inevitable discussions at the time of the incident and provide a reference point for the recovery teams.

Third party contact details, particularly those that may be required to assist in the recovery effort or those that provide recovery services, for example:

The secondary (DR) data centre service provider. You will need contact details, address, maps, and of course the invocation process and codes. It is advisable to invoke as soon as it becomes clear the incident is likely to become a disaster recovery situation. You can always “stand down” if the incident can be forward fixed (some service providers may levy a charge for this).

Your media handling company. Are your disaster recovery tapes removed from your data centre and vaulted off-site? If so, you will want to arrange for them to be retrieved and sent to your recovery centre at the earliest opportunity.

Mobilisation of the recovery teams. What teams and individuals need to be contacted to recover the services? At this stage of the recovery the incident management team will already know the extent of the incident and should have placed the recovery teams on standby (if not, make sure this is done at the earliest opportunity).

The plan should list the teams and skills required, not individuals. Individual contact details have to be recorded somewhere, and it is normal practice to hold “contact lists” as part of the overall business continuity management programme. Rather than repeat the detailed contact information, the DR Plan should reference the relevant sections in the BCM plan.

Detailed recovery activities and sequence of events, including pre-requisites, dependencies, and responsibilities.

What level of detail should you include in this section of the DR Plan?

This is very much down to personal choice; however, as a minimum you should include:

* the recovery process and flow of activities;
* high level activities, for example, load operating systems, install application software, restore data, synchronise database, make configuration changes, post recovery checks, open service to users;
* pre-requisites and dependencies for each activity;
* responsibilities, that is, who will perform each activity.
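One way to keep the sequence of events consistent with its pre-requisites is to record each activity alongside the activities it depends on and derive the order from that. The sketch below does this with Python's standard `graphlib` module (Python 3.9 or later). The activity names come from the list above, but the dependencies between them are assumptions made purely for illustration; your own environment will differ.

```python
from graphlib import TopologicalSorter

# Each activity maps to the set of activities that must finish first.
# These dependencies are illustrative assumptions, not a recommended sequence.
activities = {
    "load operating systems": set(),
    "install application software": {"load operating systems"},
    "restore data": {"install application software"},
    "synchronise database": {"restore data"},
    "make configuration changes": {"install application software"},
    "post recovery checks": {"synchronise database", "make configuration changes"},
    "open service to users": {"post recovery checks"},
}


def recovery_order(dependencies):
    """Return the activities in an order that respects every pre-requisite.

    TopologicalSorter raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = recovery_order(activities)
for step, activity in enumerate(order, start=1):
    print(step, activity)
```

Activities with no dependency on each other, such as synchronising the database and making configuration changes in this example, may appear in either order, which is also a hint that they could be performed in parallel by different teams.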

Should you include the detailed activities for installing an operating system or restoring a database? The detailed recovery activities should be held locally by the team responsible for performing these activities. The main reason is that the “how do I install Windows” instructions will be used for business as usual activities, minor incidents, and disaster recovery alike. The DR Plan only needs to reference these documents. If you find it absolutely necessary to include them in your DR Plan, then do so as an appendix and not in the main body of the document; don’t allow the key purpose of the DR Plan to be lost in unnecessary or duplicated detail.

Testing the Disaster Recovery Plan

IT DR testing should be performed on a regular basis; the exact frequency very much depends on your own organisational needs. However, it is usual for “full deployment” tests to be performed, as a minimum, on an annual basis. There are of course other “trigger points”, for example, a change in your infrastructure that affects your disaster recovery strategy, such as moving from an active/contingency recovery model to an active/passive recovery model.

What do I test?

This is probably the most common question asked, and the answer is simple: you test the plans, the process, the people, and the infrastructure, in fact every component required to recover and resume your critical IT services.

What are the key objectives of a DR test?

There are several key objectives of a test; the main ones are:

* exercise the recovery processes and procedures;
* familiarise staff with the recovery process and documentation;
* verify the effectiveness of the recovery documentation;
* verify the effectiveness of the recovery site;
* establish if the recovery objectives are achievable;
* identify improvements required to the DR strategy, infrastructure, and recovery processes.
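Establishing whether the recovery objectives are achievable comes down to comparing the times measured during the test with the agreed recovery time objectives. A minimal sketch of that comparison follows; the service names, objectives, and timings are invented for illustration, and a service with no recorded timing is treated as not having met its objective.

```python
from datetime import timedelta


def missed_objectives(rto_by_service, measured_by_service):
    """Return services that exceeded their RTO or were not recovered in the test.

    The value is the measured recovery time, or None if none was recorded.
    """
    missed = {}
    for service, rto in rto_by_service.items():
        measured = measured_by_service.get(service)
        if measured is None or measured > rto:
            missed[service] = measured
    return missed


# Hypothetical objectives from the BIA and timings from a DR test.
rto = {
    "email": timedelta(hours=4),
    "payments": timedelta(hours=2),
    "intranet": timedelta(hours=24),
}
measured = {
    "email": timedelta(hours=3, minutes=30),
    "payments": timedelta(hours=2, minutes=45),
}

missed = missed_objectives(rto, measured)
for service, took in missed.items():
    print(service, "not recovered in test" if took is None else f"took {took}")
```

Recording the result per service rather than per server keeps the focus on what the business actually needs back.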

What is the scope of a DR test?

The scope will very much depend on the maturity of your DR strategy and capability. It is important to scope the test to stretch the objectives and success criteria of the previous test. For example, if this is your first test, you may not want to have the entire user community scheduled to come in and perform lots of testing; you may wish to limit the scope to just IT staff and maybe a couple of “friendly users” to test functionality. Depending on the complexity of your environment, it may take several tests to build confidence before you perform a “full deployment” test.

Common DR testing mistakes are:

Operating within your comfort zone, for example, recovering the servers you know you can recover whilst avoiding the more difficult components.

Not testing the recovery of a service but focusing on the hardware, systems, and applications. Remember, a particular service may require several servers to be recovered; it may also require data held on local drives and network attached devices, and network connectivity from the data centre to the user.

Trying to achieve too much too soon and overstating your DR capability and readiness.

Not planning appropriately. Testing and live invocation are very different: in a live invocation you do not have a live environment to protect. Consider the impact that testing may have on your live services.

Engage with the appropriate people at an early stage; a “full deployment” test may take several weeks to plan.

Wednesday, July 21, 2010
