Wednesday, July 21, 2010

Writing and testing a disaster recovery plan

Writing and testing a disaster recovery plan is one of the key elements of business continuity management. Traditionally, business continuity planning and disaster recovery (DR) planning have been kept separate, split between the business and the information technology (IT) department.

It has long been recognised that this “divide” creates more problems than it solves. After all, most businesses could not continue to operate successfully if their IT services were unavailable for a period of time. Depending on the nature of your business, that period may range from a few hours to several days.

The launch of BS 25999 has established a Business Continuity Management (BCM) standard which intrinsically links BCM, Incident Management, and IT DR. Essentially, the key message is that to have true business continuity you must also have strong IT DR capability.

A disaster recovery plan should:

* interface with the overall business continuity management plan
* be clear and concise
* focus on the key activities required to recover the critical IT services
* be tested, reviewed, and updated on a regular basis
* have an owner
* enable the recovery objectives to be met

Recovery Objectives

The two key recovery objectives which many people are familiar with are:

* the recovery time objective (RTO): how long can my business continue to function without the critical IT services (how quickly must I recover the service from the “decision to invoke”)?
* the recovery point objective (RPO): from what time in my processing cycle am I going to recover my data (how much data am I prepared to lose or have to re-enter from an alternate source)?

There are several options for the recovery point:

* Zero data loss, recovery to the point of failure
* Start of the current business day (SoD)
* End of the previous business day (EoD)
* Intraday

Intraday is a point between the last available backup (either SoD or EoD) and the failure, for argument's sake midday. A further option is period end, that is, the weekly or monthly backup.

There is one additional measure, the Maximum Tolerable Outage (MTO). The MTO is the maximum time that my business will survive from the initial service interruption.

The recovery objectives must be based upon solid business requirements identified by the Business Impact Analysis (BIA) process.

The figure above demonstrates the correlation between the incident starting, the reporting process, the investigation process, the decision-making process, and the recovery process. If the MTO is 12 hours and the IT DR process takes 8 hours to perform from the invocation point, then the decision to invoke has to be made within 4 hours (12 minus 8) of the initial incident.
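The lead-time arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `decision_deadline` and the incident start time are hypothetical, while the 12-hour MTO and 8-hour recovery duration are the figures used in the text.

```python
from datetime import datetime, timedelta


def decision_deadline(incident_start: datetime,
                      mto: timedelta,
                      recovery_duration: timedelta) -> datetime:
    """Return the latest time at which the decision to invoke can be made.

    Recovery must finish before incident_start + mto, so the decision has
    to be made no later than that moment minus the recovery duration.
    """
    lead_time = mto - recovery_duration
    if lead_time < timedelta(0):
        # The recovery itself is longer than the business can tolerate.
        raise ValueError("recovery takes longer than the MTO allows")
    return incident_start + lead_time


# Figures from the text: MTO of 12 hours, recovery takes 8 hours.
start = datetime(2010, 7, 21, 9, 0)  # hypothetical incident start time
deadline = decision_deadline(start, timedelta(hours=12), timedelta(hours=8))
print(deadline - start)  # 4:00:00
```

If the recovery duration measured in testing grows, the same calculation shows how much the decision window shrinks, which is worth feeding back into the escalation process.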

Knowing this “lead time” is crucial to implementing an effective incident management and escalation process. The recovery time objective is where most misunderstanding occurs between the Business and the IT Department.

The message from IT to the Business is “of course we can recover services within your required recovery time”.

The hidden message is “assuming we start the recovery immediately the incident is detected”. Generally speaking, many organisations recover from minor incidents or service interruptions well within their MTO.

The following diagram gives a high level incident management and DR invocation flow:

Disaster Recovery Plan Objectives

The key objective of a disaster recovery plan is to detail the key activities required to reinstate the critical IT services within the agreed recovery objectives. The most effective start point for any DR plan is the “Declaration of a Disaster”, made once an incident has been deemed serious enough that “forward fixing” at the primary location is impractical or is likely to result in an outage extending beyond the Maximum Tolerable Outage.

There are a number of common mistakes which organisations make when creating a DR plan. These relate to the level of detail the plan contains and the “standalone” nature of its construction.

So what level of detail should the plan contain?

The answer will depend on who you ask; the more people you ask, the more varied the replies you will receive. It is advisable to keep the IT DR plan as concise as possible and focus only on the key information required at the time of a disaster.

So what information should the DR plan contain?

As a minimum, the plan should contain the following information:

A statement detailing the scope and capability of the DR Plan: exactly when should this plan be used, and what “consequences” are covered? It is advisable to focus on the consequences of an incident rather than the cause.

Why focus on consequences rather than the cause?

Is it really important why the data centre is destroyed? As far as the DR Plan is concerned, the answer is no. The same process and recovery stages will be followed regardless of the cause: fire, flood, terrorist incident, or the proverbial aircraft impact will all result in the partial or total destruction of the data centre.

The only relevant question is: what is the impact, and can I realistically continue to host services from my primary site, or should I invoke and recover/resume the critical services at my secondary site?

A description of the key roles and responsibilities, so that anyone assigned to a particular role in the recovery team understands what is required of them. Should you name individuals in the plan? Ideally, individuals who are expected to perform a particular role should already be aware that they are likely to be called upon and should have received the relevant training. It is advisable to record the names and contact details of individuals in the relevant section of the overall BCM plan rather than in the DR Plan. There is no reason why the names of the individuals involved can’t be entered into the recovery log at the time, as the “designated recovery manager” or other predefined role.

A summary of the critical services, their recovery objectives, and recovery priorities. This information may be lifted from the Business Impact Analysis (BIA) performed as part of the overall BCM process. Summarising them in the invocation plan will remove the inevitable discussions at the time of the incident and provide a reference point for the recovery teams.

Third party contact details, particularly for those that may be required to assist in the recovery effort or those that provide recovery services, for example:

* The secondary (DR) data centre service provider. You will need contact details, address, maps, and of course the invocation process and codes. It is advisable to invoke as soon as it becomes clear the incident is likely to become a disaster recovery situation. You can always “stand down” if the incident can be forward fixed (some service providers may levy a charge for this).
* Your media handling company. Are your disaster recovery tapes removed from your data centre and vaulted off-site? If so, you will want to arrange for them to be retrieved and sent to your recovery centre at the earliest opportunity.

Mobilisation of the recovery teams. What teams and individuals need to be contacted to recover the services? At this stage of the recovery the incident management team will already know the extent of the incident and should have placed the recovery teams on standby (if not, make sure this is done at the earliest opportunity).

The plan should list the teams and skills required, not individuals. Individual contact details have to be recorded somewhere, and it is normal practice, as part of the overall business continuity management program, to have “contact lists”. Rather than repeat the detailed contact information, the DR Plan should reference the relevant sections in the BCM plan.

Detailed recovery activities and sequence of events, including pre-requisites, dependencies, and responsibilities.

What level of detail should you include in this section of the DR Plan?

This is very much down to personal choice. However, as a minimum you should include:

* the recovery process and flow of activities;
* high level activities, for example: load operating systems, install application software, restore data, synchronise database, make configuration changes, post recovery checks, open service to users;
* pre-requisites and dependencies for each activity;
* responsibilities, that is, who will perform each activity.
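One way to keep the sequence of activities consistent with their pre-requisites is to record the dependencies as data and derive the order from them. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The activity names come from the list above, but the dependencies between them are assumptions made purely for illustration; your own environment will differ.

```python
from graphlib import TopologicalSorter

# Each activity maps to the set of activities that must finish first.
# These dependencies are hypothetical examples, not a recommended sequence.
prerequisites = {
    "load operating systems": set(),
    "install application software": {"load operating systems"},
    "restore data": {"install application software"},
    "synchronise database": {"restore data"},
    "make configuration changes": {"install application software"},
    "post recovery checks": {"synchronise database",
                             "make configuration changes"},
    "open service to users": {"post recovery checks"},
}

# static_order() yields every activity after all of its pre-requisites and
# raises graphlib.CycleError if the dependencies contradict each other.
order = list(TopologicalSorter(prerequisites).static_order())

for step, activity in enumerate(order, start=1):
    print(step, activity)
```

Activities with no dependency on each other (here, restoring data and making configuration changes) may appear in either order, which is a hint that different teams could perform them in parallel.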

Should you include the detailed activities for installing an operating system or restoring a database? The detailed recovery activities should be held locally by the team responsible for performing them. The main reason is that the “how do I install Windows” instructions will be used for business as usual activities, minor incidents, and disaster recovery alike. The DR Plan only needs to reference these documents. If you find it an absolute necessity to include them in your DR Plan, then do so as an appendix and not in the main body of the document; don’t allow the key purpose of the DR Plan to be lost in unnecessary or duplicated detail.

Testing the Disaster Recovery Plan

IT DR testing should be performed on a regular basis; the exact frequency very much depends on your own organisational needs. However, it is usual for “full deployment” tests to be performed, as a minimum, on an annual basis. There are of course other “trigger points”, for example, a change in your infrastructure that affects your disaster recovery strategy, such as moving from an active/contingency recovery model to an active/passive recovery model.

What do I test?

This is probably the most common question asked, and the answer is simple: you test the plans, the process, the people, and the infrastructure, in fact every component required to recover and resume your critical IT services.

What are the key objectives of a DR test?

There are several key objectives of a test. The main ones are:

* exercise the recovery processes and procedures;
* familiarise staff with the recovery process and documentation;
* verify the effectiveness of the recovery documentation;
* verify the effectiveness of the recovery site;
* establish if the recovery objectives are achievable;
* identify improvements required to the DR strategy, infrastructure, and recovery processes.
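Establishing whether the recovery objectives are achievable comes down to comparing the time each service actually took to recover during the test with its agreed recovery time objective. A minimal sketch follows, in which the service names, objectives, and measured times are all invented for illustration:

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ServiceResult:
    service: str
    rto: timedelta       # agreed recovery time objective
    measured: timedelta  # time actually taken during the test

    @property
    def achieved(self) -> bool:
        """True when the service was recovered within its objective."""
        return self.measured <= self.rto


def failed_objectives(results):
    """Return the names of services that missed their objective."""
    return [r.service for r in results if not r.achieved]


# Hypothetical test results.
results = [
    ServiceResult("email", timedelta(hours=8), timedelta(hours=6, minutes=30)),
    ServiceResult("payments", timedelta(hours=4), timedelta(hours=5)),
]
print(failed_objectives(results))  # ['payments']
```

Any service listed here is a candidate for the "improvements required" objective: either the recovery process needs to get faster or the objective needs to be renegotiated with the business.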

What is the scope of a DR test?

The scope will very much depend on the maturity of your DR strategy and capability. It is important to scope each test to stretch the objectives and success criteria of the previous test. For example, if this is your first test, you may not want to have the entire user community scheduled to come in and perform lots of testing; you may wish to limit the scope to just IT staff and maybe a couple of “friendly users” to test functionality. Depending on the complexity of your environment, it may take several tests to build confidence before you perform a “full deployment” test.

Common DR testing mistakes are:

* Operating within your comfort zone, for example, recovering the servers you know you can recover whilst avoiding the more difficult components.
* Not testing the recovery of a service but focusing on the hardware, systems, and applications. Remember, a particular service may require several servers to be recovered; it may also require data held on local drives and network attached devices, and network connectivity from the data centre to the user.
* Trying to achieve too much too soon and overstating your DR capability and readiness.
* Not planning appropriately. Testing and live invocation are very different: in a live invocation you do not have a live environment to protect. Consider the impact that testing may have on your live services, and engage with the appropriate people at an early stage; a “full deployment” test may take several weeks to plan.
