- Clearly define organizational responsibilities. Roles and responsibilities are a major area where organizations fall short with regard to disaster recovery. The DR process is much more than restoring or replicating data; it's about ensuring that the applications and systems they support can be returned to functional business usage. Accomplishing this requires participation from groups outside of IT, including corporate governance and oversight groups, finance groups and the business units impacted.
 - Validate the business  impact analysis (BIA) process. Technically, the BIA isn't part  of the disaster recovery process -- it's a prerequisite that forms the  foundation of DR planning. In a perfect world, the output of a business  impact analysis would define the kinds of recovery capabilities IT must  design and deliver in support of the business. The real world,  unfortunately, isn't so simple. Information is often incomplete, and we  need to make assumptions to fill in the gaps. 
 - Define and tier application recovery services. When business executives hear IT people talking about disaster recovery strategy, they're thinking cost. DR is a form of insurance, and because no one wants to spend too much on insurance, efficiency is vital. While there are significant fixed costs inherent to DR -- a recovery site, for example -- there are also a substantial number of variable costs that can be controlled. The key is to realize that not every application requires a two-hour recovery time. One way to contain costs is to establish a catalog of services, based on business impact analysis requirements, that provides several levels of recovery, and then to align each application with the appropriate level. With multilevel recovery services, applications can be prioritized according to importance. Among the business attributes that should be defined within the service catalog are risk (usually expressed in terms of recovery time objective and recovery point objective), quality of service (including performance and consistency levels) and cost.
 - Implement a comprehensive cost model. While the  business impact analysis determines the impact of downtime to a line of  business, and tiered recovery services provide a catalog of services  that align with business requirements, there also needs to be a method  to determine and allocate the cost of those services. Corporate  governance may help set thresholds for recovery and imply minimum levels  of protection, but the service level is greatly influenced by cost. The  cost model should calculate the per-unit total cost of ownership that  would be charged to the business for any given service offering. Among  the items included in such a cost model are personnel, facilities,  hardware and software, maintenance and support. Having this data  available helps significantly in aligning "want" with "need," and is a  critical success factor in delivering these services efficiently.  
 - Design an effective disaster recovery infrastructure.  The disaster recovery infrastructure must support the business impact  analysis requirements and service-level targets. While disaster recovery  is an extension of operational recovery capability, factors such as  distance and bandwidth also come into play. The good news is that the  number of remote recovery options available to architects and designers  has increased dramatically over the past few years. Traditional storage  mirroring and replication are more broadly available on a wide range of  systems, and compression and deduplication technologies can reduce  bandwidth requirements. In addition, technologies like server  virtualization can dramatically improve remote recoverability.  
 - Select the right target recovery site. Disaster  recovery site selection often presents a challenge. Organizations with  multiple data centers can develop cross-site recovery capabilities; if  you don't have that option, selecting a DR site can easily become the  biggest challenge in getting disaster recovery off the ground.   Key concerns include the levels of protection needed, and whether to own  or outsource  disaster recovery (and to what degree). The two chief, and often  competing, factors to consider are risk and convenience. Planning for  protection against a regional disaster means that many DR sites get  pushed far away from headquarters, where most of the IT staff is housed.  Service recovery levels will determine whether the site is a "hot,"  "warm" or "cold" site. This is a critical designation because there's a  substantial difference in the fixed cost of each. Generally, recovery  time objectives (RTOs) of less than a day require a hot site. The  question of outsourcing depends on the desired degree of control,  guarantees of infrastructure availability at a given location and, of  course, cost. 
 - Establish mature operational disciplines. Some  people point out that one of the best ways to improve disaster recovery  is to improve production. Put another way, if normal day-to-day  operations don't tend to function well, neither will your disaster  recovery plan. Therefore, operational discipline is an essential element  of predictable DR. The first sign of a potential operational deficiency  is the lack of documentation for key processes. Given that disaster  recovery, by definition, occurs under seriously sub-optimal conditions,  the need for well-documented standard operating procedures is clear.  Organizations that have established and actively embraced standard  frameworks, like the Information Technology Infrastructure Library  (ITIL), are significantly improving their odds of recoverability in the  chaotic atmosphere of a disaster situation.  
 - Develop a realistic testing methodology. Given the operational disruption, practical difficulties and costs involved, we tend to focus our testing on those components that are easy to test. But realistic testing is just that -- testing real business function recovery. While it's necessary to perform component testing on a regular basis, it's equally important to test the recoverability of large-scale functions to ensure that interoperability and interdependency issues are consistently addressed. The closer to a real production environment a test can get, the more "provable" the DR capability.
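
The tiered service catalog and per-unit cost model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tier names, the RTO/RPO thresholds, the dollar figures and the choice of "protected terabyte" as the charging unit are all hypothetical and do not come from any standard. The cost categories (personnel, facilities, hardware and software, maintenance and support) are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTier:
    """One entry in a hypothetical recovery service catalog."""
    name: str
    rto_hours: float              # recovery time objective the tier commits to
    rpo_hours: float              # recovery point objective the tier commits to
    cost_items: dict[str, float]  # annual cost per category (illustrative figures)
    protected_tb: float           # capacity the tier protects (assumed charging unit)

    def per_unit_tco(self) -> float:
        """Annual total cost of ownership per protected terabyte."""
        if self.protected_tb <= 0:
            raise ValueError("protected_tb must be positive")
        return sum(self.cost_items.values()) / self.protected_tb


# Illustrative catalog; every number here is an assumption.
CATALOG = [
    RecoveryTier("bronze", rto_hours=72, rpo_hours=24, protected_tb=120,
                 cost_items={"personnel": 20_000, "facilities": 10_000,
                             "hardware_software": 20_000,
                             "maintenance_support": 10_000}),
    RecoveryTier("silver", rto_hours=24, rpo_hours=4, protected_tb=80,
                 cost_items={"personnel": 80_000, "facilities": 60_000,
                             "hardware_software": 70_000,
                             "maintenance_support": 30_000}),
    RecoveryTier("gold", rto_hours=2, rpo_hours=0.25, protected_tb=90,
                 cost_items={"personnel": 300_000, "facilities": 250_000,
                             "hardware_software": 250_000,
                             "maintenance_support": 100_000}),
]


def assign_tier(required_rto_hours: float, required_rpo_hours: float,
                catalog: list[RecoveryTier] = CATALOG) -> RecoveryTier:
    """Return the cheapest tier (per unit) that meets both BIA requirements."""
    candidates = [t for t in catalog
                  if t.rto_hours <= required_rto_hours
                  and t.rpo_hours <= required_rpo_hours]
    if not candidates:
        raise ValueError("no tier in the catalog meets the requirement")
    return min(candidates, key=lambda t: t.per_unit_tco())


def annual_charge(tier: RecoveryTier, app_tb: float) -> float:
    """Amount charged back to a business unit for one application."""
    return tier.per_unit_tco() * app_tb


if __name__ == "__main__":
    # A hypothetical application whose BIA calls for a 48-hour RTO and a
    # 12-hour RPO, with 5 TB of data to protect.
    tier = assign_tier(required_rto_hours=48, required_rpo_hours=12)
    print(tier.name, annual_charge(tier, app_tb=5))
```

Putting a price next to each tier is what makes the "want" versus "need" conversation concrete: a business unit asking for the top tier sees the per-unit charge it would carry, and an application that tolerates a longer outage lands on a cheaper tier automatically.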
 
The elements outlined here transcend the boundaries of the IT infrastructure. It's therefore critical for IT administrators to have a strong understanding of the problems at hand and to learn how to address them so they can influence strategic disaster recovery decision-making wherever possible. This will help them avoid being placed in the Catch-22 situation of having to solve a problem over which they have no control.