Monday, July 19, 2010

Disaster Recovery Requirements

This section discusses the key requirements for an effective
DR service. Some of these requirements may be
based on business decisions such as the monetary cost of
system downtime or data loss, while others are directly
tied to application performance and correctness.

Recovery Point Objective (RPO): The RPO of a DR
system represents the point in time of the most recent
backup prior to any failure, and so bounds how much
recent data may be lost. The necessary RPO is generally
a business decision: for some applications absolutely
no data can be lost (RPO=0), requiring continuous
synchronous replication, while for other applications
the acceptable data loss could range from a few
seconds to hours or even days.
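
As a rough illustration of how an RPO can be checked, the sketch
below compares the gap between the last replicated point and the
moment of failure against a target. The function names and the
example timestamps are hypothetical and not part of any
particular DR product.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated: datetime, failure: datetime) -> timedelta:
    """Span of writes that exist only at the primary site when it fails."""
    if failure < last_replicated:
        raise ValueError("failure time precedes the last replicated point")
    return failure - last_replicated


def meets_rpo(last_replicated: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True if the data lost at failure time stays within the RPO."""
    return data_loss_window(last_replicated, failure) <= rpo


# Hypothetical scenario: last backup at 12:00, failure at 12:07.
last_backup = datetime(2010, 7, 19, 12, 0)
failure_time = datetime(2010, 7, 19, 12, 7)

print(meets_rpo(last_backup, failure_time, timedelta(minutes=15)))  # True
print(meets_rpo(last_backup, failure_time, timedelta(minutes=5)))   # False
print(meets_rpo(last_backup, failure_time, timedelta(0)))           # False
```

Under this simple model, RPO=0 is met only when the last
replicated point coincides with the failure, which is why it
calls for synchronous replication.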

Recovery Time Objective (RTO): The RTO is an orthogonal
business decision that specifies a bound on how
long it can take for an application to come back online
after a failure occurs. This includes the time to detect the
failure, prepare any required servers in the backup site
(virtual or physical), initialize the failed application, and
perform the network reconfiguration required to reroute
requests from the original site to the backup site so the
application can be used. Depending on the application
type and backup technique, this may involve additional
manual steps such as verifying the integrity of state or
performing application-specific data restore operations,
and may require careful scheduling of recovery tasks so
that they complete efficiently [7]. Having a very low RTO
can enable business continuity, allowing an application to
seamlessly continue operating despite a site-wide disaster.
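
The components of recovery time listed above can be added up in
a small sketch. It assumes the phases run strictly one after
another, which is a simplification; the class name, field names,
and the numbers in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPhases:
    """Durations, in seconds, of the steps needed to bring an application back."""
    detect_s: float        # time to detect the failure
    provision_s: float     # prepare servers at the backup site
    init_app_s: float      # initialize the failed application
    reroute_s: float       # network reconfiguration to redirect requests
    manual_s: float = 0.0  # optional manual steps, e.g. integrity checks

    def total_s(self) -> float:
        # Assumption: phases are sequential, so their durations simply add.
        return (self.detect_s + self.provision_s + self.init_app_s
                + self.reroute_s + self.manual_s)


def meets_rto(phases: RecoveryPhases, rto_s: float) -> bool:
    """True if the estimated recovery time is within the RTO bound."""
    return phases.total_s() <= rto_s


# Hypothetical estimate: 30 s detection, 120 s provisioning,
# 60 s application start, 90 s rerouting, no manual steps.
plan = RecoveryPhases(detect_s=30, provision_s=120, init_app_s=60, reroute_s=90)
print(plan.total_s())        # 300
print(meets_rto(plan, 600))  # True
print(meets_rto(plan, 240))  # False
```
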

Performance: For a DR service to be useful it must
have a minimal impact on the performance of each application
being protected under failure-free operation. DR
can impact performance either directly, as in the synchronous
replication case where an application write will
not return until it is committed remotely, or indirectly, by
consuming disk and network bandwidth that the application
could otherwise use.
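
The direct cost of synchronous replication can be seen in a
simple latency model. The sketch below assumes the remote write
is issued in parallel with the local one and ignores queuing and
bandwidth limits; the function and parameter names are
hypothetical.

```python
def write_latency_ms(local_commit_ms: float, rtt_ms: float,
                     remote_commit_ms: float, synchronous: bool) -> float:
    """Latency seen by the application for a single write.

    Assumption: local and remote writes start at the same time.
    """
    if synchronous:
        # The write returns only after the remote site has committed
        # and its acknowledgement has travelled back.
        return max(local_commit_ms, rtt_ms + remote_commit_ms)
    # Asynchronous: the write returns after the local commit;
    # replication happens in the background.
    return local_commit_ms


# Hypothetical numbers: 1 ms commits at both sites, 10 ms round trip.
print(write_latency_ms(1.0, 10.0, 1.0, synchronous=True))   # 11.0
print(write_latency_ms(1.0, 10.0, 1.0, synchronous=False))  # 1.0
```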

Consistency: The DR service must ensure that after a
failure occurs the application can be restored to a consistent
state. This may require the DR mechanism to
be application-specific to ensure that all relevant state is
properly replicated to the backup site. In other cases, the
DR system may assume that the application will keep a
consistent copy of its important state on disk, and use a
disk replication scheme to create consistent copies at the
backup site.

Geographic Separation: It is important that the primary
and backup sites are geographically separated in order
to ensure that a single disaster will not impact both
sites. This geographic separation adds its own challenges,
since increased distance leads to higher WAN bandwidth
costs and greater network latency. Increased
round trip latency directly impacts application response
time when using synchronous replication. As round trip
delays are limited by the speed of light, synchronous
replication is feasible only when the backup site is within
tens of kilometers of the primary. Asynchronous techniques
can improve performance over longer distances,
but can lead to greater data loss during a disaster. Distance
can be especially challenging in cloud-based DR
services, as a business might have only coarse control over
where resources will be physically located.
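
The speed-of-light limit can be made concrete with a short
calculation. The sketch assumes signals travel through optical
fiber at roughly 200 km per millisecond (about two thirds of the
speed of light in vacuum) and that the fiber follows a straight
path; real routes are longer and add switching delay, so these
figures are lower bounds. The function names are hypothetical.

```python
# Assumption: propagation speed in fiber, about 2/3 of c (c is ~300 km/ms).
FIBER_KM_PER_MS = 200.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round trip time between sites distance_km apart."""
    return 2.0 * distance_km / FIBER_KM_PER_MS


def max_distance_km(rtt_budget_ms: float) -> float:
    """Greatest separation whose minimum round trip fits the budget."""
    return rtt_budget_ms * FIBER_KM_PER_MS / 2.0


print(min_round_trip_ms(50))    # 0.5
print(min_round_trip_ms(1000))  # 10.0
print(max_distance_km(1.0))     # 100.0
```

Under these assumptions, a backup site 50 km away adds at least
0.5 ms to every synchronous write, while one 1000 km away adds
at least 10 ms.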
