Disaster Recovery is primarily a form of long-distance
state replication combined with the ability to start up applications
at the backup site after a failure is detected.
The amount and type of state that is sent to the backup
site can vary depending on the application’s needs. State
replication can be done at one of these layers: (i) within
an application, (ii) per disk or within a file system, or
(iii) for the full system context. Replication at the application
layer can be the most optimized, only transferring
the crucial state of a specific application. For example,
some high-end database systems replicate state by transferring
only the database transaction logs, which can be
more efficient than sending the full state modified by each
query [8]. Backup mechanisms operating at the file system
or disk layer replicate all or a portion of the file system
tree to the remote site without requiring specific application
knowledge [6]. The use of virtualization makes
it possible to not only transparently replicate the complete
disk, but also the memory context of a virtual machine,
allowing it to seamlessly resume operation after a
failure; however, such techniques are typically designed
only for LAN environments due to significant bandwidth
and latency requirements [4, 9].
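The application-layer approach described above, shipping only a log of changes rather than full state, can be illustrated with a minimal sketch. This is a toy key-value store, not how any particular database implements log shipping; the names Store, Replica, and ship_log are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy primary-site store that records every write in an ordered log."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # entries are (seq, key, value)

    def put(self, key, value):
        self.data[key] = value
        # Sequence numbers start at 1 and grow by one per write.
        self.log.append((len(self.log) + 1, key, value))


@dataclass
class Replica:
    """Backup-site copy kept current by replaying shipped log entries."""
    data: dict = field(default_factory=dict)
    applied_seq: int = 0

    def apply(self, entries):
        for seq, key, value in entries:
            if seq <= self.applied_seq:
                continue  # already applied, so replaying is harmless
            self.data[key] = value
            self.applied_seq = seq


def ship_log(primary, replica):
    """Send only the log entries the replica has not yet applied."""
    pending = primary.log[replica.applied_seq:]
    replica.apply(pending)
    return len(pending)


# Usage: two writes are shipped, then one more write is shipped on its own.
primary, backup = Store(), Replica()
primary.put("a", 1)
primary.put("b", 2)
first_batch = ship_log(primary, backup)
primary.put("a", 3)
second_batch = ship_log(primary, backup)
```

The second call transfers a single log entry even though the store holds two keys, which is the efficiency argument made for log-based replication.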
The level of data protection and the speed of recovery depend
on the type of backup mechanism used and the nature
of the resources available at the backup site. In general,
DR services fall into one of the following categories:
Hot Backup Site: A hot backup site typically provides
a set of mirrored stand-by servers that are always available
to run the application once a disaster occurs, providing
minimal RTO and RPO. Hot standbys typically use
synchronous replication to prevent any data loss due to a
disaster. This form of backup is the most expensive since
fully powered servers must be available at all times to
run the application, plus extra licensing fees may apply
for some applications. It can also have the largest impact
on normal application performance since network latency
between the two sites increases response times.
Warm Backup Site: A warm backup site may keep
state up to date with either synchronous or asynchronous
replication schemes depending on the necessary RPO.
Standby servers to run the application after failure are
available, but are only kept in a “warm” state where it
may take minutes to bring them online. This slows recovery,
but also reduces cost; the server resources to run
the application need to be available at all times, but active
costs such as electricity and network bandwidth are
lower during normal operation.
Cold Backup Site: In a cold backup site, data is often
only replicated on a periodic basis, leading to an RPO
of hours or days. In addition, servers to run the application
after failure are not readily available, and there may
be a delay of hours or days as hardware is brought out
of storage or repurposed from test and development systems,
resulting in a high RTO. It can be difficult to support
business continuity with cold backup sites, but they
are a very low cost option for applications that do not
require strong protection or availability guarantees.
The on-demand nature of cloud computing means that
it provides the greatest cost benefit when peak resource
demands are much higher than average case demands.
This suggests that cloud platforms can provide the greatest
benefit to DR services that require warm stand-by
replicas. In this case, the cloud can be used to cheaply
maintain the state of an application using low cost resources
under ordinary operating conditions. Only after
a disaster occurs must a cloud-based DR service pay for
the more powerful, and more expensive, resources required to
run the full application, and it can add these resources
in a matter of seconds or minutes. In contrast, an enterprise
using its own private resources for DR must always
have servers available to meet the resource needs of the
full disaster case, resulting in a much higher cost during
normal operation.
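The cost argument above can be made concrete with a small worked calculation. All prices below are invented purely for illustration (they are not taken from the text or from any provider), and treating a private site's cost as an hourly rate is a simplifying assumption.

```python
HOURS_IN_MONTH = 720  # assumption: a 30-day month


def monthly_cost_cents(standby_rate, full_rate, disaster_hours):
    """Cost of one month, given hourly rates in cents.

    standby_rate applies during normal operation, full_rate applies
    while the backup site is running the full application.
    """
    normal_hours = HOURS_IN_MONTH - disaster_hours
    return standby_rate * normal_hours + full_rate * disaster_hours


# Hypothetical rates: a small replication-only instance versus the
# servers needed to run the whole application.
STANDBY_CENTS = 10
FULL_CENTS = 200

# Cloud DR pays the low rate until a disaster occurs.
cloud_quiet = monthly_cost_cents(STANDBY_CENTS, FULL_CENTS, disaster_hours=0)
cloud_outage = monthly_cost_cents(STANDBY_CENTS, FULL_CENTS, disaster_hours=24)

# A private site must keep full capacity available at all times, so its
# "standby" rate equals its full rate.
private_quiet = monthly_cost_cents(FULL_CENTS, FULL_CENTS, disaster_hours=0)
private_outage = monthly_cost_cents(FULL_CENTS, FULL_CENTS, disaster_hours=24)
```

Under these assumed rates the private site costs the same whether or not a disaster occurs, while the cloud service's cost grows only with the hours actually spent in failover.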
2.3 Failover and Failback
In addition to managing state replication, a DR solution
must be able to detect when a disaster has occurred, perform
a failover procedure to activate the backup site, and
run the failback steps necessary to return control
to the primary data center once the disaster
has been dealt with. Detecting when a disaster has occurred
is a challenging problem since transient failures
or network segmentation can trigger false alarms. In
practice, most DR techniques rely on manual detection
and failover mechanisms. Cloud-based systems can simplify
this problem by monitoring the primary data center
from cloud nodes distributed across different geographic
regions, making it easier to determine the extent of a
network failure and react accordingly.
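One way to act on reports from several vantage points is a simple majority vote, so that a failure seen from only one region is treated as a local network problem rather than a disaster. The sketch below is an assumption about how such monitoring might be combined, not a mechanism described in the text; the function names and region labels are hypothetical.

```python
import socket


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to the primary site succeeds.

    Intended to run on each monitoring node; it is not called here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def primary_is_down(reports, quorum=None):
    """Decide whether to declare a disaster.

    reports maps a monitor's region name to True if that monitor
    reached the primary. By default a strict majority of monitors
    must report failure before the primary is considered down.
    """
    if not reports:
        raise ValueError("need at least one monitor report")
    if quorum is None:
        quorum = len(reports) // 2 + 1
    failures = sum(1 for reachable in reports.values() if not reachable)
    return failures >= quorum


# Usage with made-up region names: one unreachable path is not enough.
partial_outage = primary_is_down(
    {"region-a": False, "region-b": True, "region-c": True}
)
full_outage = primary_is_down(
    {"region-a": False, "region-b": False, "region-c": False}
)
```

A production detector would likely also require several consecutive failed probes before voting, to filter out the transient failures mentioned above.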
In most cases, a disaster will eventually pass, and a
business will want to revert control of its applications
back to the original site. To do this, the DR software
must support bidirectional state replication so that any
new data that was created at the backup site during the
disaster can be transferred back to the primary. Doing
this efficiently can be a major challenge: the primary site
may have lost an arbitrary amount of data due to the disaster,
so the replication software must be able to determine
what new and old state must be resynchronized to
the original site. In addition, the failback procedure must
be scheduled and carried out so as to minimize
application downtime.
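Determining what must be resynchronized can be sketched as a comparison of per-block digests between the two sites, so that only blocks that differ or are missing are copied back. This is a minimal illustration under the assumption that state is divided into identifiable blocks and that the backup site holds the authoritative copy after failover; blocks_to_resync and the block layout are hypothetical.

```python
import hashlib


def digest(block):
    """Fingerprint a block so sites can compare it without sending it."""
    return hashlib.sha256(block).hexdigest()


def blocks_to_resync(backup, primary):
    """Return ids of blocks that must be copied from backup to primary.

    Both arguments map a block id to that block's bytes. A block needs
    to be sent if the primary lost it or holds different content.
    """
    stale = []
    for block_id, content in backup.items():
        old = primary.get(block_id)
        if old is None or digest(old) != digest(content):
            stale.append(block_id)
    return sorted(stale)


# Usage: block 0 survived unchanged, block 1 was updated at the backup
# site during the disaster, and block 2 was lost at the primary.
backup_state = {0: b"unchanged", 1: b"written during failover", 2: b"lost"}
primary_state = {0: b"unchanged", 1: b"old contents"}
to_copy = blocks_to_resync(backup_state, primary_state)
```

In a real deployment only the digests would cross the network during the comparison phase; the sketch compares in-memory dictionaries to keep the example self-contained.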