Monday, July 19, 2010

Challenges for the Cloud Provider (Disaster Recovery)

Handling Correlated Failures

Typically a cloud provider will attempt to statistically
multiplex its DR customers onto its server pool. Such
statistical multiplexing assumes that not all of its customers
will experience simultaneous failures, and hence
the number of free servers that the cloud providers must
have available is smaller than the peak needs of all its customers.
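
As a rough illustration of the multiplexing argument, the sketch below sizes a shared pool under the assumption that customers fail independently. The function name, the parameters, and the binomial failure model are illustrative assumptions, not part of the original discussion.

```python
from math import comb

def servers_needed(servers_per_customer, n_customers, p_fail, shortfall_prob):
    """Smallest pool (in servers) whose chance of being exceeded is at most
    shortfall_prob, ASSUMING each customer fails independently with
    probability p_fail and needs servers_per_customer servers on failover."""
    cumulative = 0.0
    for k in range(n_customers + 1):
        # Binomial probability that exactly k customers fail at once.
        cumulative += comb(n_customers, k) * p_fail**k * (1 - p_fail)**(n_customers - k)
        # Stop once the probability of more than k simultaneous failures is small enough.
        if 1.0 - cumulative <= shortfall_prob:
            return k * servers_per_customer
    return n_customers * servers_per_customer

# Hypothetical numbers: 100 customers, 10 servers each, 1% failure probability.
pool = servers_needed(10, 100, 0.01, 0.001)
peak = 100 * 10
```

With these made-up inputs the pool comes to 50 servers against a peak need of 1,000. The saving depends entirely on the independence assumption.
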
However, correlated failures across customers are
not uncommon: for instance, an electric grid failure or
a natural disaster such as a flood can cause a large number
of customers from a geographic area to simultaneously
fail over to their DR sites. To prevent such correlated
failures from stressing any one data center, the cloud
provider should attempt to distribute its DR customers
across multiple data centers in a way that minimizes potential
conflicts—e.g. multiple customers from the same
geographic region should be backed up to different cloud
data centers. This placement problem is further complicated
by constraints such as limits on latency between
the customer and cloud site. To intelligently address this
issue, the cloud provider must employ risk models, not
unlike ones used by insurance companies, to (i) estimate
how many servers should be available in a data center for
a certain group of customers and (ii) decide how to distribute
customers from a region across different data center sites
to “spread the risk”. In the event of stress on any single
data center due to correlated failures, dynamic migration
of a group of customers to another site can be employed.
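
One simple way to picture the placement step is a greedy heuristic. Each customer is assigned to the latency-feasible data center that currently hosts the least load from that customer's own region. This is only a sketch under stated assumptions: the function names, the tuple layout, and the greedy rule itself are hypothetical, and a real provider would likely use a richer risk model.

```python
from collections import defaultdict

def place_customers(customers, latency_ms, max_latency_ms):
    """customers: list of (name, region, servers).
    latency_ms: {(region, data_center): latency in ms}.
    Returns {name: data_center}, spreading each region across sites."""
    region_load = defaultdict(int)  # (data_center, region) -> servers placed
    total_load = defaultdict(int)   # data_center -> servers placed
    data_centers = sorted({dc for (_region, dc) in latency_ms})
    placement = {}
    # Place the largest customers first so they get the emptiest sites.
    for name, region, servers in sorted(customers, key=lambda c: -c[2]):
        feasible = [dc for dc in data_centers
                    if latency_ms.get((region, dc), float("inf")) <= max_latency_ms]
        if not feasible:
            raise ValueError(f"no data center meets the latency limit for {name}")
        # Prefer the site with the least same-region load, then least total load.
        best = min(feasible,
                   key=lambda dc: (region_load[(dc, region)], total_load[dc], dc))
        placement[name] = best
        region_load[(best, region)] += servers
        total_load[best] += servers
    return placement

def worst_regional_load(customers, placement):
    """Servers each data center must supply if its single worst region fails."""
    load = defaultdict(int)
    for name, region, servers in customers:
        load[(placement[name], region)] += servers
    worst = defaultdict(int)
    for (dc, _region), servers in load.items():
        worst[dc] = max(worst[dc], servers)
    return dict(worst)

# Hypothetical customers, regions, and latencies.
customers = [("a", "east", 10), ("b", "east", 10), ("c", "east", 10), ("d", "west", 10)]
latency_ms = {("east", "dc1"): 20, ("east", "dc2"): 30, ("east", "dc3"): 90,
              ("west", "dc1"): 80, ("west", "dc2"): 25, ("west", "dc3"): 15}
placement = place_customers(customers, latency_ms, max_latency_ms=50)
worst = worst_regional_load(customers, placement)
```

The per-site worst-case figure is the quantity a risk model would compare against the free servers kept at each data center.
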
To achieve all of these tasks seamlessly, the cloud
provider should be able to treat all of its data centers
as a single pool of resources available to its DR customers
[10, 2]. In practice, current data centers act as isolated
entities and it is non-trivial to move or replicate storage
and computation resources between data centers. We
believe that future cloud architectures will rely on network
virtualization to provide seamless connectivity between
data centers, and wide-area VM and storage migration
to allow for resource management across data center
sites.


Revenue Maximization

The DR strategies we have discussed assume that customers
only pay for the majority of their DR resources
after some kind of failure actually occurs, and that sufficient
resources are always available when needed. The
cloud service provider must maintain these resources and
pay for their upkeep at all times, regardless of whether
a customer has experienced a failure. Since disasters are
typically rare, there will be little or no revenue from the
server farm in the normal case when there are no failures.
Hence, a cloud provider must find ways to generate
revenue from such idling resources in order to make its
capital investments viable.

We assume that a cloud DR provider will also offer traditional
cloud computing services and rent its resources
to customers for non-DR purposes. In this case, the cloud
may be able to “double book” its servers for both regular
and DR customers. Public clouds generally only offer
best-effort service when new VM or network resources
are requested. While this is sufficient for general cloud
computing, in disaster recovery it is imperative that additional
resources be available within the specified RTO.
One existing pricing mechanism that would facilitate this
on-demand access to resources is the use of “spot instances”.
Spot instances allow the service provider to
rent resources, typically at a lower price, without guarantees
about how long they will be available. A cloud
service could generate revenue from idling DR servers by
offering them as spot instances to non-DR customers and
reclaiming them on demand when these servers are needed
for high-priority DR customers.
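
The double-booking idea can be sketched as a pool that rents idle servers to spot tenants and evicts them when a DR customer fails over. The class, its method names, and the newest-first eviction policy below are hypothetical choices made for illustration. They are not a description of any real provider's API.

```python
class ServerPool:
    """Toy model of a DR server pool whose idle capacity is rented as spot."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.dr = {}    # DR customer -> servers in use after failover
        self.spot = []  # (tenant, servers) in the order they were rented

    def free(self):
        used = sum(self.dr.values()) + sum(n for _, n in self.spot)
        return self.capacity - used

    def rent_spot(self, tenant, servers):
        """Rent idle servers with no guarantee about how long they last."""
        if servers > self.free():
            return False
        self.spot.append((tenant, servers))
        return True

    def failover(self, customer, servers):
        """Give a DR customer its servers, evicting spot tenants as needed.
        Returns the list of evicted tenants."""
        reclaimable = self.free() + sum(n for _, n in self.spot)
        if servers > reclaimable:
            raise RuntimeError("pool too small even after reclaiming spot servers")
        evicted = []
        while self.free() < servers:
            tenant, _ = self.spot.pop()  # assumed policy: evict the newest tenant first
            evicted.append(tenant)
        self.dr[customer] = self.dr.get(customer, 0) + servers
        return evicted

pool = ServerPool(capacity=10)
pool.rent_spot("s1", 4)
pool.rent_spot("s2", 4)
evicted = pool.failover("acme", 5)  # needs 5 servers but only 2 are free
```

In this toy run, tenant "s2" is evicted and "s1" keeps running. A real system would also need to bound eviction time so that the DR customer's RTO is still met.
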
Currently, cloud platforms often provide few guarantees
about server and bandwidth availability and network
quality of service, which are important for ensuring an
application can fully operate after failover. EC2 currently
supports “reserved” VM instances that are guaranteed to
be available, but they are primarily designed for users
who know that they will be actively running a VM for
a long period of time, and their pricing is designed to reflect
this with a moderate yearly fee but cheaper hourly
costs. For disaster recovery, it may be desirable to allow
for “priority resources” which are guaranteed to be
available on demand, although perhaps at a higher hourly
cost than ordinary VM instances or network bandwidth
(which also increases the revenue for the cloud provider
while providing better assurances to a customer).
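
To see why reserved-style pricing fits DR poorly, the arithmetic can be worked through with made-up prices. The sketch below compares a yearly fee plus a low hourly rate against a hypothetical "priority" rate with no yearly fee. All prices are invented for illustration, and the assumption that priority resources carry no fixed fee is ours, not the source's.

```python
def reserved_cost(hours, yearly_fee, hourly_rate):
    """Yearly cost of a reserved-style instance used for `hours` hours."""
    return yearly_fee + hourly_rate * hours

def priority_cost(hours, hourly_rate):
    """Yearly cost of a hypothetical priority resource billed only when used."""
    return hourly_rate * hours

def break_even_hours(yearly_fee, reserved_hourly, priority_hourly):
    """Hours of use per year above which the reserved-style plan is cheaper."""
    return yearly_fee / (priority_hourly - reserved_hourly)

# Invented prices: 200 per year plus 0.05 per hour, versus 0.50 per hour.
hours_in_failover = 48
reserved = reserved_cost(hours_in_failover, yearly_fee=200.0, hourly_rate=0.05)
priority = priority_cost(hours_in_failover, hourly_rate=0.50)
threshold = break_even_hours(200.0, 0.05, 0.50)
```

Under these invented prices, a customer that spends only a couple of days per year in failover pays far less with the priority rate, while the provider still earns more per hour of actual use.
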
