Modern cloud offerings (especially from well-established providers) should offer at least 3 levels of availability strategy:

  • Capabilities “built into” the pricing for base system function.
  • Capabilities available at a single geography for availability management.
  • Multi-geography strategic capabilities for robust availability management.
[This post is the second of 3 parts. Part 1 addresses the business context that drives cloud DR, Part 2 (today) will discuss assessing the platform, and Part 3 concludes with a case to look at how we might proceed in a simple customer example.]
Is this about Availability or Disaster Recovery?!
But, wait!  Are we talking about Disaster Recovery or Availability?  The answer is “yes”!  In cloud architectures, almost always these are two sides of the same set of features.  An architect needs to look at the three capabilities above as a baseline of availability but also consider their utility in the event of a disaster.  Remember that one of the things we want to look for is re-use!
Reuse is not only about taking something I’ve already done (a method, pattern, framework, or system) and using it in a different way to meet another requirement in the system. In this case, the way we approach availability strategy can also give us the ability to recover from the loss of information, the loss of a portion of the infrastructure, an incomplete step in the application, or other failures.
Outside of availability, the architect should be looking at how the cloud provider executes traditional recovery strategies, such as backups. Hint: without the addition of a backup service or an intentional implementation on the customer’s end, many of today’s cloud providers are not doing active backups per se; they often rely on the multi-live-copy or live-replication features above to replace that capability!
Each solution needs to be technically assessed both at a holistic level and at an application-component level for its relationship to “platform based” availability strategies.
At the level of making specific architectural decisions for how to align and implement application components to platform capabilities, the architect needs to assess how each layer of functionality is aligned to availability, recovery, and the types of threats it does – and does not – mitigate.  For example, VM management between multiple hosts that is done automatically in a single site would probably be part of “capabilities built into the base system function”.
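One way to make this layer-by-layer assessment concrete is a small coverage table that records, for each platform capability, which of the three levels it belongs to and which threats it mitigates. The sketch below is only an illustration: the capability names, the threat list, and the `uncovered_threats` helper are all assumptions made for this example, not features of any particular provider.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """The three levels of availability strategy described in this post."""
    BUILT_IN = "built into base system function"
    SINGLE_GEO = "single-geography availability management"
    MULTI_GEO = "multi-geography availability management"


@dataclass(frozen=True)
class Capability:
    name: str
    tier: Tier
    mitigates: frozenset  # threats this capability is believed to cover


# Hypothetical threat list; a real assessment would come from the business context.
THREATS = frozenset({"host failure", "demand spike", "site outage", "data corruption"})


def uncovered_threats(capabilities, threats=THREATS):
    """Return the threats that no listed capability mitigates, sorted by name."""
    covered = set()
    for capability in capabilities:
        covered |= capability.mitigates
    return sorted(threats - covered)


# Illustrative entries only, loosely following the examples in the text.
capabilities = [
    Capability("automatic VM management across hosts", Tier.BUILT_IN,
               frozenset({"host failure"})),
    Capability("second VM behind a load balancer, with scale-out", Tier.SINGLE_GEO,
               frozenset({"host failure", "demand spike"})),
    Capability("multi-site deployment with geo-DNS", Tier.MULTI_GEO,
               frozenset({"site outage"})),
]

print(uncovered_threats(capabilities))  # ['data corruption']
```

In this made-up table, every availability feature is accounted for and data corruption is still uncovered, which is the backup gap raised in the hint above: replication copies bad data as faithfully as good data.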
There may be a feature which would permit a second VM at the application level, fronted by a load balancer, or perhaps there is a feature available that would permit rapid expansion and contraction on-demand for your instance at a single geography within a broader availability group to protect against demand-based inaccessibility.  Finally, you could consider features that allow multiple site deployment of resources, tied together through a geo-IP based directional service (like a CDN content provider or geo-DNS solution).  If one entire site were to go offline, you could potentially trust the provider to direct incoming requests to other sites.
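The multi-site idea can be sketched as a routing decision: send each request to the client's preferred site, and fall back to another healthy site if the preferred one is offline. This is a simplified model of what a geo-DNS or similar directional service might do, not any provider's actual behavior; the site names, region names, and `route_request` function are hypothetical.

```python
def route_request(client_region, site_health, preferences):
    """Pick a site for a request.

    site_health maps site name -> True if the site is currently healthy.
    preferences maps client region -> ordered list of preferred sites.
    Returns the chosen site name, or None if no site is healthy.
    """
    # First choice: the healthy site highest in the client's preference order.
    for site in preferences.get(client_region, []):
        if site_health.get(site, False):
            return site
    # Fallback: any healthy site at all, even if it is not listed for this region.
    for site, healthy in site_health.items():
        if healthy:
            return site
    return None


# Hypothetical two-site deployment.
site_health = {"us-east": True, "eu-west": True}
preferences = {
    "europe": ["eu-west", "us-east"],
    "americas": ["us-east", "eu-west"],
}

print(route_request("europe", site_health, preferences))  # eu-west

site_health["eu-west"] = False  # simulate the entire site going offline
print(route_request("europe", site_health, preferences))  # us-east
```

Note that the sketch only redirects traffic. Whether the surviving site has current data and enough capacity to absorb the extra load is a separate question the architect still has to answer.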
All of these capabilities have cost implications, and the architect must be positioned to consider how vertical and horizontal deployment of capability affects the solution against cost and maintainability constraints. Does the application component, at any given level of the solution, have the intelligence to accept or account for platform capabilities? If not, you may find yourself further constrained technically in which routes are supported for providing availability and recovery for your overall solution.
Beyond all of these considerations, which have implications across availability and disaster recovery in provisioning and operating the infrastructure, the architect should also consider whether a recovery strategy that includes capabilities entirely outside the provider’s scope is necessary. Does the business trust the provider’s platform and guarantee (do YOU trust the provider) to a level that obviates the need for a more formal backup and restore capability, possibly including export to a customer-controlled (“on-premise”) storage capability?
Part of the business assessment for the solution should help identify whether the additional investment is needed alongside the cloud strategy, and to what level of investment the backup capability should be explored.
The holistic consideration of the base compute platform, the DR implications of the availability technology, and DR-specific features are all key responsibilities of the architect team, which must weigh their value and find the right combination of features that meets the business needs, conforms to constraints, and, where necessary, conforms to larger business policies.
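To get a rough feel for the cost trade-off, the sketch below compares running one, two, or three redundant copies of a component. It uses the standard formula for the availability of independent replicas where any one copy can serve requests, 1 - (1 - a)^n. The 99% single-copy availability and the unit cost of 100 are placeholder numbers invented for the example, and the independence assumption is optimistic: copies in the same site often fail together.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single, replicas):
    """Availability of `replicas` independent copies when any one copy suffices."""
    return 1 - (1 - single) ** replicas


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


def compare_options(single, unit_cost, max_replicas=3):
    """Return (replicas, availability, downtime hours/year, cost) for each option."""
    rows = []
    for n in range(1, max_replicas + 1):
        availability = combined_availability(single, n)
        rows.append((n, availability, expected_downtime_hours(availability), n * unit_cost))
    return rows


# Placeholder inputs: 99% per-copy availability, cost of 100 per copy.
for n, availability, downtime, cost in compare_options(single=0.99, unit_cost=100.0):
    print(f"{n} replica(s): availability={availability:.6f}, "
          f"downtime={downtime:.2f} h/yr, cost={cost:.2f}")
```

Under these assumptions, cost grows linearly with each added copy while the availability gain shrinks, which is why the business assessment should set the target before the architect picks the level of redundancy.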
To be continued in part three, applying the concepts in a case study.
Wayne Anderson (@NoCo_Architect) is an Infrastructure Managed Services Architect with Avanade, a company that helps customers realize results in a digital world through business technology solutions and managed services that combine insight, innovation and expertise focused on Microsoft® technologies. He has completed more than 30 Microsoft certifications in his career alongside credentials from CompTIA and other industry vendors. Mr. Anderson’s past roles include management of global certification with Avanade, as well as focus in information security and architecture.