I recently had a discussion about setting up a secondary datacenter to provide business continuity for an existing infrastructure using stretched storage. That is an interesting topic in itself, with many possible architectural outcomes, which I will not get into in this post. For the sake of argument, let's say we decide to create active-active datacenters with EMC VPLEX Metro.
The other side of the coin is how to manage these active-active datacenters while balancing management overhead and resiliency.
vSphere Metro Storage Cluster (vMSC) and SRM 6.1, which supports stretched storage, are the two solutions I am going to review. There are already plenty of articles out there on this, but many of them focus on releases before 6.1. These are just my notes and views; if you have a different viewpoint or something needs tweaking, please let me know. I cherish all feedback.
Why use stretched clusters:
- Disaster avoidance or site maintenance without downtime is important.
- Non-disruptive migration of workloads between the active-active datacenters.
- When availability is one of your top priorities. Depending on the failure scenario you have more outcomes where your VMs will not be impacted by outages from network, storage or host chassis failure at a site.
- When your datacenters have network links which do not exceed 5 milliseconds round-trip response time.
- Redundant network links are highly recommended.
- When you require multi-site load balancing of your workloads.
VPLEX requirements:
- The round-trip latency on both the IP network and the inter-cluster network between the two VPLEX clusters must not exceed 5 milliseconds.
- For management and vMotion traffic, the ESXi hosts in both data centers must have a private network on the same IP subnet and broadcast domain. Preferably management and vMotion traffic are on separate networks.
- Stretched layer 2 network, meaning the networks the VMs reside on need to be available and accessible from both sites.
- The data storage locations, including the boot device used by the virtual machines, must be active and accessible from ESXi hosts in both data centers.
- vCenter Server must be able to connect to ESXi hosts in both data centers.
- The VMware datastores for the virtual machines running in the ESXi cluster are provisioned on Distributed Virtual Volumes.
- The maximum number of hosts in the HA cluster must not exceed 32 hosts for 5.x and 64 hosts for 6.0.
- The configuration option auto-resume for VPLEX Cross-Connect consistency groups must be set to true.
- Enabling FT on the virtual machines is supported except for Cluster Witness Servers.
- This configuration is supported on both VS2 and VS6 hardware for VPLEX 6.0 and later releases.
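A few of the requirements above are simple numeric or yes/no checks, so they can be captured in a small pre-flight script while planning. The following is only a sketch: the `StretchedClusterPlan` class and its field names are made up for illustration and do not correspond to any VMware or EMC API. The limits are the ones quoted in the list above.

```python
from dataclasses import dataclass

# Limits taken from the requirements list above.
MAX_RTT_MS = 5.0
MAX_HA_HOSTS = {"5.x": 32, "6.0": 64}


@dataclass
class StretchedClusterPlan:
    """Hypothetical summary of a planned design (not a VMware or EMC object)."""
    rtt_ms: float          # measured round-trip latency between sites
    vsphere_version: str   # "5.x" or "6.0"
    host_count: int        # total hosts in the stretched HA cluster
    auto_resume: bool      # auto-resume set on cross-connect consistency groups
    stretched_l2: bool     # VM networks reachable from both sites


def check_plan(plan: StretchedClusterPlan) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    if plan.rtt_ms > MAX_RTT_MS:
        problems.append(f"RTT {plan.rtt_ms} ms exceeds {MAX_RTT_MS} ms")
    limit = MAX_HA_HOSTS.get(plan.vsphere_version)
    if limit is None:
        problems.append(f"no host limit known for vSphere {plan.vsphere_version}")
    elif plan.host_count > limit:
        problems.append(f"{plan.host_count} hosts exceeds the HA limit of {limit}")
    if not plan.auto_resume:
        problems.append("auto-resume must be set to true")
    if not plan.stretched_l2:
        problems.append("VM networks are not stretched across both sites")
    return problems


if __name__ == "__main__":
    plan = StretchedClusterPlan(rtt_ms=7.5, vsphere_version="5.x",
                                host_count=40, auto_resume=False, stretched_l2=True)
    for problem in check_plan(plan):
        print("FAIL:", problem)
```

This only covers the items that reduce to a number or a flag. Things like datastore accessibility and vCenter connectivity still need to be verified in the environment itself.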
vMSC:
A vMSC infrastructure is a stretched cluster that enables continuous availability across sites, including support for:
- vSphere vMotion
- HA
- DRS
- FT over distance
- Storage failure protection
vMSC requirements:
- Single vCenter server
- Cluster with DRS and HA enabled
- Regular vCenter server requirements apply here
vMSC positives:
- Continuous Availability
- Fully Automatic Recovery
- VMware HA (near zero RTO)
- Automated Load Balancing
- DRS and Instant vMotion
- vMSC using VPLEX Metro
- Certified Since vSphere 5.0
- Behaves just like a single vSphere cluster
vMSC negatives:
- Major architectural and operational considerations for HA and DRS configurations. This is especially true for highly customized environments with rapid changes in configuration. Some configuration change examples:
  - Admission control
  - Host affinity rules to make sure that VMs talk to local storage
  - Datastore heartbeat
  - Management address heartbeat and 2 additional IPs
- Change control: when workloads are migrated to a different site, the rules need to be updated.
- Double the amount of resources required. When you buy one, well, you need to buy a second! This is important since you have to keep enough resources available on each site to satisfy the resource requirements for HA failover, because all VMs are restarted within the cluster.
- Recommended to set Admission control to 50%
- No orchestration of powering on VMs after HA restart.
- HA will attempt to start virtual machines according to their restart priority of High, Medium or Low. The difficulty is that when critical systems must start before the systems that depend on them, VMware HA has no means to control this start order more effectively, or to handle alternate workflows or runbooks for different failure scenarios.
- Single vCenter server
- Failure of the site where vCenter resides disrupts management of both sites. Look out for development on this shortcoming in vSphere 6.5
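The "double the resources" point and the 50% admission control recommendation above come down to simple arithmetic: with half of the cluster reserved, the running workload has to fit on one site. A minimal sketch with made-up capacity numbers (the function names are my own, not part of any vSphere API):

```python
def usable_after_reservation(total_capacity: float, reserved_pct: float = 50.0) -> float:
    """Capacity left for running VMs once admission control reserves reserved_pct."""
    return total_capacity * (100.0 - reserved_pct) / 100.0


def fits_on_surviving_site(site_a: float, site_b: float, total_demand: float) -> bool:
    """True if the whole workload can restart on whichever site survives."""
    return total_demand <= min(site_a, site_b)


if __name__ == "__main__":
    # Illustrative numbers only: memory in GB per site.
    site_a_gb, site_b_gb = 1024.0, 1024.0
    cluster_gb = site_a_gb + site_b_gb

    print(usable_after_reservation(cluster_gb))                 # 1024.0
    print(fits_on_surviving_site(site_a_gb, site_b_gb, 900.0))   # True
    print(fits_on_surviving_site(site_a_gb, site_b_gb, 1200.0))  # False
```

With symmetric sites, a 50% reservation leaves exactly one site's worth of usable capacity. If the sites are not the same size, the smaller site is the one that sets the limit.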
SRM 6.1 with stretched storage:
Site Recovery Manager 6.1 adds support for stretched storage solutions over a metro distance from several major storage partners, and integration with cross-vCenter vMotion when using these solutions as replication. This allows companies to achieve application mobility without incurring downtime, while taking advantage of all the benefits that Site Recovery Manager delivers, including centralized recovery plans, non-disruptive testing and automated orchestration.
Adding stretched storage to a Site Recovery Manager deployment fundamentally reduces recovery times.
- In the case of a disaster, recovery is much faster due to the nature of the stretched storage architecture that enables synchronous data writes and reads on both sites.
- In the case of a planned migration, such as for disaster avoidance, data center consolidation or site maintenance, stretched storage enables zero-downtime application mobility. When using stretched storage, Site Recovery Manager can orchestrate cross-vCenter vMotion operations at scale, using recovery plans. This is what enables application mobility without incurring any downtime.
SRM requirements:
- Storage policy protection groups in enhanced linked mode
- External PSCs, as required for enhanced linked mode
- Supported compatible storage arrays and SRAs
- A vCenter server at each site
- A Windows server at each site for the SRM application and SRA.
SRM positives:
- Provides orchestrated and complex reactive recovery solution
- For instance, a 3-tiered application which has dependencies on specific services/servers that must power on first.
- Provides consistent, repeatable and testable RTOs
- DR compliance shown through audit trails and repeatable processes.
- Disaster Avoidance (Planned)
- Manually Initiate with SRM
- Uses vMotion across vCenters for VMs
- Disaster Recovery (Unplanned)
- Manually Initiate Recovery Plan Orchestration
- SRM Management Resiliency
- VMware SRM 6.1 + VPLEX Metro 5.5
- Stretched Storage with new VPLEX SRA
- Separate failure domains, different vSphere Clusters
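To make the 3-tiered example in the list above concrete, here is a sketch of what "dependency-aware power-on" means in practice. It is not SRM code and does not use any SRM API. It only uses Python's standard `graphlib` module (Python 3.9+) to turn a dependency map into start-up waves, which is conceptually what an orchestrated recovery plan gives you and what plain HA restart priorities do not. The VM names are made up.

```python
from graphlib import TopologicalSorter


def startup_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group VMs into waves; every VM starts only after all VMs it depends on."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    # Hypothetical 3-tier application: web needs app, app needs db.
    three_tier = {
        "web01": {"app01"},
        "web02": {"app01"},
        "app01": {"db01"},
        "db01": set(),
    }
    for number, wave in enumerate(startup_waves(three_tier), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

For the sample map this produces three waves: the database first, then the application server, then both web servers together.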
SRM negatives:
- No Continuous Availability
- No HA, DRS or FT across sites
- No SRM “Test” Recovery plan due to stretched storage
- You have to use a planned migration to “test”, but be aware that the VMs associated with the protection group will migrate live to the second site.
Questions to ask:
In the end I really think it all comes down to a couple of questions you can ask to make the decision easier. SRM has narrowed the gap on some of the features that vMSC provides, so these questions are based on the remaining differences between the two solutions.
- Do you have complex tiered applications with dependencies on other applications, for instance databases?
- Do you have a highly customized environment which incurs rapid changes?
- Do you require DR compliance with audit trails and repeatable processes?
Pick SRM!
- Do you require a “hands off” fast automated failover?
- Do you have non-complex applications without any dependencies, and do you not care how these power on during failover?
- Do you want to have your workloads automatically balanced across different sites?
Pick vMSC!
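If you want to walk a team through these questions, the two groups above can be tallied with a tiny helper. This is just a sketch of the checklist as code. The question keys are names I made up, and a tie means you need a real design discussion rather than a script.

```python
# Keys are hypothetical labels for the questions listed above.
SRM_QUESTIONS = ("tiered_app_dependencies", "rapid_config_changes", "dr_audit_compliance")
VMSC_QUESTIONS = ("hands_off_failover", "no_startup_ordering", "cross_site_load_balancing")


def recommend(answers: dict[str, bool]) -> str:
    """Count the 'yes' answers per group and return the side with more of them."""
    srm_score = sum(answers.get(q, False) for q in SRM_QUESTIONS)
    vmsc_score = sum(answers.get(q, False) for q in VMSC_QUESTIONS)
    if srm_score > vmsc_score:
        return "SRM"
    if vmsc_score > srm_score:
        return "vMSC"
    return "undecided"


if __name__ == "__main__":
    print(recommend({"tiered_app_dependencies": True, "dr_audit_compliance": True}))
    print(recommend({"hands_off_failover": True, "cross_site_load_balancing": True}))
```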
Links:
https://kb.vmware.com/kb/2007545
http://pubs.vmware.com/Release_Notes/en/srm/61/srm-releasenotes-6-1-0.html