Stretched storage using SRM or vMSC?

I recently had a discussion about setting up a secondary datacenter to provide business continuity for an existing infrastructure using stretched storage.  This is an interesting topic in itself, with many possible architectural outcomes, which I will not get into in this post. For the sake of argument, let's say we decide to create active-active datacenters with EMC VPLEX Metro.

The other side of the coin is how to manage these active-active datacenters while balancing management overhead and resiliency.

vSphere Metro Storage Cluster (vMSC) and SRM 6.1, which supports stretched storage, are the two solutions I am going to review. There are already plenty of articles out there on this, but many of them focus on releases before 6.1.  These are just my notes and views; if you have a different viewpoint or something needs tweaking, please let me know. I cherish all feedback.

Why use stretched clusters:

  • Disaster avoidance or site maintenance without downtime is important.
    • Non-disruptive migration of workloads between the active-active datacenters.
  • When availability is one of your top priorities.  Depending on the failure scenario you have more outcomes where your VMs will not be impacted by outages from network, storage or host chassis failure at a site.
  • When your datacenters have network links which do not exceed 5 milliseconds round-trip time.
    • Redundant network links are highly recommended.
  • When you require multi-site load balancing of your workloads.


VPLEX requirements:

  • The maximum round-trip latency on both the IP network and the inter-cluster network between the two VPLEX clusters must not exceed 5 milliseconds.
  • For management and vMotion traffic, the ESXi hosts in both data centers must have a private network on the same IP subnet and broadcast domain. Preferably management and vMotion traffic are on separate networks.
  • Stretched layer 2 network, meaning the networks on which the VMs reside need to be available/accessible from both sites.
  • The data storage locations, including the boot device used by the virtual machines, must be active and accessible from ESXi hosts in both data centers.
  • vCenter Server must be able to connect to ESXi hosts in both data centers.
  • The VMware datastores for the virtual machines running in the ESXi cluster are provisioned on Distributed Virtual Volumes.
  • The maximum number of hosts in the HA cluster must not exceed 32 hosts for 5.x and 64 hosts for 6.0.
  • The configuration option auto-resume for VPLEX Cross-Connect consistency groups must be set to true.
  • Enabling FT on the virtual machines is supported except for Cluster Witness Servers.
  • This configuration is supported on both VS2 and VS6 hardware for VPLEX 6.0 and later releases.
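
The numeric limits in the list above lend themselves to a quick pre-flight check. Below is a minimal Python sketch, assuming you have already measured the round-trip latency yourself and know your host count. The function name and inputs are hypothetical, and nothing here queries VPLEX or vCenter.

```python
# Hypothetical pre-flight check. The limits are the ones listed above;
# the measured values must be supplied by the caller.
MAX_RTT_MS = 5.0
MAX_HA_HOSTS = {"5.x": 32, "6.0": 64}


def check_stretched_cluster(rtt_ms, host_count, vsphere_version):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if rtt_ms > MAX_RTT_MS:
        problems.append(
            f"round-trip latency {rtt_ms} ms exceeds {MAX_RTT_MS} ms"
        )
    limit = MAX_HA_HOSTS.get(vsphere_version)
    if limit is None:
        problems.append(f"no host limit known for vSphere {vsphere_version}")
    elif host_count > limit:
        problems.append(
            f"{host_count} hosts exceeds the limit of {limit} "
            f"for vSphere {vsphere_version}"
        )
    return problems


if __name__ == "__main__":
    for issue in check_stretched_cluster(7.0, 40, "5.x"):
        print("FAIL:", issue)
```

This only covers the two requirements that reduce to a number. The networking and storage visibility requirements still need to be verified by hand.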



A vMSC infrastructure is a stretched cluster that enables continuous availability across sites, including support for:

  • vSphere vMotion
  • HA
  • DRS
  • FT over distance
  • Storage failure protection

vMSC requirements:

  • Single vCenter server
  • Cluster with DRS and HA enabled
  • Regular vCenter server requirements apply here

vMSC positives:

  • Continuous Availability
  • Fully Automatic Recovery
    • VMware HA (near zero RTO)
  • Automated Load Balancing
    • DRS and Instant vMotion
  • vMSC using VPLEX Metro
    • Certified Since vSphere 5.0
  • Behaves just like a single vSphere cluster

vMSC negatives:

  • Major architectural and operational considerations for HA and DRS configurations. This is especially true for highly customized environments with rapid changes in configuration.  Some configuration change examples:
    • Admission control
    • Host affinity rules to make sure that VMs talk to local storage
    • Datastore heartbeat
    • Management address heartbeat and 2 additional IPs
    • Change control: when workloads are migrated to a different site, the rules need to be updated.
  • Double the amount of resources required.  When you buy one, well you need to buy a second!  This is important since you have to keep enough resources available on each site to satisfy the resource requirements for HA failover, since all VMs are restarted within the cluster.
    • Recommended to set Admission control to 50%
  • No orchestration of powering on VMs after HA restart.
    • HA will attempt to start virtual machines according to their categorization of High, Medium or Low. The difficulty here is that if critical systems must start before other systems that depend on them, there is no means by which VMware HA can control this start order more effectively, or handle alternate workflows or runbooks that cover different failure scenarios.
  • Single vCenter server
    • Failure of the site where vCenter resides disrupts management of both sites. Look out for development on this shortcoming in vSphere 6.5
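
The "double the resources" point and the 50% admission control recommendation in the list above come down to simple arithmetic: whichever site survives has to carry the entire workload. Here is a minimal sketch of that reasoning, assuming capacity and demand are expressed in the same unit (GHz of CPU, GB of RAM, or similar). The function names are hypothetical and this is not how HA admission control is computed internally.

```python
def survives_site_failure(site_a_capacity, site_b_capacity, total_demand):
    """True if either site on its own can run the whole workload."""
    return min(site_a_capacity, site_b_capacity) >= total_demand


def max_safe_utilisation(site_a_capacity, site_b_capacity):
    """Fraction of total cluster capacity that can be used while still
    tolerating the loss of the larger site."""
    total = site_a_capacity + site_b_capacity
    return min(site_a_capacity, site_b_capacity) / total


if __name__ == "__main__":
    # Two equally sized sites: half of the cluster must stay in reserve.
    print(max_safe_utilisation(100, 100))
    # Unequal sites: the smaller site sets the ceiling.
    print(max_safe_utilisation(120, 80))
    print(survives_site_failure(100, 100, 130))
```

With two equally sized sites the safe ceiling works out to half of the cluster, which matches the 50% recommendation. With unequal sites the smaller one dictates how much you can actually run.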


SRM 6.1 with stretched storage:

Site Recovery Manager 6.1 adds support for stretched storage solutions over a metro distance from several major storage partners, and integration with cross-vCenter vMotion when using these solutions as replication. This allows companies to achieve application mobility without incurring downtime, while taking advantage of all the benefits that Site Recovery Manager delivers, including centralized recovery plans, non-disruptive testing and automated orchestration.

Adding stretched storage to a Site Recovery Manager deployment fundamentally reduces recovery times.

  • In the case of a disaster, recovery is much faster due to the nature of the stretched storage architecture that enables synchronous data writes and reads on both sites.
  • In the case of a planned migration, such as for disaster avoidance, data center consolidation or site maintenance, using stretched storage enables zero-downtime application mobility. When using stretched storage, Site Recovery Manager can orchestrate cross-vCenter vMotion operations at scale, using recovery plans. This is what enables application mobility without incurring any downtime.

SRM requirements:

  • Storage policy protection groups in enhanced linked mode
  • External PSCs, as required for enhanced linked mode
  • Supported compatible storage arrays and SRAs
  • A vCenter Server at each site
  • A Windows server at each site for the SRM application and SRA

SRM positives:

  • Provides orchestrated and complex reactive recovery solution
    • For instance, a 3-tiered application which has dependencies on specific services/servers that must power on first.
  • Provides consistent, repeatable and testable RTOs
  • DR compliance shown through audit trails and repeatable processes.
  • Disaster Avoidance (Planned)
    • Manually Initiate with SRM
    • Uses vMotion across vCenters for VMs
  • Disaster Recovery (Unplanned)
    • Manually Initiate Recovery Plan Orchestration
    • SRM Management Resiliency
  • VMware SRM 6.1 + VPLEX Metro 5.5
    • Stretched Storage with new VPLEX SRA
    • Separate failure domains, different vSphere Clusters
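
To make the tiered-application point in the list above concrete, here is a minimal Python sketch of dependency-ordered startup using the standard library's `graphlib` (Python 3.9+). It only illustrates the ordering idea that an orchestrated recovery plan gives you and that HA restart priorities do not. It is not SRM's implementation, and the VM names and the `startup_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def startup_waves(depends_on):
    """Group VMs into waves; every VM in a wave only depends on VMs
    from earlier waves, so each wave can be powered on together."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    # Hypothetical 3-tier application: each VM maps to the VMs it needs first.
    app = {
        "web01": {"app01"},
        "web02": {"app01"},
        "app01": {"db01"},
        "db01": set(),
    }
    for number, wave in enumerate(startup_waves(app), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

The database comes up first, then the application server, then both web servers together. This is the kind of ordering you cannot express with High, Medium and Low alone once the dependency chain gets longer than three levels.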

SRM negatives:

  • No Continuous Availability
  • No HA, DRS or FT across sites
  • No SRM “Test” Recovery plan due to stretched storage
    • You have to use a planned migration to “test”, but be aware that the VMs associated with the protection group will migrate live to the second site.


Questions to ask:

In the end I really think it all comes down to a couple of questions you can ask to make the decision easier.  SRM has narrowed the gap on some of the features that vMSC provides, so these questions are based on the remaining differences between the two solutions.

  1. Do you have complex tiered applications with dependencies on other applications, for instance databases?
  2. Do you have a highly customized environment which incurs rapid changes?
  3. Do you require DR compliance with audit trails and repeatable processes?

Pick SRM!

  1. Do you require a “hands off” fast automated failover?
  2. Do you have non-complex applications without any dependencies, and do not care how these power on during failover?
  3. Do you want to have your workloads automatically balanced across different sites?

Pick vMSC!
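
If you want to run the two sets of questions above against several environments, they can be written down as a small helper. This is only a sketch: tallying the "yes" answers and comparing the counts is my own simplification and not a rule from either product, and the function name is hypothetical.

```python
def recommend(srm_answers, vmsc_answers):
    """Each argument is a list of booleans, one per question in the
    corresponding list above (True means "yes")."""
    srm_score = sum(bool(answer) for answer in srm_answers)
    vmsc_score = sum(bool(answer) for answer in vmsc_answers)
    if srm_score > vmsc_score:
        return "SRM"
    if vmsc_score > srm_score:
        return "vMSC"
    return "tie"


if __name__ == "__main__":
    # Complex tiered apps and audit requirements, no need for hands-off failover.
    print(recommend([True, False, True], [False, False, True]))
```

A tie is a sign that the decision needs a closer look at which of the questions matters most to the business.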



EMC UnityVSA with SRM configuration

I am not going to get into the details of setting up SRM and EMC Unity, as this is very well documented. The information I provide here picks up after SRM is installed and configured on vCenter and EMC Unity is installed and configured.

Previous blog post shows UnityVSA setup:


I already have my pools and LUNs configured on both Unity virtual storage appliances.
First we want to set up an interface for replication on both UnityVSAs.
In Unisphere select Data protection -> Replication
Select Interfaces
Click + sign

Select Ethernet Port and provide IP address information.

Click OK

Now let's configure the remote connections between the Unity arrays.
In Unisphere select Data protection -> Replication
Select Connections
Click + sign

Enter Replication connection information for your remote Unity VSA.
Asynchronous is the only supported method for the Unity VSA.

Click OK.
Select the remote system and click “Verify and Update” to make sure everything is working correctly.

Now let's go ahead and set up the consistency groups.
In Unisphere select Storage -> Block
Select Consistency Groups
Click + sign

Provide name

Configure your LUNs.  You have to create a minimum of 1 LUN, but you can later add your existing LUNs to this consistency group if required.

Click + to Configure access

Add initiators

Create Snapshot schedule

Specify replication mode and RPO

Specify destination

Click Finish

Now that we have replication configured we can go to vCenter and configure SRM.

I already have my EMC Unity Block SRA installed on my SRM server. My mappings are also configured within each site, so we will skip this.

Open vCenter server and select Site recovery.
Select each site -> Monitor -> SRAs
Select rescan all SRAs
Verify that EMC Unity Block SRA is available.

Let’s configure Array Based Replication.
Select Site recovery
Select Inventories -> Double click Array Based Replication
Select “Add array manager”
On popup wizard select “Add a pair of array managers”

Select location

Select Storage replication adapter, EMC Unity Block SRA

Configure Array manager

Configure array manager pair for secondary site.

Enable the pairs

Click Finish

Verify Status is OK

Click on each storage array and verify no errors and that you can see the local devices being replicated.

Now we can set up the protection group
Select Site recovery
Select Inventories -> Protection Groups
Select “Create Protection group”
Enter name

Select protection group direction and type. For this we will select array based replication with datastore groups.

Select datastore groups

This will provide information on the VMs which will be protected.

Click Finish
Verify protection status is OK

Finally, you can configure your recovery plan:
Select Site recovery
Select Inventories -> Recovery Plans
Select “Create Recovery plan”
Enter name

Select recovery site
Select protection group

Select network to be used for running tests of the plan.
Click Finish

You can now test your recovery plan.

SRM 5.8: Synchronize storage freezes at 90%

SRM 5.8 with storage array replication using VNX MirrorView.

Run a recovery and once completed run reprotect.
During the reprotect the storage synchronization gets stuck at 90%.

SRM gave no real information on the status or any errors, so I had to do some digging.

On the storage array I reviewed the replicated LUN for the specific recovery plan and found that the secondary image was showing “waiting for administrator to start synchronization”.

By default SRM queries an ongoing synchronization every 30 seconds to report its status, so once I manually started the synchronization on the array and it completed, the SRM status also updated and the reprotect completed.

This setting is adjustable in the SRM advanced settings per site: storage.querySyncStatusPollingInterval.
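
To illustrate why the SRM status only catches up some time after the array finishes, here is a minimal Python sketch of a fixed-interval polling loop. It is a generic illustration and not SRM code: the status strings, the `get_status` callback and the function name are all hypothetical.

```python
import time


def poll_until_synced(get_status, interval_s=30, max_polls=100, sleep=time.sleep):
    """Call get_status() once per interval until it reports 'synchronized'.

    Returns the number of polls it took, or None if max_polls was reached.
    The caller only learns about a state change at the next poll, so the
    reported status can lag the real one by up to interval_s seconds.
    """
    for poll in range(1, max_polls + 1):
        if get_status() == "synchronized":
            return poll
        sleep(interval_s)
    return None


if __name__ == "__main__":
    # Simulated array: stuck until an administrator starts the sync.
    states = iter(["waiting for administrator", "synchronizing", "synchronized"])
    polls = poll_until_synced(lambda: next(states), interval_s=0)
    print(f"synchronized after {polls} polls")
```

A loop like this never times out on its own if the array is waiting for manual action, which is consistent with the reprotect sitting at 90% until the synchronization was started on the array.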