Stretched storage using SRM or vMSC?

I recently had a discussion regarding the setup of a secondary datacenter to provide business continuity to an existing infrastructure using stretched storage.  This in itself is an interesting topic with many architectural outcomes which I will not get into in this post, but let's, for the sake of argument, say we decide to create active-active data centers with EMC VPLEX Metro.

The other side of the coin is how you manage these active-active data centers while balancing management overhead and resiliency.

vSphere Metro Storage Cluster (vMSC) and SRM 6.1, which supports stretched storage, are the two solutions I am going to review. There are already a whole bunch of articles out there on this, but many of them focus mainly on pre-6.1 releases.  These are just my notes and views, and if you have a different viewpoint or it needs some tweaking please let me know, I cherish all feedback.

Why use stretched clusters:

  • Disaster avoidance or site maintenance without downtime is important.
    • Non-disruptive migration of workloads between the active-active datacenters.
  • When availability is one of your top priorities.  Depending on the failure scenario, there are more outcomes in which your VMs will not be impacted by a network, storage or host chassis failure at a site.
  • When your datacenters have network links which do not exceed 5 milliseconds round-trip response time.
    • Redundant network links are highly recommended.
  • When you require multi-site load balancing of your workloads.

 

VPLEX requirements:

  • The round-trip latency on both the IP network and the inter-cluster network between the two VPLEX clusters must not exceed 5 milliseconds (you can sanity-check this from an ESXi host with vmkping, see the example after this list).
  • For management and vMotion traffic, the ESXi hosts in both data centers must have a private network on the same IP subnet and broadcast domain. Preferably management and vMotion traffic are on separate networks.
  • A stretched layer 2 network, meaning the networks the VMs reside on need to be available/accessible from both sites.
  • The data storage locations, including the boot device used by the virtual machines, must be active and accessible from ESXi hosts in both data centers.
  • vCenter Server must be able to connect to ESXi hosts in both data centers.
  • The VMware datastores for the virtual machines running in the ESXi cluster are provisioned on distributed virtual volumes.
  • The maximum number of hosts in the HA cluster must not exceed 32 hosts for 5.x and 64 hosts for 6.0.
  • The configuration option auto-resume for VPLEX Cross-Connect consistency groups must be set to true.
  • Enabling FT on the virtual machines is supported except for Cluster Witness Servers.
  • This configuration is supported on both VS2 and VS6 hardware for VPLEX 6.0 and later releases.
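
As a quick sanity check of the latency requirement, you can ping the other site from an ESXi host's VMkernel interface and look at the reported round-trip times. A minimal sketch, assuming vmk1 is your vMotion VMkernel interface and 192.168.20.10 is a host at the remote site (both are placeholders, adjust to your environment):

#vmkping -I vmk1 192.168.20.10
#vmkping -I vmk1 -d -s 1472 192.168.20.10

The first command reports the round-trip times, which should stay well under 5 milliseconds; the second additionally sets the do-not-fragment flag with a 1472-byte payload to confirm a standard 1500 MTU frame makes it across the inter-site link.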

 

vMSC:

A vMSC infrastructure is a stretched cluster that enables continuous availability across sites, including support for:

  • vSphere vMotion
  • HA
  • DRS
  • FT over distance
  • Storage failure protection

vMSC requirements:

  • Single vCenter server
  • Cluster with DRS and HA enabled
  • Regular vCenter server requirements apply here

vMSC positives:

  • Continuous Availability
  • Fully Automatic Recovery
    • VMware HA (near zero RTO)
  • Automated Load Balancing
    • DRS and Instant vMotion
  • vMSC using VPLEX Metro
    • Certified Since vSphere 5.0
  • Behaves just like a single vSphere cluster

vMSC negatives:

  • Major architectural and operational considerations for HA and DRS configurations. This is especially true for highly customized environments with rapid changes in configuration.  Some configuration change examples:
    • Admission control
    • Host affinity rules to make sure that VMs talk to local storage
    • Datastore heartbeat
    • Management address heartbeat and 2 additional IPs
    • Change control: when workloads are migrated to a different site, affinity rules need to be updated.
  • Double the amount of resources required.  When you buy one, well, you need to buy a second!  This is important since you have to keep enough resources available on each site to satisfy the resource requirements for HA failover, since all VMs are restarted within the cluster.
    • Recommended to set Admission control to 50%
  • No orchestration of powering on VMs after HA restart.
    • HA will attempt to restart virtual machines with the categorization of High, Medium or Low. The difficulty here is, if critical systems must start before other systems that depend on them, there is no means by which VMware HA can control this start order more effectively or handle alternate workflows or run books that cover different failure scenarios.
  • Single vCenter server
    • Failure of the site where vCenter resides disrupts management of both sites. Look out for development on this shortcoming in vSphere 6.5

 

SRM 6.1 with stretched storage:

Site Recovery Manager 6.1 adds support for stretched storage solutions over a metro distance from several major storage partners, and integration with cross-vCenter vMotion when using these solutions for replication. This allows companies to achieve application mobility without incurring downtime, while taking advantage of all the benefits that Site Recovery Manager delivers, including centralized recovery plans, non-disruptive testing and automated orchestration.

Adding stretched storage to a Site Recovery Manager deployment fundamentally reduces recovery times.

  • In the case of a disaster, recovery is much faster due to the nature of the stretched storage architecture that enables synchronous data writes and reads on both sites.
  • In the case of a planned migration, such as for disaster avoidance, data center consolidation or site maintenance, using stretched storage enables zero-downtime application mobility. When using stretched storage, Site Recovery Manager can orchestrate cross-vCenter vMotion operations at scale, using recovery plans. This is what enables application mobility without incurring any downtime.

SRM requirements:

  • Storage policy protection groups in enhanced linked mode
  • External PSCs to meet the enhanced linked mode requirement
  • Supported compatible storage arrays and SRAs
  • A vCenter Server at each site
  • A Windows server at each site for the SRM application and SRA.

SRM positives:

  • Provides an orchestrated and complex reactive recovery solution
    • For instance a 3-tiered application which has dependencies on specific services/servers to power on first.
  • Provides consistent, repeatable and testable RTOs
  • DR compliance shown through audit trails and repeatable processes.
  • Disaster Avoidance (Planned)
    • Manually initiated with SRM
    • Uses vMotion across vCenters for VMs
  • Disaster Recovery (Unplanned)
    • Manually Initiate Recovery Plan Orchestration
    • SRM Management Resiliency
  • VMware SRM 6.1 + VPLEX Metro 5.5
    • Stretched Storage with new VPLEX SRA
    • Separate failure domains, different vSphere Clusters

SRM negatives:

  • No Continuous Availability
  • No HA, DRS or FT across sites
  • No SRM “Test” Recovery plan due to stretched storage
    • You have to make use of a planned migration to “test”, but be aware that the VMs associated with the protection group will migrate live to the second site.

 

Questions to ask:

At the end I really think it all comes down to a couple of questions you can ask to make the decision easier.  SRM has narrowed the gap on some of the features that vMSC provides, so these questions are based on the remaining differences between the two solutions.

  1. Do you have complex tiered applications with dependencies on other applications, such as databases?
  2. Do you have a highly customized environment which incurs rapid changes?
  3. Do you require DR compliance with audit trails and repeatable processes?

Pick SRM!

  1. Do you require a “hands off” fast automated failover?
  2. Do you have non-complex applications without any dependencies, and do you not care how they power on during failover?
  3. Do you want to have your workloads automatically balanced across different sites?

Pick vMSC!

 

Links:

http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/vmware-vsphere-metro-storage-cluster-recommended-practices-white-paper.pdf

https://kb.vmware.com/kb/2007545

http://pubs.vmware.com/Release_Notes/en/srm/61/srm-releasenotes-6-1-0.html

VSAN – Migrate VSAN cluster from vSS to vDS

How to migrate a VSAN cluster from vSS to vDS
I am sure some of you are currently running a VSAN cluster in some shape or form, either in a POC, development or production environment.  It provides a cost-effective solution that is great for remote offices or even management clusters and can be implemented and managed very easily, but as the saying goes, nothing ever comes easy and you have to work for it.  The same goes here: there are a lot of prerequisites for a VSAN environment that are crucial for implementing a healthy system that performs to its full potential.  I will not go into much detail here and feel free to contact us if any services are required.
One of the recommendations for VSAN is to use a vDS, and your VSAN license actually includes the ability to use a vDS, which allows you to take advantage of simplified network management regardless of the underlying vSphere edition.
If you upgrade from vSS to vDS the steps are a bit different than your normal migration.  I recommend you put the host into maintenance mode with ensure accessibility.  Verify the uplink used for the VSAN VMkernel and use the manage physical network adapters option to remove the vmnic from the vSS and add it to the vDS. Now migrate the VMkernel to the vDS.  If you review the VSAN health, the network test will show failed.
To verify multicast network traffic is flowing from your host, use the following command on the ESXi host in the bash shell:
#tcpdump-uw -i vmk2 -n -s0 -t -c 20 udp port 23451 or udp port 12345
To review your multicast network settings
#esxcli vsan network list
Ruby vSphere Console (RVC) is also a great tool to have in your arsenal for managing VSAN, and the following command can be used to review the VSAN state:
vsan.check_state
To re-establish the network connection you can use the following command:
vsan.reapply_vsan_vmknic_config
Rerun the VSAN health test and verify Network shows passed.
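
On a side note, if you have not used RVC before, a minimal session looks something like this. The vCenter address, SSO user, datacenter and cluster names below are only examples, so adjust them to your environment. First connect (typically from the vCenter appliance itself):

#rvc administrator@vsphere.local@vcsa.lab.local

Then navigate to the cluster, mark it and run the check:

cd /vcsa.lab.local/DC01/computers
mark cluster VSAN-Cluster
vsan.check_state ~cluster

The same ~cluster mark can then be reused with the other vsan.* commands mentioned in these posts, for example vsan.resync_dashboard ~cluster.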

Now that the VSAN network is up and running you can migrate the rest of the VMkernel interfaces.

VSAN – on-disk upgrade error "Failed to realign following Virtual SAN objects"

I upgraded the ESXi hosts from 6.0 GA to 6.0U2 and selected the upgrade for the VSAN on-disk format version; however, this failed with the following error message:

“Failed to realign following Virtual SAN objects:  XXXXX, due to object locked or lack of vmdk descriptor file, which requires manual fix”

I reviewed the VSAN health log file at the following location:
/storage/log/vmware/vsan-health/vmware-vsan-health-service.log

I then grepped the log for the relevant entries:

#grep -i realign /storage/log/vmware/vsan-health/vmware-vsan-health-service.log
#grep -i failed /storage/log/vmware/vsan-health/vmware-vsan-health-service.log

I was aware of this issue due to previous blog posts on the same problem and knew of KB 2144881, which made the task of cleaning objects with missing descriptor files much easier.

I ran the script: python VsanRealign.py -l /tmp/vsanrealign.log precheck.

I however received another alert and the python script did not behave as it should, indicating a swap file either had multiple references or was not found.

I then used RVC to review the object info for the UUID in question.
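
For reference, reviewing an object in RVC looks roughly like this, where ~cluster is a mark pointing at the VSAN cluster and the UUID is a placeholder for the object reported in the error:

vsan.object_info ~cluster 5c4e2a58-1234-abcd-ef01-0050569a0001

The output shows the object's owner, policy and component state, which helps confirm whether it really is an orphaned swap object.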

I used RVC again to try and purge any inaccessible swap files:
vsan.purge_inaccessible_vswp_objects ~cluster

No objects were found.

I then proceeded to review the vmx file for the problem VM in question and found a reference only to the original *.vswp file, and not to the one with the additional extension of *.vswp.41796.
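
If you want to perform the same check, you can do it from the ESXi shell; a rough sketch, assuming the default vsanDatastore name and a VM folder called servername (both placeholders):

#grep -i vswp /vmfs/volumes/vsanDatastore/servername/servername.vmx
#ls -la /vmfs/volumes/vsanDatastore/servername/ | grep vswp

The grep shows which swap file the VM actually references, and the directory listing reveals any extra *.vswp.* leftovers like the one above.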

Every VM on VSAN has 3 swap files:
vmx-servername*.vswp
servername*.vswp
servername*.vswp.lck

I figured this servername*.vswp.41796 is just a leftover file that bears no reference to the VM, and this is what was causing the on-disk upgrade to fail.

I proceeded to move the file to my /tmp directory.  (Please be very careful with deleting/moving any files within a VM folder; this is done at your own risk, and if you are not sure I highly recommend you contact VMware support for assistance.)

I ran the python realign script again. This time I received a prompt to perform the autofix actions to remove the same object in question, for which I selected yes.


I ran the on-disk upgrade again and it succeeded.

Even though VMware provides a great python script that will in most instances help you clean up the VSAN disk groups, there are times when it will not work as planned and then you just have to do a bit more troubleshooting and perhaps make a phone call to GSS.

Links:
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144881

VSAN – cache disk unavailable when creating disk group on Dell

I ran into an issue at a customer where the SSD which was to be used as the cache disk in the VSAN disk group was showing up as a regular HDD.  However, when I reviewed the storage devices the disk was visible and marked as flash…weird.  So what is going on here?

As I found out, this is due to a flash device being used with a controller that does not support JBOD.

To fix this I had to create a RAID 0 virtual disk for the SSD.  If you have a Dell controller this means you have to set the mode to RAID, but make sure that all the regular HDDs to be used in the disk group are set to non-RAID!  Once the host is back online you have to go and mark the SSD drive as flash.  This is the little “F” icon in the disk devices view.

This environment was configured with all the necessary VSAN prerequisites for Dell in place; you can review these in the following blog post:
http://virtualrealization.blogspot.com/2016/07/vsan-and-dell-poweredge-servers.html

Steps to setup RAID-0 on SSD through lifecycle controller:

  1. Lifecycle Controller
  2. System Setup
  3. Advanced hardware configuration
  4. Device settings
  5. Select controller (PERC)
  6. Physical disk management
  7. Select SSD
  8. From the drop-down select “Convert to RAID capable”
  9. Go back to home screen
  10. Select hardware configuration
  11. Configuration wizard
  12. Select RAID configuration
  13. Select controller
  14. Select Disk to convert from HBA to RAID (if required)
  15. Select RAID-0
  16. Select Physical disks (SSD in this case)
  17. Select Disk attribute and name Virtual Disk.
  18. Finish
  19. Reboot
After the ESXi host is online again you have to change the disk to flash. This is because RAID abstracts away most of the physical device characteristics, including the media type.

  • Select ESXi host 
  • Manage -> Storage -> Storage adapters
  • Select vmhba0 from PERC controller
  • Select the SSD disk
  • Click on the “F” icon above.
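
If you prefer doing this from the command line instead of the “F” icon, the older claim rule method can also tag the device as flash; a hedged sketch, where naa.xxxxxxxxxxxxxxxx is a placeholder for your SSD device ID:

#esxcli storage core device list -d naa.xxxxxxxxxxxxxxxx
#esxcli storage nmp satp rule add --satp=VMW_SATP_LOCAL --device=naa.xxxxxxxxxxxxxxxx --option=enable_ssd
#esxcli storage core claiming reclaim -d naa.xxxxxxxxxxxxxxxx

The first command lets you confirm the “Is SSD” value before and after; the claim rule plus reclaim forces the device to be treated as flash. Check this against your ESXi build (and the VMware KBs on tagging SSDs) before using it in production.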

VSAN – Changing Dell Controller from RAID to HBA mode

So I recently had to make some changes for a customer to set the PERC controller to HBA (non-RAID) mode, since previously it was configured in RAID mode and all disks were in RAID 0 virtual disks.  Each disk group consists of 5 disks: 1 x SSD and 4 x HDD.

I cannot overstate this, but make sure you have all the firmware and drivers up to date as per the HCL.

Here are some prerequisites for moving from RAID to HBA mode (I am not going to get into the details of performing these tasks):

  • All virtual disks must be removed or deleted.
  • Hot spare disks must be removed or re-purposed.
  • All foreign configurations must be cleared or removed.
  • All physical disks in a failed state, must be removed.
  • Any local security key associated with SEDs must be deleted.

I followed these steps:

  1. Put the host into maintenance mode with full data migration. You have to select full data migration since we will be deleting the disk group.
    1. This process can be monitored in RVC using command vsan.resync_dashboard ~cluster
  2. Delete the VSAN disk group on the host in maintenance.
  3. Use the virtual console on iDRAC and select boot next time into lifecycle controller
  4. Reboot the host
  5. From LifeCycle Controller main menu
  6. System Setup
  7. Advanced hardware configuration
  8. Device Settings
  9. Select controller card
  10. Select Controller management
  11. Scroll down and select Advanced controller management
  12. Set Disk Cache for Non-RAID to Disable
  13. Set Non RAID Disk Mode to Enabled
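
Once the host boots back up with the controller in HBA mode, I like to confirm ESXi now sees the raw disks and that they are eligible for VSAN before recreating the disk group. A quick check from the ESXi shell:

#vdq -q
#esxcli vsan storage list

vdq -q shows each device, whether it is seen as SSD and whether it is eligible for use by VSAN, while esxcli vsan storage list shows the disks currently claimed by VSAN (which should be empty until you recreate the disk group).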

VSAN – migrate VSAN cluster to new vCenter Server

I had to recently perform a VSAN cluster migration from one vCenter Server to another. This sounds like a daunting task but it ended up being very simple and straightforward, due to VSAN’s architecture not having a reliance on vCenter Server for its normal operation (nice one VMware!). As a bonus, the VMs do not need to be powered off or lose any connectivity!

Steps to perform:
  • Deploy a new vCenter Server and create a vSphere Cluster
  • Enable VSAN on the cluster.
  • Install VSAN license and associate to cluster
  • Disconnect one of the ESXi hosts from your existing VSAN Cluster
  • Add previously disconnected Host to the new VSAN Cluster on your new vCenter Server.
    • You will get a warning within the VSAN configuration page stating there is a “Misconfiguration detected”. This is normal, as the ESXi host cannot yet communicate with the other hosts in the cluster it was configured with.
  • Add the rest of the ESXi hosts.
  • After all the ESXi hosts are added back the warning should disappear. You can verify cluster membership from each host as shown below.
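
A quick way to verify this from the ESXi shell on each host:

#esxcli vsan cluster get

All hosts should report the same Sub-Cluster UUID, and the member count should match the number of hosts you have added to the new cluster.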

VSAN upgrade – Dell PowerEdge servers

I have been meaning to write up a VSAN upgrade on Dell R730xd’s with the PERC H730 which I recently completed at a customer.  This is not going to be a lengthy discussion on the topic; I primarily want to provide some information on the tasks I had to perform for the upgrade to VSAN 6.2.

  1. The VSAN on-disk metadata upgrade is equivalent to doing a SAN array firmware upgrade and therefore requires a good backup and recovery strategy to be in place before you proceed.
  2. Migrate VMs off of the host.
  3. Place the host into maintenance mode.
    1. You want to use whatever the quickest method is to update the firmware, for VSAN’s sake. Normally a Dell FTP update if the network is available to configure.
    2. When you put a host into maintenance mode and choose the option to “ensure accessibility”, it doesn’t migrate all the components off, just enough so that objects remain accessible, which can leave some storage policies in violation.  A timer starts when you power it off, and if the host isn’t back in the VSAN cluster after 60 minutes, it begins to rebuild that host’s data elsewhere in the cluster.  If you know it will take longer than 60 minutes, or where possible, select full data migration.
    3. You can view the resync using the RVC command “vsan.resync_dashboard ~cluster”.
  4. Change the advanced settings required for the PERC H730.
    1. https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144936
    2. esxcfg-advcfg -s 100000 /LSOM/diskIoTimeout
    3. esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor
  5. Upgrade the lsi_mr3 driver. VUM is easy!
  6. Log in to the iDRAC and perform the firmware upgrades:
    1. Backplane expander (BP13G+EXP 0:1): firmware version 1.09 -> 3.03
    2. PERC H730: firmware version 25.3.0.0016 -> 25.4.0.0017
  7. Log in to the Lifecycle Controller and set/verify the BIOS configuration settings for the controller:
    1. https://elgwhoppo.com/2015/08/27/how-to-configure-perc-h730-raid-cards-for-vmware-vsan/
    2. Disk cache for non-RAID = disabled
    3. BIOS mode = pause on errors
    4. Controller mode = HBA (non-RAID)
  8. After all hosts are upgraded, verify VSAN cluster functionality and other prerequisites:
    1. Verify there are no stranded objects on the VSAN datastore by running the python script on each host.
    2. Verify persistent log storage for the VSAN trace files.
    3. Verify the advanced settings from step 4 are still set (see the commands after this list).
  9. Place each host into maintenance mode again.
  10. Upgrade the ESXi host to 6.0U2.
  11. Upgrade the on-disk format to V3.
    1. This task runs for a very long time and has a lot of sub-steps which take place in the background.  It also migrates the data off of each disk group to recreate it as V3.  This has no impact on the VMs.
    2. This process is repeated for all disk groups.
  12. Verify all disk groups are upgraded to V3.
  13. Completed.
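
To verify the PERC H730 advanced settings from step 4 are still in place after the upgrades (as referenced in step 8 above), you can read them back on each host:

#esxcfg-advcfg -g /LSOM/diskIoTimeout
#esxcfg-advcfg -g /LSOM/diskIoRetryFactor

These should return 100000 and 4 respectively, matching KB 2144936.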

I ran into some serious trouble and had a resync task that ran for over a week, due to a VSAN 6.0 issue (KB 2141386) which appears under heavy storage utilization.  The only way to fix this was to put the host into maintenance mode with full data migration, then destroy and recreate the disk group.

Also ALWAYS check the VMware HCL to make sure your firmware is compatible. I can never say this enough since it is super important.

This particular VSAN 6.0 cluster was running with outdated firmware for both the backplane and the PERC H730. I also found that the controller was set to RAID for the disks instead of non-RAID (passthrough or HBA mode).

Links:

VMware has a kick@ass KB on best practices for the Dell PERC H730 in VSAN implementations. Links provided below.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2109665

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144614

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144936


https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2141386

VSAN upgrade – prerequisites

This is just my list of prerequisites for a VSAN upgrade.  Most of these items are applicable to a new install as well, but they are more focused on an upgrade.

Please feel free to provide feedback; I would love to add your experiences to my list.

Prerequisites:

  • The VSAN on-disk metadata upgrade is equivalent to doing a SAN array firmware upgrade and therefore requires a good backup and recovery strategy to be in place before we can proceed.
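
Beyond backups, I also like to confirm the cluster is healthy and not in the middle of a resync before starting; a short RVC sketch, assuming ~cluster is a mark pointing at your VSAN cluster:

vsan.check_state ~cluster
vsan.disks_stats ~cluster
vsan.resync_dashboard ~cluster

Any inaccessible objects, unhealthy disks or outstanding resync traffic should be dealt with before you touch firmware or ESXi versions.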



EMC UnityVSA with SRM configuration

I am not going to get into the details of setting up SRM and EMC Unity as this is very well documented; the information I will provide starts after SRM is installed and configured on vCenter and EMC Unity is installed and configured.

Previous blog post shows UnityVSA setup:
https://virtualrealization.blogspot.com/2016/05/how-to-emc-unityvsa-installation-and.html

EMC UnityVSA:

I already have my pools and LUNs configured on both Unity virtual storage appliances.
Firstly we want to set up an interface for replication on both UnityVSAs.
In Unisphere select Data protection -> Replication
Select Interfaces
Click + sign

Select Ethernet Port and provide IP address information.

click OK

Now let’s configure the remote connections between the Unity arrays.
In Unisphere select Data protection -> Replication
Select Connections
Click + sign

Enter Replication connection information for your remote Unity VSA.
Asynchronous is the only supported method for the Unity VSA.

Click OK.
Select the remote system and click “Verify and Update” to make sure everything is working correctly.

Now let’s go ahead and set up the consistency groups.
In Unisphere select Storage -> Block
Select Consistency Groups
Click + sign

Provide name

Configure your LUNs.  You have to create a minimum of 1 LUN, but you can later add your existing LUNs to this consistency group if required.

Click + to Configure access

Add initiators

Create Snapshot schedule

Specify replication mode and RPO

Specify destination

Click Finish

Now that we have replication configured we can go to vCenter and configure SRM.

SRM:
I already have my EMC Unity Block SRA installed on my SRM server. My mappings are also configured within each site, so we will skip this.

Open vCenter server and select Site recovery.
Select each site -> Monitor -> SRAs
Select Rescan all SRAs
Verify that the EMC Unity Block SRA is available.

Let’s configure Array Based Replication.
Select Site recovery
Select Inventories -> Double click Array Based Replication
Select “Add array manager”
On the popup wizard select “Add a pair of array managers”

Select location

Select Storage replication adapter, EMC Unity Block SRA

Configure Array manager

Configure array manager pair for secondary site.

Enable the pairs

Click Finish

Verify Status is OK

Click on each storage array and verify no errors and that you can see the local devices being replicated.

Now we can setup the protection group
Select Site recovery
Select Inventories -> Protection Groups
Select “Create Protection group”
Enter name

Select the protection group direction and type. For this we will select array based replication with datastore groups.

Select datastore groups

This will provide information on the VM’s which will be protected.

Click Finish
Verify protection status is OK

Finally you can configure your recovery plan:
Select Site recovery
Select Inventories -> Recovery Plans
Select “Create Recovery plan”
Enter name


Select recovery site
Select protection group

Select network to be used for running tests of the plan.
Click Finish

You can now test your recovery plan.