I ran into an interesting problem today on my distributed (enterprise) vRA 7.2 environment and wanted to share how I got it resolved.
I had not deployed anything in my environment for a while, but when I tried today my request never completed and its status stayed at “In Progress”.
Troubleshooting:
Review logs:
- Infrastructure -> Monitoring -> Audit Logs
- The machine request shows that it was started
- Infrastructure -> Monitoring -> Log
- Found this error on my Manager Service nodes: “[EventBrokerService] Failed resuming workflow.. State VMPSMasterWorkflow32.Requested(POST). Event Event Queue operation failed with MessageQueueErrorCode QueueNotFound for queue '30da8a16-c532-4e13-bd81-39b09114a887'.”
- Logged into the Manager Service nodes and reviewed the logs in Event Viewer
- Found error “Error occurred while registering the DEM. System.Data.Services.Client.DataServiceTransportException: The underlying connection was closed: An unexpected error occurred on a send. ---> System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a send. ---> System.IO.IOException: Authentication failed because the remote party has closed the transport stream”
- Logged into the Web server nodes and reviewed the logs in Event Viewer
- Found similar error as above
- Found error messages like “Error occurred writing to the repository tracking log”, “Error occurred while pinging repository”
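If you export the logs to text files, a small script can count how often each of the error fragments quoted above appears, which makes it easier to compare nodes. This is only a sketch: the function name `count_signatures` and the idea of exporting the logs to plain text are my own assumptions, and the fragment list contains just the messages quoted in this post.

```python
from collections import Counter

# Error fragments quoted in this post; extend the list for your own environment.
SIGNATURES = [
    "MessageQueueErrorCode QueueNotFound",
    "Error occurred while registering the DEM",
    "Error occurred writing to the repository tracking log",
    "Error occurred while pinging repository",
]


def count_signatures(log_text: str) -> Counter:
    """Count how many lines of log_text contain each known error fragment."""
    counts = Counter()
    for line in log_text.splitlines():
        for signature in SIGNATURES:
            if signature in line:
                counts[signature] += 1
    return counts
```

To use it, read an exported log file into a string, pass it to `count_signatures`, and print the result of `.most_common()` to see which error dominates on that node.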
Review DEM status:
- Infrastructure -> Monitoring -> DEM status
- Both my DEM Worker and DEM Orchestrator show a status of Active (green)
Resolution:
After some investigation I found two problems that I needed to address:
- If you find errors like “Event Queue operation failed with MessageQueueErrorCode QueueNotFound for queue”, then the Manager Service is probably running on both instances (nodes).
- If you find errors like “System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it”, then the problem is most likely with certificates. The vRA documentation notes that if the OU section of the IaaS certificate contains commas, VM provisioning might fail, and it provides the following workaround:
- Remove the commas from the OU section of the IaaS certificate, OR
- Change the polling method from WebSocket to HTTP to resolve the issues.
- Open the Manager Service configuration file in a text editor:
- C:\Program Files (x86)\VMware\vCAC\Server\ManagerService.exe.config
- Add the following lines to <appSettings>:
<add key="Extensibility.Client.RetrievalMethod" value="Polling"/>
<add key="Extensibility.Client.PollingInterval" value="2000"/>
<add key="Extensibility.Client.PollingMaxEvents" value="128"/>
- Restart the Manager Service.
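If you have to make this change on several nodes, the edit can be scripted instead of done by hand. The sketch below is just one way to do it, not something from the vRA documentation: the function name `apply_polling_settings` is made up, and it assumes `<appSettings>` is a direct child of the root element of the config file. ElementTree's default parser drops XML comments, so work on a backup copy and compare the result before replacing the real file.

```python
import xml.etree.ElementTree as ET

# The three keys from the workaround above.
POLLING_SETTINGS = {
    "Extensibility.Client.RetrievalMethod": "Polling",
    "Extensibility.Client.PollingInterval": "2000",
    "Extensibility.Client.PollingMaxEvents": "128",
}


def apply_polling_settings(config_xml: str) -> str:
    """Return config_xml with the polling keys added or updated under <appSettings>."""
    root = ET.fromstring(config_xml)
    app_settings = root.find("appSettings")  # assumed to sit directly under the root
    if app_settings is None:
        raise ValueError("no <appSettings> element found")
    existing = {node.get("key"): node for node in app_settings.findall("add")}
    for key, value in POLLING_SETTINGS.items():
        if key in existing:
            existing[key].set("value", value)  # update a key that is already there
        else:
            ET.SubElement(app_settings, "add", {"key": key, "value": value})
    return ET.tostring(root, encoding="unicode")
```

Read the config file into a string, pass it through the function, write the output to a new file, and only then swap it in and restart the Manager Service.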
Some other things to verify:
- On Web server Windows OS nodes
- Verify that the VMware vCloud Automation Center Management Agent service is running
- On Manager Service Windows OS nodes
- Verify that the VMware vCloud Automation Center Service is running
- It should be running on only one server if you have a load balancer in front.
- Set the Startup type to Manual on the second server so you don’t have to worry about this service starting there, but remember that you then have to fail over manually by changing the Startup type to Automatic and starting the service.
- In vRA 7.3 the failover process is now automated, which is great!
- Verify that the VMware vCloud Automation Center Management Agent service is running on your instances
- On DEM server Windows OS nodes
- Verify that your VMware vCloud Automation Center Agent and Management Agent services are running
- Most people do not know this, but VMware also has a very cool vRealize Production Test tool, which I will blog about shortly.
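The service checks in the list above can also be scripted rather than clicked through on every node. Here is a rough sketch that wraps the Windows `sc query` command; the helper names `parse_sc_state` and `service_state` are my own. Note that `sc query` expects the service key name, not the display name shown in the Services console. You can look the key name up with `sc getkeyname "<display name>"`, and I have deliberately not hard-coded any key names here because they can differ between installs.

```python
import re
import subprocess


def parse_sc_state(sc_output: str) -> str:
    """Extract the state name (e.g. RUNNING, STOPPED) from `sc query` output."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else "UNKNOWN"


def service_state(service_name: str) -> str:
    """Run `sc query <service_name>` on Windows and return the parsed state."""
    result = subprocess.run(
        ["sc", "query", service_name],
        capture_output=True,
        text=True,
        check=False,  # a missing service should yield UNKNOWN, not an exception
    )
    return parse_sc_state(result.stdout)
```

On the Manager Service nodes you would expect RUNNING on the active node and STOPPED on the passive one; on the Web and DEM nodes every service you check should report RUNNING.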
Links: