Rick Verstegen, Technology Professional
Troubleshooting VMware vSAN with ESXCLI
ESXCLI is a command line interface (CLI) framework in VMware vSphere that provides a modular architecture for various components called namespaces running in the VMkernel. Some of these namespaces are, for example, network for vSwitches and storage for vSphere core storage components.
ESXCLI commands are very useful for configuring and troubleshooting ESXi servers. In addition, some settings can only be configured using ESXCLI! For me, ESXCLI proved very helpful at a customer site, where I used it to fix an issue with the VMware vSAN cluster configuration.
At my current customer’s organization, small sites are configured with a VMware vSAN 2 node ROBO solution. The two nodes are directly connected via a 10GbE direct/cross connection and use an external witness node for arbitration.
With the ability to directly connect the vSAN data network across hosts and send witness traffic down an alternate route, there is no requirement for an ethernet fabric switch for the vSAN data network in this design.
A vSAN ROBO solution lowers the total cost of the infrastructure. For the customer, this is a significant cost saving because the intention is to deploy vSAN 2 node at large scale.
Problem and solution
After configuring vSAN on the cluster we received the error “Host cannot communicate with one or more other nodes in the vSAN enabled cluster” in the vSphere Web Client. Despite configuring each node correctly regarding VMkernel ports for vSAN, witness and management, we noticed we only got half of the capacity from the vSAN datastore.
Time to troubleshoot the configuration. The first step was to verify if the nodes could communicate over the VMkernel ports with vSAN traffic enabled. We used vmkping on each node to check if the nodes could reach each other. We also tested if the witness host was reachable from both nodes and vice versa. Our checks succeeded, so communication between the VMkernel ports was OK!
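The reachability check described above can be scripted. The following is a minimal sketch, not the exact commands used on site: the function name check_peers, the PING_CMD variable and the interface name vmk1 are assumptions made for illustration, and you need to substitute the VMkernel interface that carries vSAN traffic on your hosts.

```shell
#!/bin/sh
# Ping each peer address through a specific VMkernel interface.
# PING_CMD defaults to vmkping; it can be overridden so the loop can be
# dry-run on a machine that is not an ESXi host (hypothetical convenience).
PING_CMD="${PING_CMD:-vmkping}"

# check_peers IFACE IP [IP...]
# Prints one OK/FAIL line per address and returns non-zero if any ping fails.
check_peers() {
    iface="$1"
    shift
    rc=0
    for ip in "$@"; do
        # -I selects the outgoing VMkernel interface, -c the number of pings
        if $PING_CMD -I "$iface" -c 3 "$ip" >/dev/null 2>&1; then
            echo "OK   $ip via $iface"
        else
            echo "FAIL $ip via $iface"
            rc=1
        fi
    done
    return "$rc"
}

# Example on node A (vmk1 is an assumed interface name):
#   check_peers vmk1 10.89.16.2
```

Run it once per VMkernel interface and per host, so that the vSAN link between the nodes and the route to the witness are both covered in each direction.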
The next step was to check the vSAN cluster configuration in the vSphere Web Client. The networking mode used for vSAN traffic immediately caught our attention: it was configured as multicast.
All nodes in the cluster run the latest VMware vSAN version (6.6). Since version 6.6, vSAN relies on unicast traffic, while earlier versions relied on multicast. So, multicast is not correct in this situation! It must be unicast traffic.
We needed to use ESXCLI on the nodes to dive deeper into the networking mode issue for the vSAN cluster.
To obtain more information about the vSAN cluster we first ran the command esxcli vsan cluster get.
We only saw one node and the external witness node as members of the cluster, because the value of Sub-Cluster Member Count was 2. We expected to see node A, node B and the external witness, which makes three in total. So, one node was missing.
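Reading the member count by eye works for two nodes, but the comparison is easy to script. This sketch only parses the text printed by esxcli vsan cluster get. The helper names cluster_field and check_member_count are hypothetical, and the parsing assumes each field is printed as "Name: value" on its own line.

```shell
#!/bin/sh
# cluster_field NAME: read `esxcli vsan cluster get` output on stdin and print
# the value of the field called NAME, e.g. "Sub-Cluster Member Count".
# Assumes one "Name: value" pair per line, with optional leading whitespace.
cluster_field() {
    awk -F': ' -v name="$1" '
        { key = $1; sub(/^[ \t]+/, "", key) }   # strip the indentation
        key == name { print $2; exit }          # first exact match wins
    '
}

# check_member_count EXPECTED: succeed only if the reported count equals
# EXPECTED (3 for a 2-node cluster plus its witness).
check_member_count() {
    actual="$(cluster_field 'Sub-Cluster Member Count')"
    if [ "$actual" = "$1" ]; then
        echo "member count OK ($actual)"
    else
        echo "member count is ${actual:-unknown}, expected $1"
        return 1
    fi
}

# Usage on an ESXi host:
#   esxcli vsan cluster get | check_member_count 3
```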
One node was not participating in the cluster, but which one? To find out, we needed the Local Node UUID of each node. On node A and node B we ran the command esxcli vsan cluster get to extract both Local Node UUIDs. The Local Node UUIDs are also required to configure the unicast agents.
The following screenshot shows the Local Node UUID of node B.
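If you prefer not to copy the UUID from the screen by hand, a one-line filter can pull it out of the command output. This is a sketch under the assumption that the value appears on a line starting with "Local Node UUID:". The function name local_node_uuid is made up for this example.

```shell
#!/bin/sh
# local_node_uuid: read `esxcli vsan cluster get` output on stdin and print
# only the value that follows "Local Node UUID:".
local_node_uuid() {
    sed -n 's/^[[:space:]]*Local Node UUID:[[:space:]]*//p'
}

# Usage on each node; note the result together with the node's vSAN IP address:
#   esxcli vsan cluster get | local_node_uuid
```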
Because the vSAN cluster was operating in multicast networking mode (which is not correct), we needed to verify the unicast agents themselves. It is also very important that all nodes in the cluster have a consistent configuration.
You can verify this by running the esxcli vsan cluster unicastagent list command to display all the unicast agents in the vSAN cluster. You need to check this on all nodes that are participating in the vSAN cluster. We executed the command on node A. The command lists every node in the vSAN cluster except the node on which it is run, so on node A the expected result is node B and the witness node. As you can see, it displayed only the witness node.
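The same comparison can be automated by searching the listing for the UUIDs you expect to find. The sketch below treats the output of esxcli vsan cluster unicastagent list as plain text and does not depend on the exact column layout. The function name missing_agents is hypothetical.

```shell
#!/bin/sh
# missing_agents UUID [UUID...]: read `esxcli vsan cluster unicastagent list`
# output on stdin and report every expected UUID that does not appear in it.
# Returns non-zero if at least one agent is missing.
missing_agents() {
    listing="$(cat)"
    rc=0
    for uuid in "$@"; do
        # -F: fixed string, -w: whole word only, -i: ignore case, -q: quiet
        if ! printf '%s\n' "$listing" | grep -qwiF -- "$uuid"; then
            echo "missing unicast agent: $uuid"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on node A, expecting node B (add the witness UUID as a further argument):
#   esxcli vsan cluster unicastagent list | \
#       missing_agents 59c0ea35-e377-224c-b18e-70106f9e5570
```

Because the match is on whole words, the function also accepts IP addresses instead of UUIDs without confusing, for example, 10.89.16.1 with 10.89.16.10.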
Together with VMware support we looked into the problem and they confirmed it was indeed in the unicast agents. But how do we fix this? We needed to add the unicast agents manually.
To add a unicast agent to the vSAN cluster we used the command:
esxcli vsan cluster unicastagent add -a <ip address unicast agent> -U <supports unicast> -u <Local UUID> -t <type>
For this command we need the Local UUID of the node which we noted previously. To add node B as unicast agent on node A we execute the following command:
esxcli vsan cluster unicastagent add -a 10.89.16.2 -U true -u 59c0ea35-e377-224c-b18e-70106f9e5570 -t node
We need to do the same on node B:
esxcli vsan cluster unicastagent add -a 10.89.16.1 -U true -u 59bf6016-623d-d0d4-e7cb-70106f9e55d0 -t node
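When many 2-node sites need the same fix, typing these commands by hand invites mistakes such as a stray space or a typographic dash pasted from a document. A small generator helps. This is only a sketch: the function name agent_add_cmd is hypothetical, and it prints the command for review instead of executing it. It uses only the flags shown above.

```shell
#!/bin/sh
# agent_add_cmd IP UUID: print the command that registers a data node as a
# unicast agent. Nothing is executed; copy the output to the target host.
agent_add_cmd() {
    if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
        echo "usage: agent_add_cmd IP UUID" >&2
        return 1
    fi
    printf 'esxcli vsan cluster unicastagent add -a %s -U true -u %s -t node\n' "$1" "$2"
}

# To run on node A: register node B
agent_add_cmd 10.89.16.2 59c0ea35-e377-224c-b18e-70106f9e5570
# To run on node B: register node A
agent_add_cmd 10.89.16.1 59bf6016-623d-d0d4-e7cb-70106f9e55d0
```

Printing instead of executing is a deliberate choice here: the IP address and UUID must belong to the same remote node, and a quick visual check before pasting is cheap.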
When we run the command esxcli vsan cluster unicastagent list again, we now see all the unicast agents in the vSAN cluster.
On node A
On node B
Now we can verify whether Sub-Cluster Member Count is also updated to the value 3. After running esxcli vsan cluster get, it displays the correct count.
So, you would expect the networking mode in the vSphere Web Client to be set to unicast now, right?
Unfortunately, it was still displaying multicast. However, the error “Host cannot communicate with one or more other nodes in the vSAN enabled cluster” had disappeared. It looked like the vSAN Health Check had not picked up the changes, so we restarted the vmware-vsan-health service on the vCenter Appliance. After the service restart, the networking mode was displaying unicast!
We looked into the vSAN Health and all built-in test results were successful.
The vSAN cluster is now running normally without errors.
As you can see, ESXCLI still plays an important role in troubleshooting. It is a powerful command line interface for both troubleshooting and configuration, so every vSphere admin should know about it and use it on a regular basis in their vSphere environment.
For more information on specific commands, see the vSphere CLI Reference