Hope you are all doing great! For today’s post I wanted to put together some of the commands and troubleshooting steps I’ve used with VMware vSAN.
Identify a partitioned node from a VSAN Cluster (Hosts)
What a single partitioned node looks like:
~ # esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-10-25T10:35:19Z
Local Node UUID: 507e7bd5-ad2f-6424-66cb-1cc1de253de4
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 507e7bd5-ad2f-6424-66cb-1cc1de253de4
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52e4fbe6-7fe4-9e44-f9eb-c2fc1da77631
Sub-Cluster Membership Entry Revision: 7
Sub-Cluster Member UUIDs: 507e7bd5-ad2f-6424-66cb-1cc1de253de4
Sub-Cluster Membership UUID: ba45d050-2e84-c490-845f-1cc1de253de4
~ #
What a full 4-node cluster looks like (no partition)
~ # esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-10-25T10:35:19Z
Local Node UUID: 54188e3a-84fd-9a38-23ba-001b21168828
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 54188e3a-84fd-9a38-23ba-001b21168828
Sub-Cluster Backup UUID: 545ca9af-ff4b-fc84-dcee-001f29595f9f
Sub-Cluster UUID: 529ccbe4-81d2-89bc-7a70-a9c69bd23a19
Sub-Cluster Membership Entry Revision: 3
Sub-Cluster Member UUIDs: 54188e3a-84fd-9a38-23ba-001b21168828, 545ca9af-ff4b-fc84-dcee-001f29595f9f, 5460b129-4084-7550-46e1-0010185def78, 54196e13-7f5f-cba8-5bac-001517a69c72
Sub-Cluster Membership UUID: 80757454-2e11-d20f-3fb0-001b21168828
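In the partitioned case, the host only lists itself under Sub-Cluster Member UUIDs, has no backup node, and has elected itself MASTER. A minimal sketch for a quick check on each host (the expected count of 4 assumes this lab’s 4-node cluster):
# Count how many cluster members this host can see; a healthy 4-node cluster reports 4
esxcli vsan cluster get | grep "Sub-Cluster Member UUIDs" | tr ',' '\n' | wc -l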
Check if there are ongoing/stuck resync operations (Hosts)
while true; do
  echo " ****************************************** "
  echo "" > /tmp/resyncStats.txt
  # For every DOM object, sum its bytesToSync values and convert to GiB
  cmmds-tool find -t DOM_OBJECT -f json | grep uuid | awk -F \" '{print $4}' | while read i; do
    pendingResync=$(cmmds-tool find -t DOM_OBJECT -f json -u $i | grep -o "\"bytesToSync\": [0-9]*," | awk -F " |," '{sum+=$2} END{print sum / 1024 / 1024 / 1024;}')
    # Only print objects that actually have data left to sync (i.e. not just "0")
    if [ ${#pendingResync} -ne 1 ]; then echo "$i: $pendingResync GiB"; fi
  done | tee -a /tmp/resyncStats.txt
  total=$(cat /tmp/resyncStats.txt | awk '{sum+=$2} END{print sum}')
  echo "Total: $total GiB" | tee -a /tmp/resyncStats.txt
  # Keep a timestamped history of the total so progress can be trended
  total=$(cat /tmp/resyncStats.txt | grep Total)
  totalObj=$(cat /tmp/resyncStats.txt | grep -vE " 0 GiB|Total" | wc -l)
  echo "`date +%Y-%m-%dT%H:%M:%SZ` $total ($totalObj objects)" >> /tmp/totalHistory.txt
  echo `date`
  sleep 60
done
No resync operations:
Total: 0 GiB
Sun Oct 25 00:30:29 UTC 2020
Pending resync operation:
ba8d625b-6457-b3ac-6da9-e4434b016608: 212.125 GiB
Total: 212.125 GiB
Sun Oct 25 00:43:29 UTC 2020
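The loop also appends a timestamped total to /tmp/totalHistory.txt, which helps tell a slow resync from a stuck one; if the total stays flat across samples, it is likely stuck:
# Review the trend recorded by the loop above
tail -n 10 /tmp/totalHistory.txt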
Check the state of the components (Hosts)
cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c
Good state:
772 state\": 7 --healthy
Bad state:
2 state\": 13 --inaccessible objects
6 state\": 15 --absent/degraded
Find the UUIDs of the healthy (state 7), inaccessible (state 13), and absent/degraded (state 15) objects. Replace the state number to filter the results per state; for example, here are the two inaccessible objects:
cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep 'state\\\": 13' -B 1 | grep uuid | cut -d "\"" -f4
7328b659-9d15-cb40-68eb-a81e846157fb
e52bb359-d0e0-3ce2-7070-a81e846153d9
If you find objects in state 13 or 15, contact VMware Support as soon as possible to see whether they can be recovered.
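To gather more detail before opening the case, each UUID can be fed back into cmmds-tool; a minimal sketch that reuses the pipeline above (state 13 shown, swap in 15 for absent/degraded):
# Dump the full CMMDS DOM_OBJECT entry for every object in state 13
for uuid in $(cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep 'state\\\": 13' -B 1 | grep uuid | cut -d "\"" -f4); do
  cmmds-tool find -t DOM_OBJECT -f json -u "$uuid"
done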
Disk Balance Test (Hosts)
esxcli vsan health cluster get -t "vSAN Disk Balance"
A balanced scenario
vSAN Disk Balance green
Overview
Metric Value
---------------------------------
Average Disk Usage 10 %
Maximum Disk Usage 12 %
Maximum Variance 4 %
LM Balance Index 2 %
Disk Balance
Host Device Rebalance State Data To Move
------------------------------------------------------------------------------------------------------------------------
An unbalanced scenario
vSAN Disk Balance yellow
Overview
Metric Value
---------------------------------
Average Disk Usage 27 %
Maximum Disk Usage 45 %
Maximum Variance 37 %
LM Balance Index 19 %
Disk Balance
Host Device Rebalance State Data To Move
------------------------------------------------------------------------------------------------------------------------
172.16.11.246 Local TOSHIBA Disk (naa.50000398b821a7c5) Proactive rebalance is in progress 420.5244 GB
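Any other health check can be run the same way by passing its name to -t; to see which checks are available (assuming vSAN 6.6 or later, where the esxcli vsan health namespace exists):
# List every available health check; any test name from this list can be
# passed to: esxcli vsan health cluster get -t "<name>"
esxcli vsan health cluster list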
Verify which devices from a particular host are in which disk groups (Hosts)
[root@esxi-1:~] vdq -iH
Good mapping
Mappings:
DiskMapping[0]:
SSD: naa.58ce38ee2016ffe5
MD: naa.5002538a4819e540
MD: naa.5002538a4819e510
MD: naa.5002538a4819e3e0
DiskMapping[2]:
SSD: naa.58ce38ee2016fe55
MD: naa.5002538a48199ca0
MD: naa.5002538a48199e20
MD: naa.5002538a48199e00
Something is not right… (missing disks)
Mappings:
DiskMapping[0]:
SSD: naa.58ce38ee2016ffe5
MD: naa.5002538a4819e3e0
DiskMapping[2]:
SSD: naa.58ce38ee2016fe55
MD: naa.5002538a48199ca0
MD: naa.5002538a48199e20
MD: naa.5002538a48199e00
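A minimal sketch to make a missing capacity disk stand out, counting the MD devices behind each cache device (based on the vdq -iH output format shown above):
# Print the number of capacity (MD) devices per disk group
vdq -iH | awk '
  /DiskMapping/ { if (grp != "") print grp, n " capacity disk(s)"; grp = $1; n = 0 }
  /MD:/         { n++ }
  END           { if (grp != "") print grp, n " capacity disk(s)" }'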
Find devices that are in PDL state (Hosts)
Run the command vdq -qH and search for IsPDL?; if the value equals 1, the device is in PDL (Permanent Device Loss) state.
[root@esxi-04:~] vdq -qH
DiskResults:
DiskResult[0]:
Name: naa.600508b1001c4b820b4d80f9f8acfa95
VSANUUID: 5294bbd8-67c4-c545-3952-7711e365f7fa
State: In-use for VSAN
ChecksumSupport: 0
Reason: Non-local disk
IsSSD?: 0
IsCapacityFlash?: 0
IsPDL?: 0
<<truncated>>
DiskResult[18]:
Name:
VSANUUID: 5227c17e-ec64-de76-c10e-c272102beba7
State: In-use for VSAN
ChecksumSupport: 0
Reason: None
IsSSD?: 0
IsCapacityFlash?: 0
IsPDL?: 1   <-- device is in PDL state
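Rather than scrolling through every DiskResult, a minimal sketch to surface only the affected devices (the -B 8 assumes the field order shown above):
# Show the full DiskResult block for any device flagged as PDL
vdq -qH | grep -B 8 "IsPDL?: 1"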
Determine if there are enough resources remaining in the cluster to rebuild by simulating a host failure scenario (RVC – vCenter)
The RVC command vsan.whatif_host_failures simulates a host failure to show whether the cluster would have sufficient resources left to re-protect the data; make sure RVC is connected to your vCenter.
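If you have not used RVC before, a minimal sketch for connecting (run from the vCenter Server Appliance shell; the SSO user below is this lab’s assumption):
# Connect RVC to the local vCenter, then change to the cluster path
rvc administrator@vsphere.local@localhost
cd /localhost/virtuallyvtrue/computers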
/localhost/virtuallyvtrue/computers> vsan.whatif_host_failures 0
Simulating 1 host failures:
+-----------------+-----------------------------+-----------------------------------+
| Resource | Usage right now | Usage after failure/re-protection |
+-----------------+-----------------------------+-----------------------------------+
| HDD capacity | 5% used (3635.62 GB free) | 7% used (2680.12 GB free) |
| Components | 3% used (11687 available) | 3% used (8687 available) |
| RC reservations | 0% used (3142.27 GB free) | 0% used (2356.71 GB free) |
+-----------------+-----------------------------+-----------------------------------+
/localhost/virtuallyvtrue/computers>
Testing Virtual SAN functionality – deploying VMs – (RVC – vCenter)
/localhost/virtuallyvtrue/computers> diagnostics.vm_create --datastore ../datastores/vsanDatastore --vm-folder ../vms/Discovered\ virtual\ machine 0
Creating one VM per host ... (timeout = 180 sec)
Success
/localhost/virtuallyvtrue/computers>
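The test VMs this creates are left behind after a successful run; they can be removed with RVC’s vm.destroy. A minimal sketch, where the VM name is a placeholder and the folder path is assumed from the --vm-folder argument above:
vm.destroy ../vms/Discovered\ virtual\ machine/<test-vm-name>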
Start and Stop proactive rebalances (RVC – vCenter)
/vcsa-01.virtuallyvtrue.local/virtuallyvtrue/computers> vsan.proactive_rebalance_info 0
2020-10-25 14:14:27 +0000: Retrieving proactive rebalance information from host esxi-02.virtuallyvtrue.local ...
2020-10-25 14:14:27 +0000: Retrieving proactive rebalance information from host esxi-04.virtuallyvtrue.local ...
2020-10-25 14:14:27 +0000: Retrieving proactive rebalance information from host esxi-01.virtuallyvtrue.local ...
2020-10-25 14:14:27 +0000: Retrieving proactive rebalance information from host esxi-03.virtuallyvtrue.local ...
Proactive rebalance is not running!
Max usage difference triggering rebalancing: 30.00%
Average disk usage: 5.00%
Maximum disk usage: 26.00% (21.00% above mean)
Imbalance index: 5.00%
No disk detected to be rebalanced
Start the proactive rebalance - vsan.proactive_rebalance -s 0
/vcsa-01.virtuallyvtrue.local/virtuallyvtrue/computers> vsan.proactive_rebalance -s 0
2020-10-25 14:15:05 +0000: Processing Virtual SAN proactive rebalance on host esxi-02.virtuallyvtrue.local ...
2020-10-25 14:15:05 +0000: Processing Virtual SAN proactive rebalance on host esxi-04.virtuallyvtrue.local ...
2020-10-25 14:15:05 +0000: Processing Virtual SAN proactive rebalance on host esxi-01.virtuallyvtrue.local ...
2020-10-25 14:15:05 +0000: Processing Virtual SAN proactive rebalance on host esxi-03.virtuallyvtrue.local ...
Proactive rebalance has been started!
/vcsa-01.virtuallyvtrue.local/virtuallyvtrue/computers> vsan.proactive_rebalance_info 0
2020-10-25 14:15:11 +0000: Retrieving proactive rebalance information from host esxi-02.virtuallyvtrue.local ...
2020-10-25 14:15:11 +0000: Retrieving proactive rebalance information from host esxi-04.virtuallyvtrue.local ...
2020-10-25 14:15:11 +0000: Retrieving proactive rebalance information from host esxi-01.virtuallyvtrue.local ...
2020-10-25 14:15:11 +0000: Retrieving proactive rebalance information from host esxi-03.virtuallyvtrue.local ...
Proactive rebalance start: 2020-10-25 14:13:10 UTC
Proactive rebalance stop: 2020-10-25 14:16:17 UTC
Max usage difference triggering rebalancing: 30.00%
Average disk usage: 5.00%
Maximum disk usage: 26.00% (21.00% above mean)
Imbalance index: 5.00%
No disk detected to be rebalanced
Stop the proactive rebalance – vsan.proactive_rebalance -o 0
/vcsa-01.virtuallyvtrue.local/virtuallyvtrue/computers> vsan.proactive_rebalance -o 0
2020-10-25 14:15:45 +0000: Processing Virtual SAN proactive rebalance on host esxi-01.virtuallyvtrue.local ...
2020-10-25 14:15:45 +0000: Processing Virtual SAN proactive rebalance on host esxi-02.virtuallyvtrue.local ...
2020-10-25 14:15:45 +0000: Processing Virtual SAN proactive rebalance on host esxi-04.virtuallyvtrue.local ....
2020-10-25 14:15:45 +0000: Processing Virtual SAN proactive rebalance on host esxi-03.virtuallyvtrue.local ...
Proactive rebalance has been stopped!
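Note: on vSAN 6.7 U3 and later, rebalancing is available as an automatic cluster-level service, and these RVC proactive rebalance commands are deprecated in favor of that setting.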
Verify the health of the Virtual SAN Cluster by its objects’ states (RVC – vCenter)
/vcsa-01.virtuallyvtrue.local/virtuallyvtrue/computers> vsan.obj_status_report 0
2020-10-25 12:47:04 +0000: Querying all VMs on VSAN ...
2020-10-25 12:47:04 +0000: Querying all objects in the system from esxi-02.virtuallyvtrue.local ...
2020-10-25 12:47:04 +0000: Querying all disks in the system from esxi-02.virtuallyvtrue.local ...
2020-10-25 12:47:05 +0000: Querying all components in the system from esxi-02.virtuallyvtrue.local ...
2020-10-25 12:47:05 +0000: Querying all object versions in the system ...
2020-10-25 12:47:06 +0000: Got all the info, computing table ...
Histogram of component health for non-orphaned objects
+-------------------------------------+------------------------------+
| Num Healthy Comps / Total Num Comps | Num objects with such status |
+-------------------------------------+------------------------------+
| 11/11 (OK) | 30 |
| 3/3 (OK) | 6 |
+-------------------------------------+------------------------------+
Total non-orphans: 36
Histogram of component health for possibly orphaned objects
+-------------------------------------+------------------------------+
| Num Healthy Comps / Total Num Comps | Num objects with such status |
+-------------------------------------+------------------------------+
+-------------------------------------+------------------------------+
Total orphans: 0
Total v1 objects: 0
Total v2 objects: 0
Total v2.5 objects: 0
Total v3 objects: 0
Total v5 objects: 36
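If the histogram reports unhealthy or orphaned objects, the same command can also print the affected objects themselves; adding the -t (print table) switch should list each object and its status (verify the option with --help in your RVC version):
vsan.obj_status_report -t 0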
Check if the unicast agents are able to see all nodes (Hosts)
[root@esxi-03:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name
------------------------------------ --------- ---------------- ----------- ----- ----------
5bcf2026-d387-c17e-9755-246e96ccc730 0 true 10.62.45.71 12321
5aed2ebf-7377-e34e-95a7-246e96ccc790 0 true 10.62.45.73 12321
5aedafb5-2826-7ba4-7d56-246e96ccd6b0 0 true 10.62.45.70 12321
[root@esxi-03:~]
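Each host should list every other cluster member (plus the witness on a stretched cluster), so on this 4-node cluster each host should show 3 entries. A minimal sketch to check the count:
# Count unicast agent entries, skipping the two header lines
esxcli vsan cluster unicastagent list | tail -n +3 | wc -l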
Check if all Virtual SAN hosts are in the same network segment (Hosts)
[root@esxi-03:~] esxcli network ip neighbor list -i vmk3
Neighbor Mac Address Vmknic Expiry State Type
----------- ----------------- ------ ------- ----- -------
10.62.45.73 00:50:56:67:c0:87 vmk3 625 sec Unknown
10.62.45.71 00:50:56:65:72:29 vmk3 627 sec Unknown
10.62.45.70 00:50:56:60:a6:8e vmk3 628 sec Unknown
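If a neighbor is missing from this list, a minimal sketch to test reachability over the vSAN vmkernel interface (vmk3 and the target IP are taken from the output above; the jumbo-frame variant only applies if the vSAN network uses MTU 9000):
# Basic reachability test over the vSAN vmknic
vmkping -I vmk3 10.62.45.73
# Validate jumbo frames end to end (-s payload size, -d do not fragment)
vmkping -I vmk3 -s 8972 -d 10.62.45.73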
Validate that the vSAN nodes are communicating with each other over UDP port 12321 (requires the vSAN vmkernel interface)
[root@esxi-h03:~] tcpdump-uw -i vmk3 port 12321
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk3, link-type EN10MB (Ethernet), capture size 262144 bytes
22:47:22.734356 IP 10.11.107.22.12321 > 10.11.107.20.12321: UDP, length 400
22:47:27.736888 IP 10.11.107.22.12321 > 10.11.107.23.12321: UDP, length 400
22:47:27.737094 IP 10.11.107.22.12321 > 10.11.107.21.12321: UDP, length 400
22:47:27.737344 IP 10.11.107.21.12321 > 10.11.107.22.12321: UDP, length 200
22:47:27.737396 IP 10.11.107.20.12321 > 10.11.107.22.12321: UDP, length 464
Check the disk groups available (Hosts)
[root@esxi-03:~] esxcli vsan storage list | grep "VSAN Disk Group UUID:" | sort | uniq -c
      4 VSAN Disk Group UUID: 520243a4-8bb7-c95a-a6e9-28655e26febd
      4 VSAN Disk Group UUID: 52f0561b-e656-8f9a-17da-adb120a1544a
Each device reports the UUID of the disk group it belongs to, so a count of 4 here means one cache device plus three capacity devices in that group.
Check a disk’s physical location (Hosts)
[root@esxi-03:~] esxcli storage core device physical get -d naa.5002538a9823cfd0
   Physical Location: enclosure 1, slot 6
Hope you enjoyed this post and don’t forget to share and comment.