Hope you are all doing great! For today’s post I wanted to put together some of the commands and troubleshooting steps I’ve used with VMware vSAN.
Identify a partitioned node from a VSAN Cluster (Hosts)
What a single partitioned node looks like:
~ # esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-10-25T10:35:19Z
Local Node UUID: 507e7bd5-ad2f-6424-66cb-1cc1de253de4
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 507e7bd5-ad2f-6424-66cb-1cc1de253de4
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 52e4fbe6-7fe4-9e44-f9eb-c2fc1da77631
Sub-Cluster Membership Entry Revision: 7
Sub-Cluster Member UUIDs: 507e7bd5-ad2f-6424-66cb-1cc1de253de4
Sub-Cluster Membership UUID: ba45d050-2e84-c490-845f-1cc1de253de4
~ #
What a full 4-node cluster looks like (no partition)
~ # esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-10-25T10:35:19Z
Local Node UUID: 54188e3a-84fd-9a38-23ba-001b21168828
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 54188e3a-84fd-9a38-23ba-001b21168828
Sub-Cluster Backup UUID: 545ca9af-ff4b-fc84-dcee-001f29595f9f
Sub-Cluster UUID: 529ccbe4-81d2-89bc-7a70-a9c69bd23a19
Sub-Cluster Membership Entry Revision: 3
Sub-Cluster Member UUIDs: 54188e3a-84fd-9a38-23ba-001b21168828, 545ca9af-ff4b-fc84-dcee-001f29595f9f, 5460b129-4084-7550-46e1-0010185def78, 54196e13-7f5f-cba8-5bac-001517a69c72
Sub-Cluster Membership UUID: 80757454-2e11-d20f-3fb0-001b21168828
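In the partitioned case, the host only lists itself under Sub-Cluster Member UUIDs, has no backup node, and has elected itself MASTER. A minimal sketch for a quick check on each host (the expected count of 4 assumes this lab’s 4-node cluster):
# Count how many cluster members this host can see; a healthy 4-node cluster reports 4
esxcli vsan cluster get | grep "Sub-Cluster Member UUIDs" | tr ',' '\n' | wc -l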
Check if there are ongoing/stuck resync operations (Hosts)
while true; do
  echo " ****************************************** "
  echo "" > /tmp/resyncStats.txt
  # For every DOM object, sum its bytesToSync values and convert to GiB
  cmmds-tool find -t DOM_OBJECT -f json | grep uuid | awk -F \" '{print $4}' | while read i; do
    pendingResync=$(cmmds-tool find -t DOM_OBJECT -f json -u $i | grep -o "\"bytesToSync\": [0-9]*," | awk -F " |," '{sum+=$2} END{print sum / 1024 / 1024 / 1024;}')
    # Only print objects that actually have data left to sync (i.e. not just "0")
    if [ ${#pendingResync} -ne 1 ]; then echo "$i: $pendingResync GiB"; fi
  done | tee -a /tmp/resyncStats.txt
  total=$(cat /tmp/resyncStats.txt | awk '{sum+=$2} END{print sum}')
  echo "Total: $total GiB" | tee -a /tmp/resyncStats.txt
  # Keep a timestamped history of the total so progress can be trended
  total=$(cat /tmp/resyncStats.txt | grep Total)
  totalObj=$(cat /tmp/resyncStats.txt | grep -vE " 0 GiB|Total" | wc -l)
  echo "`date +%Y-%m-%dT%H:%M:%SZ` $total ($totalObj objects)" >> /tmp/totalHistory.txt
  echo `date`
  sleep 60
done
No resync operations:
Total: 0 GiB
Sun Oct 25 00:30:29 UTC 2020
Pending resync operation:
ba8d625b-6457-b3ac-6da9-e4434b016608: 212.125 GiB
Total: 212.125 GiB
Sun Oct 25 00:43:29 UTC 2020
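The loop also appends a timestamped total to /tmp/totalHistory.txt, which helps tell a slow resync from a stuck one; if the total stays flat across samples, it is likely stuck:
# Review the trend recorded by the loop above
tail -n 10 /tmp/totalHistory.txt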
Check the state of the components (Hosts)
cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c
Good state:
772 state\": 7 --healthy
Bad state:
2 state\": 13 --inaccessible objects
6 state\": 15 --absent/degraded
Find the UUIDs of the healthy (state 7), inaccessible (state 13), and absent/degraded (state 15) objects. Replace the state number to filter the results per state; for example, here are the two inaccessible objects:
cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep 'state\\\": 13' -B 1 | grep uuid | cut -d "\"" -f4
7328b659-9d15-cb40-68eb-a81e846157fb
e52bb359-d0e0-3ce2-7070-a81e846153d9
If you find objects in state 13 or 15, contact VMware Support as soon as possible to see whether they can be recovered.
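To gather more detail before opening the case, each UUID can be fed back into cmmds-tool; a minimal sketch that reuses the pipeline above (state 13 shown, swap in 15 for absent/degraded):
# Dump the full CMMDS DOM_OBJECT entry for every object in state 13
for uuid in $(cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep 'state\\\": 13' -B 1 | grep uuid | cut -d "\"" -f4); do
  cmmds-tool find -t DOM_OBJECT -f json -u "$uuid"
done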
Disk Balance Test (Hosts)
esxcli vsan health cluster get -t "vSAN Disk Balance"
A balanced scenario
vSAN Disk Balance green
Overview
Metric Value
---------------------------------
Average Disk Usage 10 %
Maximum Disk Usage 12 %
Maximum Variance 4 %
LM Balance Index 2 %
Disk Balance
Host Device Rebalance State Data To Move
------------------------------------------------------------------------------------------------------------------------
An unbalanced scenario
vSAN Disk Balance yellow
Overview
Metric Value
---------------------------------
Average Disk Usage 27 %
Maximum Disk Usage 45 %
Maximum Variance 37 %
LM Balance Index 19 %
Disk Balance
Host Device Rebalance State Data To Move
------------------------------------------------------------------------------------------------------------------------
172.16.11.246 Local TOSHIBA Disk (naa.50000398b821a7c5) Proactive rebalance is in progress 420.5244 GB
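Any other health check can be run the same way by passing its name to -t; to see which checks are available (assuming vSAN 6.6 or later, where the esxcli vsan health namespace exists):
# List every available health check; any test name from this list can be
# passed to: esxcli vsan health cluster get -t "<name>"
esxcli vsan health cluster list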
Verify which devices from a particular host are in which disk groups (Hosts)
[root@esxi-1:~] vdq -iH
Good mapping
Mappings:
DiskMapping[0]:
SSD: naa.58ce38ee2016ffe5
MD: naa.5002538a4819e540
MD: naa.5002538a4819e510
MD: naa.5002538a4819e3e0
DiskMapping[2]:
SSD: naa.58ce38ee2016fe55
MD: naa.5002538a48199ca0
MD: naa.5002538a48199e20
MD: naa.5002538a48199e00
Something is not right… (missing disks)
Mappings:
DiskMapping[0]:
SSD: naa.58ce38ee2016ffe5
MD: naa.5002538a4819e3e0
DiskMapping[2]:
SSD: naa.58ce38ee2016fe55
MD: naa.5002538a48199ca0
MD: naa.5002538a48199e20
MD: naa.5002538a48199e00
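A minimal sketch to make a missing capacity disk stand out, counting the MD devices behind each cache device (based on the vdq -iH output format shown above):
# Print the number of capacity (MD) devices per disk group
vdq -iH | awk '
  /DiskMapping/ { if (grp != "") print grp, n " capacity disk(s)"; grp = $1; n = 0 }
  /MD:/         { n++ }
  END           { if (grp != "") print grp, n " capacity disk(s)" }'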
Find devices that are in PDL state (Hosts)
Run the command vdq -qH and search for IsPDL?; if the value equals 1, the device is in PDL (Permanent Device Loss) state.
[root@esxi-04:~] vdq -qH
DiskResults:
DiskResult[0]:
Name: naa.600508b1001c4b820b4d80f9f8acfa95
VSANUUID: 5294bbd8-67c4-c545-3952-7711e365f7fa
State: In-use for VSAN
ChecksumSupport: 0
Reason: Non-local disk
IsSSD?: 0
IsCapacityFlash?: 0
IsPDL?: 0
<<truncated>>
DiskResult[18]:
Name:
VSANUUID: 5227c17e-ec64-de76-c10e-c272102beba7
State: In-use for VSAN
ChecksumSupport: 0
Reason: None
IsSSD?: 0
IsCapacityFlash?: 0
IsPDL?: 1   <-- device is in PDL state
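Rather than scrolling through every DiskResult, a minimal sketch to surface only the affected devices (the -B 8 assumes the field order shown above):
# Show the full DiskResult block for any device flagged as PDL
vdq -qH | grep -B 8 "IsPDL?: 1"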
Determine if there are enough resources remaining in the cluster to rebuild by simulating a host failure scenario (RVC – vCenter)
The RVC command vsan.whatif_host_failures simulates a host failure to show whether the cluster would have sufficient resources left to re-protect the data; make sure RVC is connected to your vCenter.
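If you have not used RVC before, a minimal sketch for connecting (run from the vCenter Server Appliance shell; the SSO user below is this lab’s assumption):
# Connect RVC to the local vCenter, then change to the cluster path
rvc administrator@vsphere.local@localhost
cd /localhost/virtuallyvtrue/computers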
/localhost/virtuallyvtrue/computers> vsan.whatif_host_failures 0
Simulating 1 host failures:
+-----------------+-----------------------------+-----------------------------------+
| Resource | Usage right now | Usage after failure/re-protection |
+-----------------+-----------------------------+-----------------------------------+
| HDD capacity | 5% used (3635.62 GB free) | 7% used (2680.12 GB free) |
| Components | 3% used (11687 available) | 3% used (8687 available) |
| RC reservations | 0% used (3142.27 GB free) | 0% used (2356.71 GB free) |
+-----------------+-----------------------------+-----------------------------------+
/localhost/virtuallyvtrue/computers>
Testing Virtual SAN functionality – deploying VMs – (RVC – vCenter)
/localhost/virtuallyvtrue/computers> diagnostics.vm_create --datastore ../datastores/vsanDatastore --vm-folder ../vms/Discovered\ virtual\ machine 0
Creating one VM per host ... (timeout = 180 sec)
Success
/localhost/virtuallyvtrue/computers>
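The test VMs this creates are left behind after a successful run; they can be removed with RVC’s vm.destroy. A minimal sketch, where the VM name is a placeholder and the folder path is assumed from the --vm-folder argument above:
vm.destroy ../vms/Discovered\ virtual\ machine/<test-vm-name>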
Start and Stop proactive rebalances (RVC – vCenter)
/vcsa-01.virtuallyvtrue.local/virtuallyvtrue/computers> vsan.proactive_rebalance_info 0
2020-10-25 14:14:27 +0000: Retrieving proactive rebalance information from host esxi-02.virtuallyvtrue.local ...
2020-10-25 14:14:27 +0000: Retrieving proactive rebalance information from host esxi-04.virtuallyvtrue.local ...
2020-10-25 14:14:27 +0000: Retrieving proactive rebalance information from host esxi-01.virtuallyvtrue.local ...
2020-10-25 14:14:27 +0000: Retrieving proactive rebalance information from host esxi-03.virtuallyvtrue.local ...
Proactive rebalance is not running!
Max usage difference triggering rebalancing: 30.00%
Average disk usage: 5.00%
Maximum disk usage: 26.00% (21.00% above mean)
Imbalance index: 5.00%
No disk detected to be rebalanced
Start the proactive rebalance - vsan.proactive_rebalance -s 0
/vcsa-01.virtuallyvtrue.local/virtuallyvtrue/computers> vsan.proactive_rebalance -s 0
2020-10-25 14:15:05 +0000: Processing Virtual SAN proactive rebalance on host esxi-02.virtuallyvtrue.local ...
2020-10-25 14:15:05 +0000: Processing Virtual SAN proactive rebalance on host esxi-04.virtuallyvtrue.local ...
2020-10-25 14:15:05 +0000: Processing Virtual SAN proactive rebalance on host esxi-01.virtuallyvtrue.local ...
2020-10-25 14:15:05 +0000: Processing Virtual SAN proactive rebalance on host esxi-03.virtuallyvtrue.local ...
Proactive rebalance has been started!
/vcsa-01.virtuallyvtrue.local/virtuallyvtrue/computers> vsan.proactive_rebalance_info 0
2020-10-25 14:15:11 +0000: Retrieving proactive rebalance information from host esxi-02.virtuallyvtrue.local ...
2020-10-25 14:15:11 +0000: Retrieving proactive rebalance information from host esxi-04.virtuallyvtrue.local ...
2020-10-25 14:15:11 +0000: Retrieving proactive rebalance information from host esxi-01.virtuallyvtrue.local ...
2020-10-25 14:15:11 +0000: Retrieving proactive rebalance information from host esxi-03.virtuallyvtrue.local ...
Proactive rebalance start: 2020-10-25 14:13:10 UTC
Proactive rebalance stop: 2020-10-25 14:16:17 UTC
Max usage difference triggering rebalancing: 30.00%
Average disk usage: 5.00%
Maximum disk usage: 26.00% (21.00% above mean)
Imbalance index: 5.00%
No disk detected to be rebalanced
Stop the proactive rebalance – vsan.proactive_rebalance -o 0
/vcsa-01.virtuallyvtrue.local/virtuallyvtrue/computers> vsan.proactive_rebalance -o 0
2020-10-25 14:15:45 +0000: Processing Virtual SAN proactive rebalance on host esxi-01.virtuallyvtrue.local ...
2020-10-25 14:15:45 +0000: Processing Virtual SAN proactive rebalance on host esxi-02.virtuallyvtrue.local ...
2020-10-25 14:15:45 +0000: Processing Virtual SAN proactive rebalance on host esxi-04.virtuallyvtrue.local ....
2020-10-25 14:15:45 +0000: Processing Virtual SAN proactive rebalance on host esxi-03.virtuallyvtrue.local ...
Proactive rebalance has been stopped!
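Note: on vSAN 6.7 U3 and later, rebalancing is available as an automatic cluster-level service, and these RVC proactive rebalance commands are deprecated in favor of that setting.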
Verify the health of the Virtual SAN Cluster by its objects’ states (RVC – vCenter)
/vcsa-01.virtuallyvtrue.local/virtuallyvtrue/computers> vsan.obj_status_report 0
2020-10-25 12:47:04 +0000: Querying all VMs on VSAN ...
2020-10-25 12:47:04 +0000: Querying all objects in the system from esxi-02.virtuallyvtrue.local ...
2020-10-25 12:47:04 +0000: Querying all disks in the system from esxi-02.virtuallyvtrue.local ...
2020-10-25 12:47:05 +0000: Querying all components in the system from esxi-02.virtuallyvtrue.local ...
2020-10-25 12:47:05 +0000: Querying all object versions in the system ...
2020-10-25 12:47:06 +0000: Got all the info, computing table ...
Histogram of component health for non-orphaned objects
+-------------------------------------+------------------------------+
| Num Healthy Comps / Total Num Comps | Num objects with such status |
+-------------------------------------+------------------------------+
| 11/11 (OK) | 30 |
| 3/3 (OK) | 6 |
+-------------------------------------+------------------------------+
Total non-orphans: 36
Histogram of component health for possibly orphaned objects
+-------------------------------------+------------------------------+
| Num Healthy Comps / Total Num Comps | Num objects with such status |
+-------------------------------------+------------------------------+
+-------------------------------------+------------------------------+
Total orphans: 0
Total v1 objects: 0
Total v2 objects: 0
Total v2.5 objects: 0
Total v3 objects: 0
Total v5 objects: 36
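If the histogram reports unhealthy or orphaned objects, the same command can also print the affected objects themselves; adding the -t (print table) switch should list each object and its status (verify the option with --help in your RVC version):
vsan.obj_status_report -t 0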
Check if the unicast agents are able to see all nodes (Hosts)
[root@esxi-03:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name
------------------------------------ --------- ---------------- ----------- ----- ----------
5bcf2026-d387-c17e-9755-246e96ccc730 0 true 10.62.45.71 12321
5aed2ebf-7377-e34e-95a7-246e96ccc790 0 true 10.62.45.73 12321
5aedafb5-2826-7ba4-7d56-246e96ccd6b0 0 true 10.62.45.70 12321
[root@esxi-03:~]
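Each host should list every other cluster member (plus the witness on a stretched cluster), so on this 4-node cluster each host should show 3 entries. A minimal sketch to check the count:
# Count unicast agent entries, skipping the two header lines
esxcli vsan cluster unicastagent list | tail -n +3 | wc -l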
Check if all Virtual SAN hosts are in the same network segment (Hosts)
[root@esxi-03:~] esxcli network ip neighbor list -i vmk3
Neighbor Mac Address Vmknic Expiry State Type
----------- ----------------- ------ ------- ----- -------
10.62.45.73 00:50:56:67:c0:87 vmk3 625 sec Unknown
10.62.45.71 00:50:56:65:72:29 vmk3 627 sec Unknown
10.62.45.70 00:50:56:60:a6:8e vmk3 628 sec Unknown
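If a neighbor is missing from this list, a minimal sketch to test reachability over the vSAN vmkernel interface (vmk3 and the target IP are taken from the output above; the jumbo-frame variant only applies if the vSAN network uses MTU 9000):
# Basic reachability test over the vSAN vmknic
vmkping -I vmk3 10.62.45.73
# Validate jumbo frames end to end (-s payload size, -d do not fragment)
vmkping -I vmk3 -s 8972 -d 10.62.45.73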
Validate that the vSAN nodes are communicating with each other over UDP port 12321 (requires the vSAN vmkernel interface)
[root@esxi-h03:~] tcpdump-uw -i vmk3 port 12321
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk3, link-type EN10MB (Ethernet), capture size 262144 bytes
22:47:22.734356 IP 10.11.107.22.12321 > 10.11.107.20.12321: UDP, length 400
22:47:27.736888 IP 10.11.107.22.12321 > 10.11.107.23.12321: UDP, length 400
22:47:27.737094 IP 10.11.107.22.12321 > 10.11.107.21.12321: UDP, length 400
22:47:27.737344 IP 10.11.107.21.12321 > 10.11.107.22.12321: UDP, length 200
22:47:27.737396 IP 10.11.107.20.12321 > 10.11.107.22.12321: UDP, length 464
Check the disk groups available (Hosts)
[root@esxi-03:~] esxcli vsan storage list | grep "VSAN Disk Group UUID:" | sort | uniq -c
      4 VSAN Disk Group UUID: 520243a4-8bb7-c95a-a6e9-28655e26febd
      4 VSAN Disk Group UUID: 52f0561b-e656-8f9a-17da-adb120a1544a
Each device reports the UUID of the disk group it belongs to, so a count of 4 here means one cache device plus three capacity devices in that group.
Check a disk’s physical location (Hosts)
[root@esxi-03:~] esxcli storage core device physical get -d naa.5002538a9823cfd0
   Physical Location: enclosure 1, slot 6
Hope you enjoyed this post and don’t forget to share and comment.