Clomd service crashes on host

In vSAN 7.x and 8.x, the clomd service can crash very frequently, and restarting the service does not help: it does not stay running after it is started.

Snippets from clomd logs:

=====================================
2023-07-14T11:09:01.416Z PANIC: NOT_REACHED bora/lib/vsan/vsan_config_builder.c:744
2023-07-14T11:09:01.416Z Backtrace:
2023-07-14T11:09:01.416Z Backtrace[0] 0000030b4742c6a0 rip=000000bf0c7de98f rbx=0000030b4742c6a0 rbp=0000030b4742cad0 r12=000000bf0d677788 r13=0000030b4742cae8 r14=000000bf14ce052c r15=000000bf14ce3c2c
2023-07-14T11:09:01.416Z Backtrace[1] 0000030b4742cae0 rip=000000bf0c7dea5b rbx=000000bf14ce0544 rbp=0000030b4742cbb0 r12=000000becbd3e3e0 r13=0000000000000000 r14=000000bf14ce052c r15=000000bf14ce3c2c
2023-07-14T11:09:01.416Z Backtrace[2] 0000030b4742cbc0 rip=000000becb780fab rbx=000000bf14ce0544 rbp=0000030b4742cbc0 r12=000000becbd3e3e0 r13=0000000000000000 r14=000000bf14ce052c r15=000000bf14ce3c2c
2023-07-14T11:09:01.416Z SymBacktrace[3] 0000030b4742cbd0 rip=000000becb770311 in function (null) in object /usr/lib/vmware/vsan/bin/clomd loaded at 000000becb666000
2023-07-14T11:09:01.416Z SymBacktrace[4] 0000030b4742cff0 rip=000000becb77390e in 
2023-07-14T11:09:01.416Z SymBacktrace[12] 0000030b4742ea80 rip=000000becb77e8cc in function (null) in object /usr/lib/vmware/vsan/bin/clomd loaded at 000000becb666000
2023-07-14T11:09:01.416Z SymBacktrace[13] 0000030b4742eb00 rip=000000becb67fd2a in function (null) in object /usr/lib/vmware/vsan/bin/clomd loaded at 000000becb666000
2023-07-14T11:09:01.417Z SymBacktrace[14] 0000030b4742ec70 rip=000000bf0d2ebd5d in function __libc_start_main in object /lib64/libc.so.6 loaded at 000000bf0d2ca000
2023-07-14T11:09:01.417Z SymBacktrace[15] 0000030b4742ed30 rip=000000becb68081d in function (null) in object /usr/lib/vmware/vsan/bin/clomd loaded at 000000becb666000
2023-07-14T11:09:01.417Z SymBacktrace[16] 0000030b4742ed38 rip=0000000000000000
2023-07-14T11:09:01.417Z Failed to dump core: Failure.
2023-07-14T11:09:01.417Z Msg_Post: Error
2023-07-14T11:09:01.417Z [msg.log.error.unrecoverable] vSAN Cluster level Object Manager unrecoverable error: (host-5028003)
2023-07-14T11:09:01.417Z NOT_REACHED bora/lib/vsan/vsan_config_builder.c:744
2023-07-14T11:09:01.417Z [msg.panic.requestSupport.withoutLog] You can request support.
2023-07-14T11:09:01.417Z [msg.panic.requestSupport.vmSupport.vmx86]
2023-07-14T11:09:01.417Z To collect data to submit to VMware technical support, run "vm-support".
2023-07-14T11:09:01.417Z [msg.panic.response] We will respond on the basis of your support entitlement.
2023-07-14T11:09:01.417Z ----------------------------------------
2023-07-14T11:09:01.417Z Exiting

=====================================
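To check whether a host is hitting the same panic, the log can be scanned for the signature shown above. The following is a minimal Python sketch, not an official tool. The function names are hypothetical, and it assumes the PANIC line keeps the format seen in this excerpt. The source line number (744) may differ between builds.

```python
import re

# Pattern modelled on the PANIC line in the excerpt above:
#   <timestamp> PANIC: <reason> <source file>:<line>
PANIC_RE = re.compile(
    r"^(?P<ts>\S+)\s+PANIC:\s+(?P<reason>\S+)\s+(?P<location>\S+:\d+)\s*$"
)

# Location taken from the excerpt; assumed to identify this specific issue.
KNOWN_LOCATION = "bora/lib/vsan/vsan_config_builder.c:744"


def find_panics(lines):
    """Return (timestamp, reason, location) for every PANIC line found."""
    hits = []
    for line in lines:
        match = PANIC_RE.match(line.strip())
        if match:
            hits.append(
                (match.group("ts"), match.group("reason"), match.group("location"))
            )
    return hits


def matches_known_issue(lines, location=KNOWN_LOCATION):
    """True if any PANIC line points at the given source location."""
    return any(loc == location for _, _, loc in find_panics(lines))
```

For example, it could be run against the clomd log on the host (assumed here to be /var/log/clomd.log) with `matches_known_issue(open("/var/log/clomd.log", errors="replace"))`.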

Cause:

During an object format change, the object's config contains both the old and the new layout. In this case the old layout had already been partially cleaned up, leaving the config in an invalid state.
Because of this, clomd crashes whenever it tries to process that object as part of any reconfiguration other than cleanup.

In most cases the crash was observed during VOTES_REBALANCE. The VOTES_REBALANCE work item has a higher priority, so although clom was posting a CLEANUP work item, that work item was not getting processed, and clomd crashed.
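The priority effect described above can be illustrated with a toy model. This is only a sketch of the general idea, not clomd's actual implementation: the priority values, the `InvalidConfigError` exception and the `process_work_items` function are all hypothetical.

```python
import heapq

# Hypothetical priorities (lower number is served first); not clomd's real values.
DEFAULT_PRIORITY = {"VOTES_REBALANCE": 0, "CLEANUP": 1}


class InvalidConfigError(Exception):
    """Stands in for the NOT_REACHED panic on an invalid object config."""


def process_work_items(items, invalid_objects, priorities=DEFAULT_PRIORITY):
    """Process (kind, object_id) work items in priority order.

    Only a CLEANUP item can repair an object whose config is invalid.
    Any other item that touches such an object raises InvalidConfigError.
    """
    invalid = set(invalid_objects)
    # The sequence number keeps items of equal priority in posting order.
    queue = [(priorities[kind], seq, kind, obj)
             for seq, (kind, obj) in enumerate(items)]
    heapq.heapify(queue)

    processed = []
    while queue:
        _, _, kind, obj = heapq.heappop(queue)
        if obj in invalid:
            if kind != "CLEANUP":
                raise InvalidConfigError(f"{kind} hit invalid config of {obj}")
            invalid.discard(obj)  # cleanup removes the stale old layout
        processed.append((kind, obj))
    return processed
```

In this model, posting CLEANUP before VOTES_REBALANCE for the same object still raises the error, because the higher-priority rebalance item is always taken from the queue first. If every restart repeats the same order, the cleanup never gets a chance to run.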

Workaround/Fix:
Upgrade the ESXi host to 8.0a or 7.0U3i.

VirtuallyVTrue

Everything True About Virtualization
