Hi, I have experienced two total outages of a 3-node hyper-converged oVirt 4.2.8 cluster, both caused by vdsm reactivating an unresponsive node and thereby restarting multiple glusterfs daemons. As a result, all VMs were paused and some disk images were corrupted.
At the very beginning, one of the oVirt nodes was overloaded with high memory and CPU usage. The hosted engine had trouble collecting status from vdsm on that node, marked it as unresponsive, and started migrating its workload to the healthy nodes. However, while the migration was in progress the second oVirt node also became unresponsive, because vdsm tried to reactivate the 1st unresponsive node and restarted its glusterd. The gluster domain was therefore re-establishing quorum and waiting for the timeout.
If the reactivation of the 1st node had succeeded and every other node had survived the timeout, that would have been the ideal case. Unfortunately, the second node could not pick up the VMs being migrated because of the gluster I/O timeout, so it too was marked as unresponsive at that moment, and so on... vdsm then restarted glusterd on the second node, which caused a disaster: all nodes were racing on gluster volume self-healing, and I could not put the cluster into maintenance mode either. All I could do was try to resume the paused VMs via virsh, issue a shutdown for each domain, and hard-stop the un-resumable VMs.
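In case it is useful to anyone, below is a rough sketch of that manual step using the libvirt Python binding instead of raw virsh commands. The qemu:///system URI and the "destroy anything that refuses to resume" logic are my own assumptions from this incident, not an official recovery procedure:

#!/usr/bin/env python3
# Sketch: resume paused domains, request a clean shutdown, and
# hard-stop (destroy) the ones that refuse to resume.
import libvirt

conn = libvirt.open("qemu:///system")
for dom in conn.listAllDomains():
    state, _reason = dom.state()
    if state != libvirt.VIR_DOMAIN_PAUSED:
        continue
    name = dom.name()
    try:
        dom.resume()        # same as: virsh resume <name>
        dom.shutdown()      # ask the guest for a clean shutdown
        print(name, "resumed, shutdown requested")
    except libvirt.libvirtError as err:
        print(name, "cannot resume, destroying:", err)
        dom.destroy()       # same as: virsh destroy <name>
conn.close()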
After shutting down a number of VMs and waiting for the gluster healing to complete, the cluster state went back to normal, and I tried to start the VMs that had been manually stopped. Most of them started normally, but a number of them had crashed or were un-startable. I immediately found that the image files of the un-startable VMs were owned by root (I can't explain why), and those could be started again after a chmod. Two of them still cannot start, failing with a "bad volume specification" error. One of them can reach the boot loader, but its LVM metadata was lost.
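For anyone hitting the same ownership problem, here is a small sketch of how one might scan a file-based storage domain for affected images. The path under /rhev/data-center/mnt/glusterSD is an assumption about a typical hyper-converged layout, and 36:36 is the vdsm:kvm uid/gid that oVirt normally expects on image files:

#!/usr/bin/env python3
# Sketch: report image files whose owner is not vdsm:kvm (36:36).
# DOMAIN_PATH is an assumed mount point; adjust to the real gluster mount.
import os

DOMAIN_PATH = "/rhev/data-center/mnt/glusterSD"
VDSM_UID, KVM_GID = 36, 36

for root, _dirs, files in os.walk(DOMAIN_PATH):
    if "/images/" not in root + "/":
        continue                       # only look inside image directories
    for name in files:
        path = os.path.join(root, name)
        st = os.stat(path)
        if st.st_uid != VDSM_UID or st.st_gid != KVM_GID:
            print(path, "owned by", st.st_uid, st.st_gid)
            # os.chown(path, VDSM_UID, KVM_GID)  # uncomment to fix in place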
The impact was huge when vdsm restarted glusterd without human intervention.