Hi Jorge,
In short, our environment is set up like this:
An oVirt cluster (standalone cluster manager) contains 5 physical servers (nodes); each node
hosts up to 5 VMs (Linux/Windows).
From a storage perspective this is set up as a "host cluster", as IBM calls it,
meaning that all 5 nodes have access to all volumes in the cluster. For instance, if node 1
hosts VMs 1 and 2, node 2 hosts VMs 3 and 4, and so on, every node still has access to all
volumes, and the volumes can be listed on each node using multipath -ll, regardless of
whether a volume is in use by a VM on that specific node or not.
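For what it's worth, the per-node visibility check we rely on can be sketched like this (the WWID value is a placeholder, not one of our real volumes):

```shell
# Confirm a given volume is visible on this node; the WWID below is a
# placeholder for the identifier reported by the storage array.
WWID="360050768xxxxxxxx"

check_volume_visible() {
  # multipath -ll prints every mapped multipath device with its paths;
  # grep for the WWID to confirm this node sees the volume.
  multipath -ll | grep -q "$WWID"
}
```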
When expanding a volume or a set of volumes, this is done at the storage (block) level.
Once that is done, the volume is located on the specific VM and a "Refresh LUN" is
issued from the GUI. In some cases this works fine; in other cases it crashes the VM...
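For context, the online-resize pickup that the "Refresh LUN" action effectively depends on can be sketched as follows; sdX and mpathX are placeholders, and on a host cluster this has to happen on every node that maps the volume:

```shell
# Sketch of picking up a grown LUN on one node; sdX and mpathX are
# placeholders for a real SCSI path device and its multipath map.
PATH_DEV="sdX"
MPATH_MAP="mpathX"

rescan_resized_lun() {
  # 1) Ask the kernel to re-read the capacity of the path device.
  echo 1 > "/sys/block/$PATH_DEV/device/rescan"
  # 2) Tell multipathd to propagate the new size to the multipath map.
  multipathd resize map "$MPATH_MAP"
}
```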
Eventlog from manager:
Jan 29, 2025, 8:28:31 AM - VM xxxxx is down with error. Exit message: Lost connection with
qemu process.
Jan 29, 2025, 8:27:45 AM - VM xxxxx is not responding.
Jan 29, 2025, 8:27:16 AM - Status of host yyyyyy was set to Up.
Jan 29, 2025, 8:27:15 AM - VM xxxxx is not responding.
Jan 29, 2025, 8:26:57 AM - Host cluster zzzzzz was updated by system
Jan 29, 2025, 8:26:49 AM - Host yyyyyy is not responding. Host cannot be fenced
automatically because power management for the host is disabled.
Jan 29, 2025, 8:26:49 AM - Direct LUN synchronization failed.
Jan 29, 2025, 8:26:49 AM - VDSM yyyyyy command ExtendVmDiskSizeVDS failed: Message timeout
which can be caused by communication issues
Jan 29, 2025, 8:26:30 AM - VM xxxxxx is not responding.
Jan 29, 2025, 8:26:00 AM - VM xxxxxx is not responding.
Jan 29, 2025, 8:24:45 AM - VM xxxxxx is not responding.
Jan 29, 2025, 8:23:40 AM - Direct LUN synchronization started.
From here, the only way to get the VM to respond again is to log in to the node where the
VM resides and issue a "virsh destroy xxxxxx".
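Concretely, the recovery step on the node looks like this ("xxxxxx" stands in for the real VM name from the log):

```shell
# Hard-stop an unresponsive VM from the node that hosts it; "xxxxxx"
# is the placeholder VM name from the event log above.
VM_NAME="xxxxxx"

force_stop_vm() {
  # virsh destroy kills the qemu process immediately (a hard power-off);
  # it does not undefine the domain, so the VM can be started again.
  virsh destroy "$VM_NAME"
}
```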
/John