I have an HCI cluster running on Gluster storage. I exposed an NFS share into oVirt as a
storage domain so that I could clone all of my VMs (I'm preparing to move physically
to a new datacenter). I got 3-4 VMs cloned perfectly fine yesterday. But then this
evening, I tried to clone a big VM, and it caused the disk to lock up. The VM went totally
unresponsive, and I didn't see a way to cancel the clone. Nagios NRPE (on the client
VM) was reporting a server load above 65, but I was never able to establish an SSH
connection.
Eventually, I tried restarting the ovirt-engine, per
https://access.redhat.com/solutions/396753. When that didn't work, I powered down the
VM completely, but the disks were still locked. I then tried to put the storage domain
into maintenance mode, but that wound up putting the entire domain into a
"locked" state. Eventually, the disks unlocked, and I was able to power
the VM back on.
From start to finish, my VM was down for about 45 minutes, including the time when NRPE
was still sending data to Nagios.
Which logs should I look at, how can I troubleshoot what went wrong here, and how can I
keep this from happening again?