I have an HCI cluster running on Gluster storage. I exposed an NFS share into oVirt as a storage domain so that I could clone all of my VMs (I'm preparing to move physically to a new datacenter). I got 3-4 VMs cloned perfectly fine yesterday. But then this evening, I tried to clone a big VM, and it caused the disk to lock up. The VM went totally unresponsive, and I didn't see a way to cancel the clone. Nagios NRPE (on the client VM) was reporting a server load of over 65, but I was never able to establish an SSH connection.

Eventually, I tried restarting ovirt-engine, per https://access.redhat.com/solutions/396753. When that didn't work, I powered down the VM completely, but the disks were still locked. I then tried to put the storage domain into maintenance mode, but that wound up putting the entire domain into a "locked" state. After a while, the disks unlocked on their own, and I was able to power the VM back on.

From start to finish, my VM was down for about 45 minutes, including the time when NRPE was still sending data to Nagios.

What logs should I look at, how can I troubleshoot what went wrong here, and how can I prevent this from happening again?