Hi all:
Today has been rough. Two of my three nodes went down today, and self-heal has not been healing well. Four hours later the VMs are running, but the engine is not happy: it claims the storage domain is down (even though it is up on all hosts and the VMs are running). I'm getting a ton of these messages in the log:
VDSM engine3 command HSMGetAllTasksStatusesVDS failed: Not SPM
Aug 4, 2017 7:23:00 PM
VDSM engine3 command SpmStatusVDS failed: Error validating master storage domain: ('MD read error',)
Aug 4, 2017 7:22:49 PM
VDSM engine3 command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5868392a-0148-02cf-014d-000000000121, msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
Aug 4, 2017 7:22:47 PM
VDSM engine1 command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5868392a-0148-02cf-014d-000000000121, msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
Aug 4, 2017 7:22:46 PM
VDSM engine2 command SpmStatusVDS failed: Error validating master storage domain: ('MD read error',)
Aug 4, 2017 7:22:44 PM
VDSM engine2 command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5868392a-0148-02cf-014d-000000000121, msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
Aug 4, 2017 7:22:42 PM
VDSM engine1 command HSMGetAllTasksStatusesVDS failed: Not SPM: ()
------------
I cannot get an SPM elected because the engine claims the storage domain is down, and I cannot bring the storage domain back up.
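In case it helps with diagnosis, here is roughly how I've been checking the master domain from a host. The msdUUID comes from the log above, but which glusterSD mount it lives under is my guess, so treat the paths as an example:

find /rhev/data-center/mnt -maxdepth 3 -name cdaf180c-fde6-4cb3-b6e5-b6bd869c8770
cat /rhev/data-center/mnt/glusterSD/192.168.8.11:_data/cdaf180c-fde6-4cb3-b6e5-b6bd869c8770/dom_md/metadata

(dom_md/metadata is the per-domain metadata file vdsm reads; the "MD read error" above suggests that read is failing.)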
Also in the storage realm, one of my exports shows substantially less data than is actually there.
Here's what happened, as best as I understand it:
I went to do maintenance on ovirt2 (I needed to replace a faulty RAM stick and rework the disk). I put it in maintenance mode, then shut it down and did my work. In the process, much of the disk contents were lost (all the gluster data). I figured no big deal: the gluster data is redundant on the network, and it would heal when the node came back up.
While I was doing maintenance, all but one of the VMs were running on engine1. When I turned engine2 back on, all of a sudden every VM, including the hosted engine, stopped and went non-responsive. As far as I can tell this should not have happened, since I had turned ON a host, but nonetheless I waited for recovery to occur (while customers started calling, asking why everything had stopped working...). While waiting, I checked, and gluster volume status only showed ovirt1 and ovirt2; apparently glusterd had stopped or failed at some point on ovirt3. I assume that was the cause of the outage. Still, if everything was working fine with ovirt1's gluster, and ovirt2 powers on with a very broken gluster (volume status was showing N/A in the port fields for the gluster volumes), I would not expect a working gluster to go stupid like that.
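For reference, the status checks here were along these lines, run on each node (the volume name is just one of mine; same commands for data and iso):

systemctl status glusterd
gluster peer status
gluster volume status engine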
After starting glusterd on ovirt3 and checking the status, all three nodes showed ovirt1 and ovirt3 as operational and ovirt2 as N/A. Unfortunately, recovery was still not happening, so I did some googling and found the commands to query the hosted-engine status. The engine VM appeared to be stuck "paused" and I couldn't find a way to unpause it, so I powered it off and then started it manually on engine1, and the cluster came back up. It showed all VMs paused; I was able to unpause them and they worked again.
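For anyone retracing this, the hosted-engine commands involved were roughly these (from the hosted-engine HA tooling, run on a hosted-engine host):

hosted-engine --vm-status
hosted-engine --vm-poweroff
hosted-engine --vm-start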
So now I began to work on the ovirt2 gluster healing problem. It didn't appear to be self-healing, but eventually I found this document:
and from that found the magic xattr commands. After setting them, the gluster volumes on ovirt2 came online. I told the iso volume to heal, and it did, but it only came up with about half as much data as it should have. I told it to heal full, and it finished off the remaining data and came up to full. I then told engine to do a full heal (gluster volume heal engine full), and it transferred its data from the other gluster hosts too. However, it said it was done when it hit 9.7GB, while there is 15GB on disk! It is still stuck that way; the oVirt GUI and gluster volume heal engine info both show the volume fully healed, but it is not, as the df output below shows.
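For completeness, the xattr commands were along these lines. The trusted.afr key name and client index below are examples from my setup, and the right ones depend on which brick was out of date, so check your own getfattr output first:

getfattr -d -m . -e hex /gluster/brick1
setfattr -n trusted.afr.engine-client-1 -v 0x000000000000000000000000 /gluster/brick1
gluster volume heal engine full
gluster volume heal engine info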
[root@ovirt1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos_ovirt-root 20G 4.2G 16G 21% /
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 16K 16G 1% /dev/shm
tmpfs 16G 26M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/mapper/gluster-engine 25G 12G 14G 47% /gluster/brick1
/dev/sda1 497M 315M 183M 64% /boot
/dev/mapper/gluster-data 136G 124G 13G 92% /gluster/brick2
/dev/mapper/gluster-iso 25G 7.3G 18G 29% /gluster/brick4
tmpfs 3.2G 0 3.2G 0% /run/user/0
192.168.8.11:/engine 15G 9.7G 5.4G 65% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data 136G 124G 13G 92% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso 13G 7.3G 5.8G 56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso
This is from ovirt1; before the work, ovirt1's and ovirt2's bricks had the same usage. On ovirt2 the bricks and the gluster mountpoints agree for iso and engine, but as you can see, on ovirt1 they do not. If I do a du -sh on /rhev/data-center/mnt/glusterSD/..../_engine, it comes back with the 12GB number (brick1 is engine, brick2 is data, and brick4 is iso). However, gluster still says it's only 9.7G. I haven't figured out how to get it to finish "healing".
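For the record, the comparison I'm making is this, run on each node (brick1 is the engine brick on my boxes; the split-brain check is just to rule that out):

du -sh /gluster/brick1
du -sh /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
gluster volume heal engine info
gluster volume heal engine info split-brain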
The data volume is still in the process of healing as I write this.
So, I think I have two main things to solve right now:
1) How do I get oVirt to see the data center/storage domain as online again?
2) How do I get the engine volume to finish healing onto ovirt2?
Thanks all for reading this very long message!
--Jim