Re: [ovirt-users] Recovering from a multi-node failure

6 Aug 2017

Well, after a very stressful weekend, I think I have things largely
working.  It turns out that most of the issues in my earlier message
(quoted below) were caused by the Linux permissions on the exports for all
three volumes (they had been reset to 600; setting them to 774 or 770 fixed
many of the issues).  Of course, I didn't find that until after a much more
harrowing outage and hours and hours of work, including beginning to look
at rebuilding my cluster....

So, now my cluster is operating again, and everything looks good EXCEPT for
one major Gluster issue/question that I haven't found any references or
info on.

My host ovirt2, one of the replica gluster servers, is the one that lost
its storage and had to reinitialize it from the cluster.  The iso volume is
perfectly fine and complete, but the engine and data volumes are smaller on
disk on this node than on the other node (and than on this node before the
crash).  On the engine store, the entire cluster reports the smaller
utilization on the mounted gluster filesystems; on the data partition, it
reports the larger size (matching the rest of the cluster).  Here's some df
output to help clarify:

View from the other node (brick1 = engine, brick2 = data, brick4 = iso):
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/gluster-engine      25G   12G   14G  47% /gluster/brick1
/dev/mapper/gluster-data       136G  125G   12G  92% /gluster/brick2
/dev/mapper/gluster-iso         25G  7.3G   18G  29% /gluster/brick4
192.168.8.11:/engine            15G  9.7G  5.4G  65% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data             136G  125G   12G  92% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso               13G  7.3G  5.8G  56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso

View from ovirt2:
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/gluster-engine      15G  9.7G  5.4G  65% /gluster/brick1
/dev/mapper/gluster-data       174G  119G   56G  69% /gluster/brick2
/dev/mapper/gluster-iso         13G  7.3G  5.8G  56% /gluster/brick4
192.168.8.11:/engine            15G  9.7G  5.4G  65% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data             136G  125G   12G  92% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso               13G  7.3G  5.8G  56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso

As you can see, in the process of rebuilding the hard drive for ovirt2, I
did resize some things to give more space to data, where I desperately need
it.  If this goes well and the storage is given a clean bill of health at
this time, then I will take ovirt1 down and resize to match ovirt2, and
thus score a decent increase in storage for data.  I fully realize that
right now the total size of each gluster-mounted volume is limited to that
of its smallest brick.
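
To put numbers on the difference between the two hosts, the brick rows of
the df output can be compared mechanically.  The following is a rough
sketch only: it assumes "df -h" style figures with K/M/G/T suffixes taken as
powers of 1024, so results are approximate, and it silently skips any row
whose mount point has wrapped onto the next line.  The function names are
made up for this example.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def to_bytes(text):
    """Convert a df -h figure such as '9.7G' or '497M' to bytes (approximate)."""
    text = text.strip()
    if text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(float(text))


def used_by_mount(df_text):
    """Map mount point -> used bytes for every complete six-column df row."""
    used = {}
    for line in df_text.splitlines():
        parts = line.split()
        # The header has seven words and wrapped rows have five, so both
        # fall through this check.
        if len(parts) == 6 and parts[4].endswith("%"):
            used[parts[5]] = to_bytes(parts[2])
    return used


def difference(host_a, host_b):
    """Return {mount: used on A minus used on B} for mounts on both hosts."""
    return {m: host_a[m] - host_b[m] for m in host_a if m in host_b}
```

Feeding it the two listings above gives a positive difference for
/gluster/brick1 and /gluster/brick2 and zero for /gluster/brick4, which is
the same picture as reading the tables by eye.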

So, is this size reduction appropriate?  A big part of me thinks data is
missing, but I even went through and shut down ovirt2's gluster daemons,
wiped all the gluster data, and restarted gluster to allow it a fresh heal
attempt, and it again came back to the exact same size.  This cluster was
originally built about the time ovirt 4.0 came out, and has been upgraded
to 'current', so perhaps some new gluster features are making more
efficient use of space (dedupe or something)?
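
One way to test whether data is actually missing, rather than just stored
more compactly, is to compare the apparent size of the files on a brick
with the space actually allocated to them.  This is only a sketch, and the
idea behind it (that a freshly healed brick might hold sparse copies of the
same files) is a guess, not something confirmed anywhere in this thread.
It assumes Linux, where st_blocks counts 512-byte units, and it skips the
brick's .glusterfs directory on the assumption that it holds hard links to
the same files, which would otherwise be counted twice.

```python
import os


def tree_sizes(root, skip=(".glusterfs",)):
    """Return (apparent_bytes, allocated_bytes) for regular files under root."""
    apparent = 0
    allocated = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in skip]
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size          # what ls -l would report
            allocated += st.st_blocks * 512  # what du would report
    return apparent, allocated
```

If the apparent totals match between two bricks while the allocated totals
differ, the files are the same length and only their on-disk footprint
differs.  If the apparent totals differ as well, something really is
missing or different.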

Thank you for your assistance!
--Jim

On Fri, Aug 4, 2017 at 7:49 PM, Jim Kusznir <jim@palousetech.com> wrote:
...
Hi all:
Today has been rough.  Two of my three nodes went down today, and self
heal has not been healing well.  4 hours later, VMs are running, but the
engine is not happy.  It claims the storage domain is down (even though it
is up on all hosts and VMs are running).  I'm getting a ton of these
messages in the log:
VDSM engine3 command HSMGetAllTasksStatusesVDS failed: Not SPM
Aug 4, 2017 7:23:00 PM
VDSM engine3 command SpmStatusVDS failed: Error validating master storage
domain: ('MD read error',)
Aug 4, 2017 7:22:49 PM
VDSM engine3 command ConnectStoragePoolVDS failed: Cannot find master
domain: u'spUUID=5868392a-0148-02cf-014d-000000000121,
msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
Aug 4, 2017 7:22:47 PM
VDSM engine1 command ConnectStoragePoolVDS failed: Cannot find master
domain: u'spUUID=5868392a-0148-02cf-014d-000000000121,
msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
Aug 4, 2017 7:22:46 PM
VDSM engine2 command SpmStatusVDS failed: Error validating master storage
domain: ('MD read error',)
Aug 4, 2017 7:22:44 PM
VDSM engine2 command ConnectStoragePoolVDS failed: Cannot find master
domain: u'spUUID=5868392a-0148-02cf-014d-000000000121,
msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
Aug 4, 2017 7:22:42 PM
VDSM engine1 command HSMGetAllTasksStatusesVDS failed: Not SPM: ()
------------
I cannot set an SPM as it claims the storage domain is down; I cannot set
the storage domain up.
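
When there are a ton of these events, it can help to tally them by host and
command to see whether one host or one operation dominates.  A small sketch
in Python, assuming the events are pasted as plain text in the same
"VDSM <host> command <name> failed:" form shown above; the function names
are made up for this example.

```python
import re
from collections import Counter

# Matches the start of each event, e.g.
# "VDSM engine3 command SpmStatusVDS failed: ..."
EVENT = re.compile(r"VDSM (\S+) command (\S+) failed:")


def count_failures(text):
    """Count failures per (host, command) pair."""
    return Counter(EVENT.findall(text))


def failures_by_command(text):
    """Count failures per command, regardless of host."""
    return Counter(command for _host, command in EVENT.findall(text))
```

Continuation lines and timestamps are ignored because the pattern only
matches the first line of each event.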
Also in the storage realm, one of my exports shows substantially less data
than is actually there.
Here's what happened, as best I understand it:
I went to do maintenance on ovirt2 (needed to replace a faulty RAM stick and
rework the disk).  I put it in maintenance mode, then shut it down and did my
work.  In the process, much of the disk contents was lost (all the gluster
data).  I figured, no big deal, the gluster data is redundant on the
network, it will heal when it comes back up.
While I was doing maintenance, all but one of the VMs were running on
engine1.  When I turned on engine2, all of a sudden, all VMs including
the main engine stopped and went non-responsive.  As far as I can tell, this
should not have happened, as I turned ON one host, but nonetheless, I
waited for recovery to occur (while customers started calling asking why
everything stopped working....).  As I waited, I was checking, and gluster
volume status only showed ovirt1 and ovirt2....Apparently gluster had
stopped/failed at some point on ovirt3.  I assume that was the cause of the
outage.  Still, if everything was working fine with ovirt1 gluster, and
ovirt2 powers on with a very broken gluster (the volume status was showing
NA for the port fields for the gluster volumes), I would not expect to have
a working gluster go stupid like that.
After starting glusterd on ovirt3 and checking the status, all three showed
ovirt1 and ovirt3 as operational, and ovirt2 as NA.  Unfortunately,
recovery was still not happening, so I did some googling and found out about
the commands to inquire about the hosted-engine status.  It appeared to be
stuck "paused" and I couldn't find a way to unpause it, so I powered it
off, then started it manually on engine1, and the cluster came back up.  It
showed all VMs paused.  I was able to unpause them and they worked again.
So now I began to work the ovirt2 gluster healing problem.  It didn't
appear to be self-healing, but eventually I found this document:
https://support.rackspace.com/how-to/recover-from-a-failed-server-in-a-glusterfs-array/
and from that found the magic xattr commands.  After setting them, the
gluster volumes on ovirt2 came online.  I told iso to heal, and it did, but
it only came up with about half as much data as it should have.  I told it
to heal full, and it did finish off the remaining data, and came up to
full.  I then told engine to do a full heal (gluster volume heal engine
full), and it transferred its data from the other gluster hosts too.
However, it said it was done when it hit 9.7GB while there was 15GB on
disk!  It is still stuck that way; the ovirt gui and gluster volume heal
engine info both show the volume fully healed, but it is not:
[root@ovirt1 ~]# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/centos_ovirt-root   20G  4.2G   16G  21% /
devtmpfs                        16G     0   16G   0% /dev
tmpfs                           16G   16K   16G   1% /dev/shm
tmpfs                           16G   26M   16G   1% /run
tmpfs                           16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/gluster-engine      25G   12G   14G  47% /gluster/brick1
/dev/sda1                      497M  315M  183M  64% /boot
/dev/mapper/gluster-data       136G  124G   13G  92% /gluster/brick2
/dev/mapper/gluster-iso         25G  7.3G   18G  29% /gluster/brick4
tmpfs                          3.2G     0  3.2G   0% /run/user/0
192.168.8.11:/engine            15G  9.7G  5.4G  65% /rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data             136G  124G   13G  92% /rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso               13G  7.3G  5.8G  56% /rhev/data-center/mnt/glusterSD/192.168.8.11:_iso
This is from ovirt1, and before the work, both ovirt1 and ovirt2's bricks
had the same usage.  ovirt2's bricks and the gluster mountpoints agree on
iso and engine, but as you can see, ovirt1's engine brick does not.  If I do
a du -sh on /rhev/data-center/mnt/glusterSD/..../_engine, it comes back with
the 12GB number (brick1 is engine, brick2 is data and brick4 is iso).
However, gluster still says it's only 9.7G.  I haven't figured out how to
get it to finish "healing".
Data is in the process of healing currently.
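
For watching heal progress without eyeballing the output each time, the
per-brick pending counts can be pulled out of "gluster volume heal <volume>
info".  This is a sketch under assumptions: it expects each brick to be
reported as a "Brick ..." line followed by a "Number of entries: N" line,
and it records None when the count is not a number (for example when a
brick is not connected).  The function names are made up for this example,
and heal_info() needs to run as root on a gluster node.

```python
import subprocess


def parse_heal_info(text):
    """Map brick name -> pending heal entries (None if not reported)."""
    pending = {}
    brick = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
        elif line.startswith("Number of entries:") and brick is not None:
            value = line.split(":", 1)[1].strip()
            pending[brick] = int(value) if value.isdigit() else None
    return pending


def heal_info(volume):
    """Run the heal info command for one volume and parse its output."""
    result = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        check=True, capture_output=True, text=True,
    )
    return parse_heal_info(result.stdout)
```

A count of zero on every brick only says gluster has nothing queued to
heal; it says nothing about whether the bricks use the same amount of disk,
which is exactly the mismatch described above.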
So, I think I have two main things to solve right now:
1) How do I get ovirt to see the data center/storage domain as online
again?
2) How do I get engine to finish healing to ovirt2?
Thanks all for reading this very long message!
--Jim