[root@ovirt-node268 glusterfs]# rpm -qa | grep vdsm
vdsm-jsonrpc-4.16.20-0.el6.noarch
vdsm-gluster-4.16.20-0.el6.noarch
vdsm-xmlrpc-4.16.20-0.el6.noarch
vdsm-yajsonrpc-4.16.20-0.el6.noarch
vdsm-4.16.20-0.el6.x86_64
vdsm-python-zombiereaper-4.16.20-0.el6.noarch
vdsm-python-4.16.20-0.el6.noarch
vdsm-cli-4.16.20-0.el6.noarch
Everything was fine last week, but today various clients in the gluster cluster periodically report "client quorum not met". When that happens they take one of the bricks offline, which triggers attempted migrations of the VMs - sometimes 20 at a time - and that takes a long time :-(. I've tried disabling automatic migration, but then the VMs just get paused when this happens, and resuming does nothing at that point because the volume mount on the server hosting the VMs is no longer connected:
from rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log:
[2015-09-08 21:18:42.920771] W [MSGID: 108001] [afr-common.c:4043:afr_notify] 2-LADC-TBX-V02-replicate-2: Client-quorum is not met
[2015-09-08 21:18:42.931751] I [fuse-bridge.c:4900:fuse_thread_proc] 0-fuse: unmounting /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02
[2015-09-08 21:18:42.931836] W [glusterfsd.c:1219:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7a51) [0x7f1bebc84a51] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received signum (15), shutting down
[2015-09-08 21:18:42.931858] I [fuse-bridge.c:5595:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02'.
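For reference, the client-quorum behaviour above is controlled by the volume's quorum options, which can be checked on any of the gluster nodes - something like this, with the volume name taken from the log ("gluster volume get" needs a fairly recent gluster; otherwise "gluster volume info" lists any non-default options under "Options Reconfigured"):
[root@ovirt-node268 ~]# gluster volume info LADC-TBX-V02
[root@ovirt-node268 ~]# gluster volume get LADC-TBX-V02 cluster.quorum-type
[root@ovirt-node268 ~]# gluster volume get LADC-TBX-V02 cluster.server-quorum-type
With cluster.quorum-type set to auto, a replica set starts rejecting writes roughly once it can no longer see more than half of its bricks, which is what the warning above indicates.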
And the mount is broken at that point:
[root@ovirt-node267 ~]# df
df: `/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02': Transport endpoint is not connected
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda3 51475068 1968452 46885176 5% /
tmpfs 132210244 0 132210244 0% /dev/shm
/dev/sda2 487652 32409 429643 8% /boot
/dev/sda1 204580 260 204320 1% /boot/efi
/dev/sda5 1849960960 156714056 1599267616 9% /data1
/dev/sdb1 1902274676 18714468 1786923588 2% /data2
ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V01
9249804800 727008640 8052899712 9% /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V01
ovirt-node251.la.taboolasyndication.com:/LADC-TBX-V03
1849960960 73728 1755907968 1% /rhev/data-center/mnt/glusterSD/ovirt-node251.la.taboolasyndication.com:_LADC-TBX-V03
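When a brick gets taken offline like this, it's probably worth checking what the gluster servers themselves think is up, e.g. (run on any of the nodes; volume name as in the logs):
[root@ovirt-node268 ~]# gluster volume status LADC-TBX-V02
[root@ovirt-node268 ~]# gluster peer status
If all bricks show online there while the clients keep losing quorum, that would point at the network between the clients and the bricks rather than at the brick processes themselves.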
The fix for the broken mount is to put the server in maintenance mode and then activate it again - but all VMs on it need to be migrated or stopped for that to work.
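In principle the stale mount could also be cleared by hand instead - a rough sketch, untested with oVirt in the loop (paths as in the df output above; note vdsm normally mounts with its own options such as backup volfile servers, so maintenance/activate remains the safer route):
[root@ovirt-node267 ~]# umount -l /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02
[root@ovirt-node267 ~]# mount -t glusterfs ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V02 /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02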