<div dir="ltr">In attempting to put an ovirt cluster in production I'm running into some off errors with gluster it looks like. Its 12 hosts each with one brick in distributed-replicate. (actually 2 bricks but they are separate volumes)<div><br></div><div>
<p class=""><span class="">[root@ovirt-node268 glusterfs]# rpm -qa | grep vdsm</span></p>
<p class=""><span class="">vdsm-jsonrpc-4.16.20-0.el6.noarch</span></p>
<p class=""><span class="">vdsm-gluster-4.16.20-0.el6.noarch</span></p>
<p class=""><span class="">vdsm-xmlrpc-4.16.20-0.el6.noarch</span></p>
<p class=""><span class="">vdsm-yajsonrpc-4.16.20-0.el6.noarch</span></p>
<p class=""><span class="">vdsm-4.16.20-0.el6.x86_64</span></p>
<p class=""><span class="">vdsm-python-zombiereaper-4.16.20-0.el6.noarch</span></p>
<p class=""><span class="">vdsm-python-4.16.20-0.el6.noarch</span></p>
<p class=""><span class="">vdsm-cli-4.16.20-0.el6.noarch</span></p><p class=""><br></p><p class=""> Everything was fine last week, however, today various clients in the gluster cluster seem get "client quorum not met" periodically - when they get this they take one of the bricks offline - this causes VM's to be attempted to move - sometimes 20 at a time. That takes a long time :-(. I've tried disabling automatic migration and teh VM's get paused when this happens - resuming gets nothing at that point as the volumes mount on the server hosting the VM is not connected:</p><div><br></div><div><p class="">from rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log:</p><p class=""><span class="">[2015-09-08 21:18:42.920771] W [MSGID: 108001] [afr-common.c:4043:afr_notify] 2-LADC-TBX-V02-replicate-2: Client-quorum is </span><span class="">not met</span></p><p class=""><span class="">[2015-09-08 21:18:42.931751] I [fuse-bridge.c:4900:fuse_thread_proc] 0-fuse: unmounting /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02</span></p><p class=""><span class="">[2015-09-08 21:18:42.931836] W [glusterfsd.c:1219:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7a51) [0x7f1bebc84a51] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x</span></p><p class=""><span class="">65) [0x4059b5] ) 0-: received signum (15), shutting down</span></p><p class=""><span class="">[2015-09-08 21:18:42.931858] I [fuse-bridge.c:5595:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02'.</span></p><p class=""><span class=""><br></span></p><p class=""><span class="">And the mount is broken at that point:</span></p></div><div><p class=""><span class="">[root@ovirt-node267 ~]# df</span></p><p class=""><span class=""><font color="#ff0000"><b>df: `/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02': Transport endpoint is not connected</b></font></span></p><p class=""><span class="">Filesystem 1K-blocks Used Available Use% Mounted on</span></p><p class=""><span class="">/dev/sda3 51475068 1968452 46885176 5% /</span></p><p class=""><span class="">tmpfs 132210244 0 132210244 0% /dev/shm</span></p><p class=""><span class="">/dev/sda2 487652 32409 429643 8% /boot</span></p><p class=""><span class="">/dev/sda1 204580 260 204320 1% /boot/efi</span></p><p class=""><span class="">/dev/sda5 1849960960 156714056 1599267616 9% /data1</span></p><p class=""><span class="">/dev/sdb1 1902274676 18714468 1786923588 2% /data2</span></p><p class=""><span class="">ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V01</span></p><p class=""><span class=""> 9249804800 727008640 8052899712 9% /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V01</span></p><p class=""><span class="">ovirt-node251.la.taboolasyndication.com:/LADC-TBX-V03</span></p><p class=""><span class=""> 1849960960 73728 1755907968 1% /rhev/data-center/mnt/glusterSD/ovirt-node251.la.taboolasyndication.com:_LADC-TBX-V03</span></p><p class="">The fix for that is to put the server in maintenance mode then activate it again. But all VM's need to be migrated or stopped for that to work.</p></div><div><br></div><div>I'm not seeing any obvious network or disk errors...... </div></div><div><br></div><div>Are their configuration options I'm missing?</div><div><br></div></div>