<div dir="ltr">It's possible that this is specific to just one gluster volume... I've moved a few VM disks off of that volume and am able to start them fine. My recollection is that any VM started on the "bad" volume causes the volume to be disconnected and forces the ovirt node to be marked down until Maint->Activate.</div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Sep 8, 2015 at 3:52 PM, Chris Liebman <span dir="ltr"><<a href="mailto:chris.l@taboola.com" target="_blank">chris.l@taboola.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">In attempting to put an ovirt cluster in production I'm running into some odd errors, apparently with gluster. It's 12 hosts, each with one brick in distributed-replicate (actually 2 bricks, but they are in separate volumes).<div><br></div><div>
<p><span>[root@ovirt-node268 glusterfs]# rpm -qa | grep vdsm</span></p>
<p><span>vdsm-jsonrpc-4.16.20-0.el6.noarch</span></p>
<p><span>vdsm-gluster-4.16.20-0.el6.noarch</span></p>
<p><span>vdsm-xmlrpc-4.16.20-0.el6.noarch</span></p>
<p><span>vdsm-yajsonrpc-4.16.20-0.el6.noarch</span></p>
<p><span>vdsm-4.16.20-0.el6.x86_64</span></p>
<p><span>vdsm-python-zombiereaper-4.16.20-0.el6.noarch</span></p>
<p><span>vdsm-python-4.16.20-0.el6.noarch</span></p>
<p><span>vdsm-cli-4.16.20-0.el6.noarch</span></p><p><br></p><p> Everything was fine last week; however, today various clients in the gluster cluster seem to get "client quorum not met" periodically. When they get this they take one of the bricks offline, which triggers attempts to migrate VMs, sometimes 20 at a time. That takes a long time :-(. I've tried disabling automatic migration, and the VMs get paused when this happens. Resuming does nothing at that point, as the volume's mount on the server hosting the VM is not connected:</p><div><br></div><div><p>from rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log:</p><p><span>[2015-09-08 21:18:42.920771] W [MSGID: 108001] [afr-common.c:4043:afr_notify] 2-LADC-TBX-V02-replicate-2: Client-quorum is </span><span>not met</span></p><p><span>[2015-09-08 21:18:42.931751] I [fuse-bridge.c:4900:fuse_thread_proc] 0-fuse: unmounting /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02</span></p><p><span>[2015-09-08 21:18:42.931836] W [glusterfsd.c:1219:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7a51) [0x7f1bebc84a51] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received signum (15), shutting down</span></p><p><span>[2015-09-08 21:18:42.931858] I [fuse-bridge.c:5595:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02'.</span></p><p><span><br></span></p><p><span>And the mount is broken at that point:</span></p></div><div><p><span>[root@ovirt-node267 ~]# df</span></p><p><span><font color="#ff0000"><b>df: `/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02': Transport endpoint is not connected</b></font></span></p><p><span>Filesystem 1K-blocks Used Available Use% Mounted on</span></p><p><span>/dev/sda3 51475068 1968452 46885176 5% /</span></p><p><span>tmpfs 132210244 
0 132210244 0% /dev/shm</span></p><p><span>/dev/sda2 487652 32409 429643 8% /boot</span></p><p><span>/dev/sda1 204580 260 204320 1% /boot/efi</span></p><p><span>/dev/sda5 1849960960 156714056 1599267616 9% /data1</span></p><p><span>/dev/sdb1 1902274676 18714468 1786923588 2% /data2</span></p><p><span>ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V01</span></p><p><span> 9249804800 727008640 8052899712 9% /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V01</span></p><p><span>ovirt-node251.la.taboolasyndication.com:/LADC-TBX-V03</span></p><p><span> 1849960960 73728 1755907968 1% /rhev/data-center/mnt/glusterSD/ovirt-node251.la.taboolasyndication.com:_LADC-TBX-V03</span></p><p>The fix for that is to put the server in maintenance mode and then activate it again. But all VMs need to be migrated or stopped for that to work.</p></div><div><br></div><div>I'm not seeing any obvious network or disk errors... </div></div><div><br></div><div>Are there configuration options I'm missing?</div><div><br></div></div>
</blockquote></div><br></div>