[ovirt-users] Fwd: Re: urgent issue
Ravishankar N
ravishankar at redhat.com
Tue Sep 22 05:38:46 UTC 2015
Hi Chris,
Replies inline..
On 09/22/2015 09:31 AM, Sahina Bose wrote:
>
>
>
> -------- Forwarded Message --------
> Subject: Re: [ovirt-users] urgent issue
> Date: Wed, 9 Sep 2015 08:31:07 -0700
> From: Chris Liebman <chris.l at taboola.com>
> To: users <users at ovirt.org>
>
>
>
> Ok - I think I'm going to switch to local storage - I've had way too
> many unexplainable issues with glusterfs :-(. Is there any reason I
> can't add local storage to the existing shared-storage cluster? I see
> that the menu item is greyed out....
>
>
What version of gluster and ovirt are you using?
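A quick way to gather that from one of the nodes (the grep pattern below is
just an example, not an exact requirement) would be:

    [root at ovirt-node268 ~]# gluster --version
    [root at ovirt-node268 ~]# rpm -qa | grep -E 'glusterfs|vdsm|ovirt'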
>
>
>
> On Tue, Sep 8, 2015 at 4:19 PM, Chris Liebman <chris.l at taboola.com
> <mailto:chris.l at taboola.com>> wrote:
>
> It's possible that this is specific to just one gluster volume...
> I've moved a few VM disks off of that volume and am able to start
> them fine. My recollection is that any VM started on the "bad"
> volume causes it to be disconnected and forces the ovirt node to
> be marked down until Maint->Activate.
>
> On Tue, Sep 8, 2015 at 3:52 PM, Chris Liebman
> <chris.l at taboola.com> wrote:
>
> In attempting to put an ovirt cluster in production I'm
> running into some odd errors, with gluster it looks like. It's
> 12 hosts, each with one brick in distributed-replicate
> (actually 2 bricks, but they are in separate volumes).
>
These 12 nodes in the dist-rep config, are they in replica 2 or replica 3?
The latter is what is recommended for VM use-cases. Could you give the
output of `gluster volume info`?
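For reference, the replica count shows up in the "Number of Bricks" line of
that output. A rough sketch of what a replica 2 distributed-replicate volume
would report (volume name taken from your logs, the brick layout here is
illustrative):

    [root at ovirt-node268 ~]# gluster volume info LADC-TBX-V02
    Volume Name: LADC-TBX-V02
    Type: Distributed-Replicate
    Status: Started
    Number of Bricks: 6 x 2 = 12
    ...

"6 x 2" would mean replica 2; a "4 x 3 = 12" style layout is the replica 3
arrangement recommended for VM storage.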
>
> [root at ovirt-node268 glusterfs]# rpm -qa | grep vdsm
>
> vdsm-jsonrpc-4.16.20-0.el6.noarch
>
> vdsm-gluster-4.16.20-0.el6.noarch
>
> vdsm-xmlrpc-4.16.20-0.el6.noarch
>
> vdsm-yajsonrpc-4.16.20-0.el6.noarch
>
> vdsm-4.16.20-0.el6.x86_64
>
> vdsm-python-zombiereaper-4.16.20-0.el6.noarch
>
> vdsm-python-4.16.20-0.el6.noarch
>
> vdsm-cli-4.16.20-0.el6.noarch
>
>
> Everything was fine last week, however, today various
> clients in the gluster cluster seem to get "client quorum not
> met" periodically - when they get this they take one of the
> bricks offline - this causes VMs to be migrated -
> sometimes 20 at a time. That takes a long time :-(. I've
> tried disabling automatic migration and the VMs get paused
> when this happens - resuming does nothing at that point as the
> volume's mount on the server hosting the VM is not connected:
>
>
> from
> rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log:
>
> [2015-09-08 21:18:42.920771] W [MSGID: 108001]
> [afr-common.c:4043:afr_notify] 2-LADC-TBX-V02-replicate-2:
> Client-quorum is not met
>
When client-quorum is not met (due to network disconnects, gluster
brick processes going down, etc.), gluster makes the volume read-only.
This is expected behavior and prevents split-brains. It's probably a bit
late, but do you have the gluster fuse mount logs to confirm this
indeed was the issue?
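If you still have access to the volume, it is also worth checking how quorum
is configured on it. Something along these lines (volume name assumed from
your logs, the two output lines are only an example of what might be set)
should list any quorum options:

    [root at ovirt-node268 ~]# gluster volume info LADC-TBX-V02 | grep -i quorum
    cluster.quorum-type: auto
    cluster.server-quorum-type: server

With cluster.quorum-type set to auto on a replica 2 volume, losing the first
brick of a replica pair is enough to drop client-quorum, which would match
the behavior you are describing.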
> [2015-09-08 21:18:42.931751] I
> [fuse-bridge.c:4900:fuse_thread_proc] 0-fuse: unmounting
> /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02
>
> [2015-09-08 21:18:42.931836] W
> [glusterfsd.c:1219:cleanup_and_exit]
> (-->/lib64/libpthread.so.0(+0x7a51) [0x7f1bebc84a51]
> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d]
> -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received signum (15), shutting down
>
> [2015-09-08 21:18:42.931858] I [fuse-bridge.c:5595:fini]
> 0-fuse: Unmounting
> '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02'.
>
The VM pause you saw could be because of the unmount. I understand that a
fix (https://gerrit.ovirt.org/#/c/40240/) went in for oVirt 3.6
(vdsm-4.17) to prevent vdsm from unmounting the gluster volume when vdsm
exits/restarts.
Is it possible to run a test setup on 3.6 and see if this is still
happening?
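(To check whether a given host already carries that fix, the vdsm package
version should be enough of an indicator; assuming the patch landed in
vdsm-4.17 as noted above, anything older will still unmount on exit:)

    [root at ovirt-node268 ~]# rpm -q vdsm
    vdsm-4.16.20-0.el6.x86_64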
>
> And the mount is broken at that point:
>
> [root at ovirt-node267 ~]# df
>
> df: `/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02':
> Transport endpoint is not connected
>
Yes, because it received a SIGTERM above.
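As an aside: once a fuse mount is left in that "Transport endpoint is not
connected" state, it has to be cleaned up, either via the Maintenance ->
Activate cycle you described or by hand. The manual equivalent is roughly a
lazy unmount followed by re-activating the host so vdsm re-mounts the
storage domain:

    [root at ovirt-node267 ~]# umount -l \
      /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02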
Thanks,
Ravi
>
> Filesystem                                             1K-blocks       Used   Available Use% Mounted on
> /dev/sda3                                               51475068    1968452    46885176   5% /
> tmpfs                                                  132210244          0   132210244   0% /dev/shm
> /dev/sda2                                                 487652      32409      429643   8% /boot
> /dev/sda1                                                 204580        260      204320   1% /boot/efi
> /dev/sda5                                             1849960960  156714056  1599267616   9% /data1
> /dev/sdb1                                             1902274676   18714468  1786923588   2% /data2
> ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V01
>                                                       9249804800  727008640  8052899712   9% /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V01
> ovirt-node251.la.taboolasyndication.com:/LADC-TBX-V03
>                                                       1849960960      73728  1755907968   1% /rhev/data-center/mnt/glusterSD/ovirt-node251.la.taboolasyndication.com:_LADC-TBX-V03
>
> The fix for that is to put the server in maintenance mode then
> activate it again. But all VMs need to be migrated or stopped
> for that to work.
>
>
> I'm not seeing any obvious network or disk errors...
>
> Are there configuration options I'm missing?
>
>
>
>
>