[ovirt-users] Fwd: Re: urgent issue

Tue Sep 22 13:24:33 UTC 2015

Sorry - it's too late - all hosts have been re-imaged and are set up as local
storage.

On Mon, Sep 21, 2015 at 10:38 PM, Ravishankar N <ravishankar at redhat.com>
wrote:

> Hi Chris,
>
> Replies inline..
>
> On 09/22/2015 09:31 AM, Sahina Bose wrote:
>
>
>
>
> -------- Forwarded Message --------
> Subject: Re: [ovirt-users] urgent issue
> Date: Wed, 9 Sep 2015 08:31:07 -0700
> From: Chris Liebman <chris.l at taboola.com>
> To: users <users at ovirt.org>
>
> Ok - I think I'm going to switch to local storage - I've had way too many
> unexplainable issues with glusterfs :-(.  Is there any reason I can't add
> local storage to the existing shared-storage cluster?  I see that the menu
> item is greyed out....
>
>
>
> What version of gluster and ovirt are you using?
>
>
>
>
> On Tue, Sep 8, 2015 at 4:19 PM, Chris Liebman <chris.l at taboola.com> wrote:
>
>> It's possible that this is specific to just one gluster volume...  I've
>> moved a few VM disks off of that volume and am able to start them fine.
>> My recollection is that any VM started on the "bad" volume causes it to be
>> disconnected and forces the ovirt node to be marked down until
>> Maint->Activate.
>>
>> On Tue, Sep 8, 2015 at 3:52 PM, Chris Liebman <chris.l at taboola.com>
>> wrote:
>>
>>> In attempting to put an ovirt cluster in production I'm running into
>>> some odd errors with gluster, it looks like.  It's 12 hosts, each with one
>>> brick in distributed-replicate (actually 2 bricks, but they are separate
>>> volumes).
>>>
>>>
> These 12 nodes in dist-rep config, are they in replica 2 or replica 3? The
> latter is what is recommended for VM use-cases. Could you give the output
> of `gluster volume info`?
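
The replica count asked about above shows up in the "Number of Bricks" line of `gluster volume info`. Below is a minimal Python sketch for pulling it out, assuming that line has the usual `N x R = T` form (or `N x (D + A) = T` when an arbiter is configured). The sample text is hypothetical, not output from this cluster, and `replica_layout` is an illustrative helper name.

```python
import re

# Matches e.g. "Number of Bricks: 6 x 2 = 12"
# or, with an arbiter, "Number of Bricks: 1 x (2 + 1) = 3".
_BRICKS_RE = re.compile(
    r"^Number of Bricks:\s*(\d+)\s*x\s*"
    r"(?:\((\d+)\s*\+\s*(\d+)\)|(\d+))\s*=\s*(\d+)\s*$",
    re.MULTILINE,
)


def replica_layout(volume_info):
    """Return the brick layout of a replicated volume, or None if the
    'Number of Bricks' line is absent or not in the replicated form."""
    m = _BRICKS_RE.search(volume_info)
    if m is None:
        return None
    if m.group(4) is not None:
        data, arbiter = int(m.group(4)), 0        # plain "N x R = T"
    else:
        data, arbiter = int(m.group(2)), int(m.group(3))  # "N x (D + A) = T"
    return {
        "subvolumes": int(m.group(1)),
        "replica": data + arbiter,
        "arbiter": arbiter,
        "total": int(m.group(5)),
    }


# Hypothetical sample, shaped like a 12-brick replica-2 volume.
SAMPLE = """Volume Name: example-vol
Type: Distributed-Replicate
Status: Started
Number of Bricks: 6 x 2 = 12
"""

if __name__ == "__main__":
    print(replica_layout(SAMPLE))
```

A result with `"replica": 2` would correspond to the replica 2 case in the question above, and `"replica": 3` to the recommended one.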
>
>>> [root at ovirt-node268 glusterfs]# rpm -qa | grep vdsm
>>>
>>> vdsm-jsonrpc-4.16.20-0.el6.noarch
>>>
>>> vdsm-gluster-4.16.20-0.el6.noarch
>>>
>>> vdsm-xmlrpc-4.16.20-0.el6.noarch
>>>
>>> vdsm-yajsonrpc-4.16.20-0.el6.noarch
>>>
>>> vdsm-4.16.20-0.el6.x86_64
>>>
>>> vdsm-python-zombiereaper-4.16.20-0.el6.noarch
>>>
>>> vdsm-python-4.16.20-0.el6.noarch
>>>
>>> vdsm-cli-4.16.20-0.el6.noarch
>>>
>>>
>>> Everything was fine last week; however, today various clients in
>>> the gluster cluster seem to get "client quorum not met" periodically - when
>>> they get this they take one of the bricks offline - this triggers attempts
>>> to migrate the VMs - sometimes 20 at a time.  That takes a long time :-(.
>>> I've tried disabling automatic migration and the VMs get paused when this
>>> happens - resuming does nothing at that point, as the volume's mount on the
>>> server hosting the VM is not connected:
>>>
>>> from
>>> rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log:
>>>
>>> [2015-09-08 21:18:42.920771] W [MSGID: 108001]
>>> [afr-common.c:4043:afr_notify] 2-LADC-TBX-V02-replicate-2: Client-quorum
>>> is not met
>>>
>>
> When client-quorum is not met (due to network disconnects, gluster
> brick processes going down, etc.), gluster makes the volume read-only. This
> is expected behavior and prevents split-brains. It's probably a bit late,
> but do you have the gluster fuse mount logs to confirm this was indeed the
> issue?
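
The client-quorum rule described above can be sketched roughly as follows. This assumes gluster's documented `cluster.quorum-type` semantics: with `auto`, more than half of the bricks in a replica set must be up, or exactly half if that half includes the first brick; with `fixed`, at least `cluster.quorum-count` bricks must be up. `client_quorum_met` is an illustrative name, not a gluster API.

```python
def client_quorum_met(bricks_up, quorum_type="auto", quorum_count=None):
    """Decide whether a single replica set still has client quorum.

    bricks_up: one bool per brick in the replica set, in volume order
               (index 0 is the "first brick").
    """
    total = len(bricks_up)
    up = sum(1 for b in bricks_up if b)

    if quorum_type == "fixed":
        if quorum_count is None:
            raise ValueError("quorum_count is required for quorum_type='fixed'")
        return up >= quorum_count

    if quorum_type == "auto":
        if 2 * up > total:
            return True            # strict majority of bricks is up
        # Even replica count with exactly half up: the first brick breaks the tie.
        return 2 * up == total and bool(bricks_up[0])

    raise ValueError("unsupported quorum_type: %r" % (quorum_type,))


if __name__ == "__main__":
    # Replica 2: losing the second brick keeps quorum, losing the first does not.
    print(client_quorum_met([True, False]))
    print(client_quorum_met([False, True]))
    # Replica 3: any single brick may be down.
    print(client_quorum_met([False, True, True]))
```

Under these assumptions a replica 2 set loses quorum as soon as its first brick is unreachable, while a replica 3 set tolerates the loss of any one brick.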
>
>>> [2015-09-08 21:18:42.931751] I [fuse-bridge.c:4900:fuse_thread_proc]
>>> 0-fuse: unmounting
>>> /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:
>>> _LADC-TBX-V02
>>>
>>> [2015-09-08 21:18:42.931836] W [glusterfsd.c:1219:cleanup_and_exit]
>>> (-->/lib64/libpthread.so.0(+0x7a51) [0x7f1bebc84a51]
>>> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d]
>>> -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received
>>> signum (15), shutting down
>>>
>>> [2015-09-08 21:18:42.931858] I [fuse-bridge.c:5595:fini] 0-fuse:
>>> Unmounting
>>> '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:
>>> _LADC-TBX-V02'.
>>>
>>
> The VM pause you saw could be because of the unmount. I understand that a
> fix (https://gerrit.ovirt.org/#/c/40240/) went in for oVirt 3.6
> (vdsm-4.17) to prevent vdsm from unmounting the gluster volume when vdsm
> exits/restarts.
> Is it possible to run a test setup on 3.6 and see if this is still
> happening?
>
>
>>> And the mount is broken at that point:
>>>
>>> [root at ovirt-node267 ~]# df
>>>
>>> *df:
>>> `/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02':
>>> Transport endpoint is not connected*
>>>
>>
> Yes because it received a SIGTERM above.
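
"Transport endpoint is not connected" is the Linux error text for `ENOTCONN`, which is what a FUSE mount point returns once its client process has gone away. A minimal Python sketch for spotting that state on a host follows, assuming a Unix system; `mount_state` is an illustrative helper name.

```python
import errno
import os


def mount_state(path):
    """Classify a mount point as 'ok', 'disconnected' or 'missing'.

    'disconnected' means statvfs() failed with ENOTCONN, the error that
    df reports as "Transport endpoint is not connected".
    """
    try:
        os.statvfs(path)
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            return "disconnected"
        if exc.errno == errno.ENOENT:
            return "missing"
        raise  # anything else (e.g. EACCES) is left for the caller
    return "ok"


if __name__ == "__main__":
    # Any path can be passed; "/" is used here only so the sketch runs anywhere.
    print(mount_state("/"))
```

Called on a gluster mount directory under `/rhev/data-center/mnt/glusterSD/`, a result of `"disconnected"` would match the broken state shown in the `df` error above.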
>
> Thanks,
> Ravi
>
>>> Filesystem            1K-blocks      Used  Available Use% Mounted on
>>> /dev/sda3              51475068   1968452   46885176   5% /
>>> tmpfs                 132210244         0  132210244   0% /dev/shm
>>> /dev/sda2                487652     32409     429643   8% /boot
>>> /dev/sda1                204580       260     204320   1% /boot/efi
>>> /dev/sda5            1849960960 156714056 1599267616   9% /data1
>>> /dev/sdb1            1902274676  18714468 1786923588   2% /data2
>>> ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V01
>>>                      9249804800 727008640 8052899712   9% /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V01
>>> ovirt-node251.la.taboolasyndication.com:/LADC-TBX-V03
>>>                      1849960960     73728 1755907968   1% /rhev/data-center/mnt/glusterSD/ovirt-node251.la.taboolasyndication.com:_LADC-TBX-V03
>>>
>>> The fix for that is to put the server in maintenance mode then activate
>>> it again. But all VMs need to be migrated or stopped for that to work.
>>>
>>> I'm not seeing any obvious network or disk errors......
>>>
>>> Are there configuration options I'm missing?
>>>
>>>
>>
>
>
>
>