[ovirt-users] urgent issue

Chris Liebman chris.l at taboola.com
Tue Sep 8 22:52:07 UTC 2015


In attempting to put an oVirt cluster into production I'm running into some
odd errors that appear to be coming from Gluster.  It's 12 hosts, each with one
brick in a distributed-replicate volume (actually 2 bricks per host, but they
belong to separate volumes).

[root@ovirt-node268 glusterfs]# rpm -qa | grep vdsm
vdsm-jsonrpc-4.16.20-0.el6.noarch
vdsm-gluster-4.16.20-0.el6.noarch
vdsm-xmlrpc-4.16.20-0.el6.noarch
vdsm-yajsonrpc-4.16.20-0.el6.noarch
vdsm-4.16.20-0.el6.x86_64
vdsm-python-zombiereaper-4.16.20-0.el6.noarch
vdsm-python-4.16.20-0.el6.noarch
vdsm-cli-4.16.20-0.el6.noarch

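(If it helps, the volume layout described above, plus whatever options are
explicitly set on it, can be pulled with something like the following - I'm
using LADC-TBX-V02 as the example since that's the volume in the logs below:)

gluster volume info LADC-TBX-V02
gluster volume status LADC-TBX-V02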

   Everything was fine last week, however, today various clients in the
gluster cluster periodically report "client quorum not met".  When that
happens they take one of the bricks offline, which causes the engine to try
to migrate the affected VMs - sometimes 20 at a time - and that takes a long
time :-(.  I've tried disabling automatic migration, but then the VMs just get
paused when this happens, and resuming does nothing at that point because the
volume mount on the server hosting the VM is no longer connected:

from rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log:

[2015-09-08 21:18:42.920771] W [MSGID: 108001] [afr-common.c:4043:afr_notify] 2-LADC-TBX-V02-replicate-2: Client-quorum is not met

[2015-09-08 21:18:42.931751] I [fuse-bridge.c:4900:fuse_thread_proc] 0-fuse: unmounting /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02

[2015-09-08 21:18:42.931836] W [glusterfsd.c:1219:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7a51) [0x7f1bebc84a51] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d] -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] ) 0-: received signum (15), shutting down

[2015-09-08 21:18:42.931858] I [fuse-bridge.c:5595:fini] 0-fuse: Unmounting '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02'.
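(I still need to go back through the full mount log to see which brick
connections dropped right before quorum was lost - my plan is just a rough
grep along these lines, assuming the mount log lives in the standard
/var/log/glusterfs location:)

grep -iE 'disconnect|quorum' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log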


And the mount is broken at that point:

[root@ovirt-node267 ~]# df
df: `/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02': Transport endpoint is not connected

Filesystem            1K-blocks      Used  Available Use% Mounted on
/dev/sda3              51475068   1968452   46885176   5% /
tmpfs                 132210244         0  132210244   0% /dev/shm
/dev/sda2                487652     32409     429643   8% /boot
/dev/sda1                204580       260     204320   1% /boot/efi
/dev/sda5            1849960960 156714056 1599267616   9% /data1
/dev/sdb1            1902274676  18714468 1786923588   2% /data2
ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V01
                     9249804800 727008640 8052899712   9% /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V01
ovirt-node251.la.taboolasyndication.com:/LADC-TBX-V03
                     1849960960     73728 1755907968   1% /rhev/data-center/mnt/glusterSD/ovirt-node251.la.taboolasyndication.com:_LADC-TBX-V03

The fix for that is to put the server into maintenance mode and then activate
it again, but all VMs need to be migrated or stopped for that to work.
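
(I've been wondering whether, as a lighter-weight workaround, just lazily
unmounting the stale mount point and mounting it again by hand would be
enough - something roughly like the below.  Completely untested on my side,
and I don't know whether vdsm tolerates the mount being recycled underneath
it:)

umount -l '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02'
mount -t glusterfs ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V02 '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02'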

I'm not seeing any obvious network or disk errors...

Are there configuration options I'm missing?
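For example, are these the sort of quorum / timeout knobs I should be looking
at?  (Just guesses on my part - I haven't changed any of them yet, and I'm not
sure what the right values are for this layout.)

gluster volume set LADC-TBX-V02 cluster.quorum-type auto
gluster volume set LADC-TBX-V02 cluster.server-quorum-type server
gluster volume set LADC-TBX-V02 network.ping-timeout 30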