[Users] Stopping glusterfsd service shuts down data center

Hi all,

I'm testing oVirt + GlusterFS with only two nodes doing everything (engine, glusterfs, hypervisors), on CentOS 6.5 hosts, following the guides at:

http://community.redhat.com/blog/2013/09/up-and-running-with-ovirt-3-3/
http://www.gluster.org/2013/09/ovirt-3-3-glusterized/

but with a few changes: on GlusterFS I set the parameter cluster.server-quorum-ratio to 50% (to prevent the volume from going down when one node goes down), and in /etc/glusterfs/glusterd.vol I added "option base-port 50152" (to avoid a port conflict with libvirt). A sketch of both changes is in the P.S. at the end of this message.

With these settings I am able to stop/reboot the node that is not used to directly mount GlusterFS (e.g. lovhm002). But when I stop/reboot the node that is used for the mount (e.g. lovhm001), the whole data center goes down; this happens in particular when I stop the glusterfsd service (not the glusterd service!). The GlusterFS volume stays alive and reachable on the surviving node lovhm002, yet oVirt/libvirt marks the DC/storage as in error.

Do you have any idea how to configure the DC/cluster in oVirt so that it stays up when the node used to mount GlusterFS goes down?

Here is a sample from the vdsm log on the node that remains online (lovhm002), taken when I stopped the glusterd and glusterfsd services on lovhm001:

Thread-294::DEBUG::2014-01-05 19:12:32,475::task::1168::TaskManager.Task::(prepare) Task=`a003cde0-a11a-489e-94c2-611f3d096a81`::finished: {'info': {'spm_id': 2, 'master_uuid': '85ad5f7d-3b67-4618-a871-f9ec886020a4', 'name': 'PROD', 'version': '3', 'domains': '85ad5f7d-3b67-4618-a871-f9ec886020a4:Active,6a9b4fa6-f393-4036-bd4e-0bc9dccb1594:Active', 'pool_status': 'connected', 'isoprefix': '/rhev/data-center/mnt/lovhm001.fabber.it:_var_lib_exports_iso/6a9b4fa6-f393-4036-bd4e-0bc9dccb1594/images/11111111-1111-1111-1111-111111111111', 'type': 'GLUSTERFS', 'master_ver': 2, 'lver': 3}, 'dominfo': {'85ad5f7d-3b67-4618-a871-f9ec886020a4': {'status': 'Active', 'diskfree': '374350675968', 'alerts': [], 'version': 3, 'disktotal': '375626137600'}, '6a9b4fa6-f393-4036-bd4e-0bc9dccb1594': {'status': 'Active', 'diskfree': '45249200128', 'alerts': [], 'version': 0, 'disktotal': '51604619264'}}}
Thread-294::DEBUG::2014-01-05 19:12:32,476::task::579::TaskManager.Task::(_updateState) Task=`a003cde0-a11a-489e-94c2-611f3d096a81`::moving from state preparing -> state finished
Thread-294::DEBUG::2014-01-05 19:12:32,476::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.2eceb484-73e0-464a-965b-69f067918080': < ResourceRef 'Storage.2eceb484-73e0-464a-965b-69f067918080', isValid: 'True' obj: 'None'>}
Thread-294::DEBUG::2014-01-05 19:12:32,476::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-294::DEBUG::2014-01-05 19:12:32,476::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.2eceb484-73e0-464a-965b-69f067918080'
Thread-294::DEBUG::2014-01-05 19:12:32,476::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.2eceb484-73e0-464a-965b-69f067918080' (0 active users)
Thread-294::DEBUG::2014-01-05 19:12:32,476::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.2eceb484-73e0-464a-965b-69f067918080' is free, finding out if anyone is waiting for it.
Thread-294::DEBUG::2014-01-05 19:12:32,476::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.2eceb484-73e0-464a-965b-69f067918080', Clearing records.
Thread-294::DEBUG::2014-01-05 19:12:32,476::task::974::TaskManager.Task::(_decref) Task=`a003cde0-a11a-489e-94c2-611f3d096a81`::ref 0 aborting False
Thread-296::DEBUG::2014-01-05 19:12:36,070::BindingXMLRPC::974::vds::(wrapper) client [5.39.66.85]::call volumesList with () {} flowID [76f294ea]
Thread-296::DEBUG::2014-01-05 19:12:36,079::BindingXMLRPC::981::vds::(wrapper) return volumesList with {'status': {'message': 'Done', 'code': 0}, 'volumes': {'vmdata': {'transportType': ['TCP'], 'uuid': 'e9b05f7a-f392-44f3-9d44-04761c36437d', 'bricks': ['lovhm001.fabber.it:/vmdata', 'lovhm002.fabber.it:/vmdata'], 'volumeName': 'vmdata', 'volumeType': 'REPLICATE', 'replicaCount': '2', 'brickCount': '2', 'distCount': '2', 'volumeStatus': 'ONLINE', 'stripeCount': '1', 'options': {'cluster.server-quorum-type': 'server', 'cluster.eager-lock': 'enable', 'performance.stat-prefetch': 'off', 'auth.allow': '*', 'cluster.quorum-type': 'auto', 'performance.quick-read': 'off', 'network.remote-dio': 'enable', 'nfs.disable': 'on', 'performance.io-cache': 'off', 'server.allow-insecure': 'on', 'storage.owner-uid': '36', 'user.cifs': 'disable', 'performance.read-ahead': 'off', 'storage.owner-gid': '36', 'cluster.server-quorum-ratio': '50%'}}}}
libvirtEventLoop::INFO::2014-01-05 19:12:38,856::vm::4266::vm.Vm::(_onAbnormalStop) vmId=`88997598-1db3-478a-bbe2-a7d234cfdc77`::abnormal vm stop device virtio-disk0 error eother
libvirtEventLoop::DEBUG::2014-01-05 19:12:38,856::vm::4840::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`88997598-1db3-478a-bbe2-a7d234cfdc77`::event Suspended detail 2 opaque None
Thread-28::DEBUG::2014-01-05 19:12:40,996::fileSD::239::Storage.Misc.excCmd::(getReadDelay) '/bin/dd iflag=direct if=/rhev/data-center/mnt/lovhm001.fabber.it:_var_lib_exports_iso/6a9b4fa6-f393-4036-bd4e-0bc9dccb1594/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-28::DEBUG::2014-01-05 19:12:41,001::fileSD::239::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n361 bytes (361 B) copied, 0.00017524 s, 2.1 MB/s\n'; <rc> = 0
Thread-298::DEBUG::2014-01-05 19:12:41,088::BindingXMLRPC::974::vds::(wrapper) client [5.39.66.85]::call volumesList with () {}
Thread-298::DEBUG::2014-01-05 19:12:41,097::BindingXMLRPC::981::vds::(wrapper) return volumesList with {'status': {'message': 'Done', 'code': 0}, 'volumes': {'vmdata': {'transportType': ['TCP'], 'uuid': 'e9b05f7a-f392-44f3-9d44-04761c36437d', 'bricks': ['lovhm001.fabber.it:/vmdata', 'lovhm002.fabber.it:/vmdata'], 'volumeName': 'vmdata', 'volumeType': 'REPLICATE', 'replicaCount': '2', 'brickCount': '2', 'distCount': '2', 'volumeStatus': 'ONLINE', 'stripeCount': '1', 'options': {'cluster.server-quorum-type': 'server', 'cluster.eager-lock': 'enable', 'performance.stat-prefetch': 'off', 'auth.allow': '*', 'cluster.quorum-type': 'auto', 'performance.quick-read': 'off', 'network.remote-dio': 'enable', 'nfs.disable': 'on', 'performance.io-cache': 'off', 'server.allow-insecure': 'on', 'storage.owner-uid': '36', 'user.cifs': 'disable', 'performance.read-ahead': 'off', 'storage.owner-gid': '36', 'cluster.server-quorum-ratio': '50%'}}}}
Thread-300::DEBUG::2014-01-05 19:12:41,345::task::579::TaskManager.Task::(_updateState) Task=`d049a71c-70a4-4dc2-9d69-99f1561ab405`::moving from state init -> state preparing
Thread-300::INFO::2014-01-05 19:12:41,345::logUtils::44::dispatcher::(wrapper) Run and protect: repoStats(options=None)
Thread-300::INFO::2014-01-05 19:12:41,345::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response:
{'85ad5f7d-3b67-4618-a871-f9ec886020a4': {'delay': '0.000370968', 'lastCheck': '8.9', 'code': 0, 'valid': True, 'version': 3}, '6a9b4fa6-f393-4036-bd4e-0bc9dccb1594': {'delay': '0.00017524', 'lastCheck': '0.3', 'code': 0, 'valid': True, 'version': 0}}
Thread-300::DEBUG::2014-01-05 19:12:41,346::task::1168::TaskManager.Task::(prepare) Task=`d049a71c-70a4-4dc2-9d69-99f1561ab405`::finished: {'85ad5f7d-3b67-4618-a871-f9ec886020a4': {'delay': '0.000370968', 'lastCheck': '8.9', 'code': 0, 'valid': True, 'version': 3}, '6a9b4fa6-f393-4036-bd4e-0bc9dccb1594': {'delay': '0.00017524', 'lastCheck': '0.3', 'code': 0, 'valid': True, 'version': 0}}
Thread-300::DEBUG::2014-01-05 19:12:41,346::task::579::TaskManager.Task::(_updateState) Task=`d049a71c-70a4-4dc2-9d69-99f1561ab405`::moving from state preparing -> state finished
Thread-300::DEBUG::2014-01-05 19:12:41,346::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-300::DEBUG::2014-01-05 19:12:41,346::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-300::DEBUG::2014-01-05 19:12:41,346::task::974::TaskManager.Task::(_decref) Task=`d049a71c-70a4-4dc2-9d69-99f1561ab405`::ref 0 aborting False
Thread-27::DEBUG::2014-01-05 19:12:42,459::fileSD::239::Storage.Misc.excCmd::(getReadDelay) '/bin/dd iflag=direct if=/rhev/data-center/mnt/glusterSD/lovhm001:_vmdata/85ad5f7d-3b67-4618-a871-f9ec886020a4/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-27::DEBUG::2014-01-05 19:12:42,464::fileSD::239::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n495 bytes (495 B) copied, 0.000345681 s, 1.4 MB/s\n'; <rc> = 0
Thread-302::DEBUG::2014-01-05 19:12:42,484::BindingXMLRPC::177::vds::(wrapper) client [5.39.66.85]
Thread-302::DEBUG::2014-01-05 19:12:42,485::task::579::TaskManager.Task::(_updateState) Task=`acd727ae-dcbf-4662-bd97-fbdbadf6968a`::moving from state init -> state preparing
Thread-302::INFO::2014-01-05 19:12:42,485::logUtils::44::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='2eceb484-73e0-464a-965b-69f067918080', options=None)
Thread-302::INFO::2014-01-05 19:12:42,485::logUtils::47::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': 2, 'spmStatus': 'SPM', 'spmLver': 3}}

thanks in advance
a

--
Amedeo Salvati
RHC{DS,E,VA} - LPIC-3 - UCP - NCLA 11
m. +39 333 1264484
email: amedeo@oscert.net
email: amedeo@linux.com
http://plugcomputing.it/redhatcert.php
http://plugcomputing.it/lpicert.php
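P.S. For reference, this is roughly how I applied the two changes mentioned above. It is only a sketch, so the exact syntax may differ slightly between Gluster versions; please double-check it against your own setup.

    # cluster-wide server quorum ratio (a global option, applied to all volumes):
    gluster volume set all cluster.server-quorum-ratio 50%

    # line added inside the "volume management" section of /etc/glusterfs/glusterd.vol
    # on both nodes, to move brick ports out of the range libvirt uses for migration,
    # followed by a restart of glusterd:
    option base-port 50152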