"Please activate the master Storage Domain first"

The DC storage master domain is on a (unrecoverable) storage on a remote dead host. Engine is automatically setting another storage as the "Data (Master)". Seconds later, the unrecoverable storage is marked as "Data (Master)" again. There is no way to start the Datacenter.
Both storages are gluster. The old (unrecoverable) one worked fine as a master.
Any hint?
Logs:
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::592::Storage.TaskManager.Task::(_updateState) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::moving from state init -> state preparing
Thread-32620::INFO::2015-04-28 16:34:02,508::logUtils::48::dispatcher::(wrapper) Run and protect: getAllTasksStatuses(spUUID=None, options=None)
Thread-32620::ERROR::2015-04-28 16:34:02,508::task::863::Storage.TaskManager.Task::(_setError) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 870, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2202, in getAllTasksStatuses
    raise se.SpmStatusError()
SpmStatusError: Not SPM: ()
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::882::Storage.TaskManager.Task::(_run) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Task._run: bf487090-8d62-4b42-bfde-93574a8e1486 () {} failed - stopping task
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::1214::Storage.TaskManager.Task::(stop) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::stopping in state preparing (force False)
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::990::Storage.TaskManager.Task::(_decref) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::ref 1 aborting True
Thread-32620::INFO::2015-04-28 16:34:02,508::task::1168::Storage.TaskManager.Task::(prepare) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::aborting: Task is aborted: 'Not SPM' - code 654
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::1173::Storage.TaskManager.Task::(prepare) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Prepare: aborted: Not SPM
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::990::Storage.TaskManager.Task::(_decref) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::ref 0 aborting True
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::925::Storage.TaskManager.Task::(_doAbort) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::Task._doAbort: force False
Thread-32620::DEBUG::2015-04-28 16:34:02,508::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::592::Storage.TaskManager.Task::(_updateState) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::moving from state preparing -> state aborting
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::547::Storage.TaskManager.Task::(__state_aborting) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::_aborting: recover policy none
Thread-32620::DEBUG::2015-04-28 16:34:02,508::task::592::Storage.TaskManager.Task::(_updateState) Task=`bf487090-8d62-4b42-bfde-93574a8e1486`::moving from state aborting -> state failed
Thread-32620::DEBUG::2015-04-28 16:34:02,508::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-32620::DEBUG::2015-04-28 16:34:02,508::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-32620::ERROR::2015-04-28 16:34:02,509::dispatcher::76::Storage.Dispatcher::(wrapper) {'status': {'message': 'Not SPM: ()', 'code': 654}}
Thread-32620::DEBUG::2015-04-28 16:34:02,509::stompReactor::158::yajsonrpc.StompServer::(send) Sending response

The DC storage master domain is on a (unrecoverable) storage on a remote dead host. Engine is automatically setting another storage as the "Data (Master)". Seconds later, the unrecoverable storage is marked as "Data (Master)" again. There is no way to start the Datacenter.
Both storages are gluster. The old (unrecoverable) one worked fine as a master.
This may be related to this bug: https://bugzilla.redhat.com/1183977. Are you using the latest engine?
Any hint?
If one gluster node dies and this brings down your data center, your gluster is probably not set up correctly. With proper replication everything should work after a storage node dies. Please check this for the recommended configuration: http://www.ovirt.org/Gluster_Storage_Domain_Reference
Nir
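For reference - the hostnames and brick paths below are just placeholders, not your exact setup - a replica 3 volume along the lines of that guide would be created roughly like this:

    gluster volume create vmstore replica 3 h4:/bricks/vmstore h5:/bricks/vmstore h6:/bricks/vmstore
    # apply the virt profile (assuming your gluster build ships the "virt" group file)
    gluster volume set vmstore group virt
    gluster volume start vmstore

With that layout a single failed storage node should not take the volume, or the master domain on top of it, down.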

On 28-04-2015 18:14, Nir Soffer wrote:
The DC storage master domain is on a (unrecoverable) storage on a remote dead host. Engine is automatically setting another storage as the "Data (Master)". Seconds later, the unrecoverable storage is marked as "Data (Master)" again. There is no way to start the Datacenter.
Both storages are gluster. The old (unrecoverable) one worked fine as a master. This may be related to this bug: https://bugzilla.redhat.com/1183977.
OK. I added a comment and explained the issue in more detail on the BZ.
Are you using the latest engine?
Yes, ovirt-engine-3.6.0-0.0.master.20150427175110.git61dec8c.el7.centos.noarch
Any hint? If one gluster node dies and this brings down your data center, your gluster is probably not set up correctly. With proper replication everything should work after a storage node dies.
Right, in theory vdsm, ovirt-engine and gluster should all be stable enough so that the Master Storage Domain is always alive. Besides, oVirt DC admins should know that a Master Storage Domain cannot be removed or firewalled out from the DC without losing the whole DC. From another point of view, oVirt should be rock solid even in the case where the Master Storage Domain goes down. It should not rely on a single SD but choose another available SD as the new master SD, and that's the way it seems to be implemented (though it is not always working). Expected result: the alive SD should become the new MSD to reactivate the DC. Issue: Engine tries to set the alive SD as the new MSD but fails without giving a reason.
Please check this for the recommended configuration: http://www.ovirt.org/Gluster_Storage_Domain_Reference
Thanks. Yes, we are applying replica 3 in "production". In our lab, funny things are happening all the time with the master nightly builds and latest gluster builds, but this helps to test and fix issues on the fly and to generate extreme test cases, making oVirt more robust.
Regards, Chris

----- Original Message -----
From: "Christopher Pereira" <kripper@imatronix.cl> To: "Nir Soffer" <nsoffer@redhat.com> Cc: devel@ovirt.org, "Liron Aravot" <laravot@redhat.com> Sent: Wednesday, April 29, 2015 6:14:32 AM Subject: Re: [ovirt-devel] "Please activate the master Storage Domain first"
[...]
Hi Chris, Can you please attach the engine/vdsm logs from the time the issue occurred? thanks.

On 29-04-2015 11:56, Liron Aravot wrote:
From: "Christopher Pereira" <kripper@imatronix.cl> To: "Nir Soffer" <nsoffer@redhat.com> Cc: devel@ovirt.org, "Liron Aravot" <laravot@redhat.com> Sent: Wednesday, April 29, 2015 6:14:32 AM Subject: Re: [ovirt-devel] "Please activate the master Storage Domain first"
On 28-04-2015 18:14, Nir Soffer wrote:
The DC storage master domain is on a (unrecoverable) storage on a remote dead host. Engine is automatically setting another storage as the "Data (Master)". Seconds later, the unrecoverable storage is marked as "Data (Master)" again. There is no way to start the Datacenter.
Both storages are gluster. The old (unrecoverable) one worked fine as a master. This may be related to this bug: https://bugzilla.redhat.com/1183977. Ok. I added a comment and explained more in detail the issue on BZ.
[...] Hi Chris, Can you please attach the engine/vdsm logs from the time the issue occurred?
----- Original Message ----- thanks.
Hi, relevant VDSM logs were published here: https://bugzilla.redhat.com/show_bug.cgi?id=1183977
And engine says "Failed in 'ReconstructMasterVDS' method" because of "Cannot acquire host id: (u'ba7be27f-aee5-4436-ae9a-0764f551f9a7', SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error'))":
2015-04-28 16:34:01,915 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-56) [] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand' return value 'TaskStatusListReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=654, mMessage=Not SPM: ()]]'
2015-04-28 16:34:01,915 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-56) [] HostName = H4
2015-04-28 16:34:01,915 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-56) [] Command 'HSMGetAllTasksStatusesVDSCommand(HostName = H4, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='0146bec9-9e8a-4611-a2ca-e4f3c8ccdb26'})' execution failed: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM: ()
2015-04-28 16:34:01,951 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-56) [] hostFromVds::selectedVds - 'H6', spmStatus 'Unknown_Pool', storage pool 'Default', storage pool version '3.6'
2015-04-28 16:34:01,953 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-56) [] START, ConnectStoragePoolVDSCommand(HostName = H6, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='75e32383-90dd-42c5-9bc9-73b6e5333fe1'}, vdsId = 75e32383-90dd-42c5-9bc9-73b6e5333fe1, storagePoolId = 00000001-0001-0001-0001-00000000007e, masterVersion = 2625), log id: 15842e4f
2015-04-28 16:34:02,035 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-56) [] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand' return value 'StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=304, mMessage=Cannot find master domain: u'spUUID=00000001-0001-0001-0001-00000000007e, msdUUID=3233144b-7be1-445f-9ea6-6aebbacbb93f']]'
2015-04-28 16:34:02,036 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-56) [] HostName = H6
2015-04-28 16:34:02,036 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-56) [] Command 'ConnectStoragePoolVDSCommand(HostName = H6, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='75e32383-90dd-42c5-9bc9-73b6e5333fe1'}, vdsId = 75e32383-90dd-42c5-9bc9-73b6e5333fe1, storagePoolId = 00000001-0001-0001-0001-00000000007e, masterVersion = 2625)' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: u'spUUID=00000001-0001-0001-0001-00000000007e, msdUUID=3233144b-7be1-445f-9ea6-6aebbacbb93f'
2015-04-28 16:34:02,036 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (DefaultQuartzScheduler_Worker-56) [] FINISH, ConnectStoragePoolVDSCommand, log id: 15842e4f
2015-04-28 16:34:02,036 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-56) [] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: u'spUUID=00000001-0001-0001-0001-00000000007e, msdUUID=3233144b-7be1-445f-9ea6-6aebbacbb93f'
2015-04-28 16:34:02,137 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] Running command: ReconstructMasterDomainCommand internal: true. Entities affected : ID: 3233144b-7be1-445f-9ea6-6aebbacbb93f Type: Storage
2015-04-28 16:34:02,235 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] START, DisconnectStoragePoolVDSCommand(HostName = H6, DisconnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='75e32383-90dd-42c5-9bc9-73b6e5333fe1'}, storagePoolId = 00000001-0001-0001-0001-00000000007e, vds_spm_id = 3), log id: 782490bb
2015-04-28 16:34:02,239 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] FINISH, DisconnectStoragePoolVDSCommand, log id: 782490bb
2015-04-28 16:34:02,240 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] START, ReconstructMasterVDSCommand(HostName = H6, ReconstructMasterVDSCommandParameters:{runAsync='true', hostId='75e32383-90dd-42c5-9bc9-73b6e5333fe1'}, vdsSpmId = 3, storagePoolId = 00000001-0001-0001-0001-00000000007e, storagePoolName = Default, masterDomainId = ba7be27f-aee5-4436-ae9a-0764f551f9a7, masterVersion = 2626, domainsList = [{ domainId: ad08adbc-b3a7-4198-a1cb-4ad70c724454, status: Unknown };{ domainId: 15674257-e14a-4693-9b95-cc4bb92aa765, status: Unknown };{ domainId: ba7be27f-aee5-4436-ae9a-0764f551f9a7, status: Unknown };{ domainId: 3233144b-7be1-445f-9ea6-6aebbacbb93f, status: Inactive };]), log id: 626fc611
2015-04-28 16:34:03,318 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] Failed in 'ReconstructMasterVDS' method
2015-04-28 16:34:03,318 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand' return value 'StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=661, mMessage=Cannot acquire host id: (u'ba7be27f-aee5-4436-ae9a-0764f551f9a7', SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error'))]]'
2015-04-28 16:34:03,318 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] HostName = H6
2015-04-28 16:34:03,318 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] Command 'ReconstructMasterVDSCommand(HostName = H6, ReconstructMasterVDSCommandParameters:{runAsync='true', hostId='75e32383-90dd-42c5-9bc9-73b6e5333fe1'}, vdsSpmId = 3, storagePoolId = 00000001-0001-0001-0001-00000000007e, storagePoolName = Default, masterDomainId = ba7be27f-aee5-4436-ae9a-0764f551f9a7, masterVersion = 2626, domainsList = [{ domainId: ad08adbc-b3a7-4198-a1cb-4ad70c724454, status: Unknown };{ domainId: 15674257-e14a-4693-9b95-cc4bb92aa765, status: Unknown };{ domainId: ba7be27f-aee5-4436-ae9a-0764f551f9a7, status: Unknown };{ domainId: 3233144b-7be1-445f-9ea6-6aebbacbb93f, status: Inactive };])' execution failed: VDSGenericException: VDSErrorException: Failed to ReconstructMasterVDS, error = Cannot acquire host id: (u'ba7be27f-aee5-4436-ae9a-0764f551f9a7', SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error')), code = 661
2015-04-28 16:34:03,319 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] FINISH, ReconstructMasterVDSCommand, log id: 626fc611
2015-04-28 16:34:03,319 ERROR [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] Command 'org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand' failed: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to ReconstructMasterVDS, error = Cannot acquire host id: (u'ba7be27f-aee5-4436-ae9a-0764f551f9a7', SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error')), code = 661 (Failed with error AcquireHostIdFailure and code 661)
2015-04-28 16:34:03,320 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] Command [id=449f2035-96bb-4b70-867c-05a7179dc75b]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: EntityStatusSnapshot:{id='storagePoolId = 00000001-0001-0001-0001-00000000007e, storageId = 3233144b-7be1-445f-9ea6-6aebbacbb93f', status='Unknown'}.
2015-04-28 16:34:03,321 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] Command [id=449f2035-96bb-4b70-867c-05a7179dc75b]: Compensating UPDATED_ONLY_ENTITY of org.ovirt.engine.core.common.businessentities.StorageDomainStatic; snapshot: StorageDomainStatic[vdisks-ssd, 3233144b-7be1-445f-9ea6-6aebbacbb93f].
2015-04-28 16:34:03,321 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-8-thread-25) [1339ee5] Command [id=449f2035-96bb-4b70-867c-05a7179dc75b]: Compensating UPDATED_ONLY_ENTITY of org.ovirt.engine.core.common.businessentities.StorageDomainStatic; snapshot: StorageDomainStatic[vdisks, ba7be27f-aee5-4436-ae9a-0764f551f9a7].
2015-04-28 16:34:03,370 INFO [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (org.ovirt.thread.pool-8-thread-6) [] Finished reconstruct for pool '00000001-0001-0001-0001-00000000007e'. Clearing event queue
2015-04-28 16:34:04,154 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder] (DefaultQuartzScheduler_Worker-65) [c1df557] Can't set graphics data from XML.
2015-04-28 16:34:04,155 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-65) [c1df557] START, DestroyVDSCommand(HostName = H4, DestroyVmVDSCommandParameters:{runAsync='true', hostId='0146bec9-9e8a-4611-a2ca-e4f3c8ccdb26', vmId='6ec9c0a0-2879-4bfe-9a79-92471881ebfe', force='false', secondsToWait='0', gracefully='false', reason=''}), log id: 238de0d4
2015-04-28 16:34:04,158 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-65) [c1df557] Unexpected return value: StatusForXmlRpc [mCode=-32603, mMessage=u'6ec9c0a0-2879-4bfe-9a79-92471881ebfe']
2015-04-28 16:34:04,158 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-65) [c1df557] Failed in 'DestroyVDS' method
2015-04-28 16:34:04,158 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-65) [c1df557] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=-32603, mMessage=u'6ec9c0a0-2879-4bfe-9a79-92471881ebfe']]'
2015-04-28 16:34:04,158 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-65) [c1df557] HostName = H4
2015-04-28 16:34:04,158 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-65) [c1df557] Command 'DestroyVDSCommand(HostName = H4, DestroyVmVDSCommandParameters:{runAsync='true', hostId='0146bec9-9e8a-4611-a2ca-e4f3c8ccdb26', vmId='6ec9c0a0-2879-4bfe-9a79-92471881ebfe', force='false', secondsToWait='0', gracefully='false', reason=''})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = u'6ec9c0a0-2879-4bfe-9a79-92471881ebfe', code = -32603
2015-04-28 16:34:04,158 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-65) [c1df557] FINISH, DestroyVDSCommand, log id: 238de0d4
The DC is now up, because I attached the old MSD, but I could try reproducing the failing command if you want.
How do I specify the dataDict in the vdsClient command?
I guess it should be something like:
vdsClient -s 0 reconstructMaster '00000001-0001-0001-0001-00000000007e' 'Default' 'ba7be27f-aee5-4436-ae9a-0764f551f9a7' '[{ domainId: ad08adbc-b3a7-4198-a1cb-4ad70c724454, status: Unknown };{ domainId: 15674257-e14a-4693-9b95-cc4bb92aa765, status: Unknown };{ domainId: ba7be27f-aee5-4436-ae9a-0764f551f9a7, status: Unknown };{ domainId: 3233144b-7be1-445f-9ea6-6aebbacbb93f, status: Inactive };]'
Maybe it's related to sanlock giving me "add_lockspace fail result -5" all the time.
Best regards, Chris

On 29-04-2015 11:56, Liron Aravot wrote:
Can you please attach the engine/vdsm logs from the time the issue occurred? thanks.
Hi, relevant VDSM logs were published here: https://bugzilla.redhat.com/show_bug.cgi?id=1183977
And engine says "Failed in 'ReconstructMasterVDS' method" because of "Cannot acquire host id: (u'ba7be27f-aee5-4436-ae9a-0764f551f9a7', SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error'))":
If reconstruct master fails, it is expected that the old master will be restored. You probably have storage issues, revealed by sanlock - it reads and writes to all storage domains every 10 seconds, so flaky storage will cause failures to acquire a host id.
Please attach these logs to the bug:
Hypervisor:
/var/log/sanlock.log
/var/log/messages
/var/log/glusterfs/<glusterhost>:_<volumename>.log
Gluster server:
server logs showing what happened when sanlock failed to access the gluster volume.
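If it is easier, the hypervisor-side files above can be grabbed in one shot with something like this (the gluster client log name depends on the mount, hence the wildcard):

    tar czf hypervisor-logs.tgz /var/log/sanlock.log /var/log/messages /var/log/glusterfs/*.log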
How do I specify the dataDict in the vdsClient command?
I guess it should be something like:
vdsClient -s 0 reconstructMaster '00000001-0001-0001-0001-00000000007e' 'Default' 'ba7be27f-aee5-4436-ae9a-0764f551f9a7' '[{ domainId: ad08adbc-b3a7-4198-a1cb-4ad70c724454, status: Unknown };{ domainId: 15674257-e14a-4693-9b95-cc4bb92aa765, status: Unknown };{ domainId: ba7be27f-aee5-4436-ae9a-0764f551f9a7, status: Unknown };{ domainId: 3233144b-7be1-445f-9ea6-6aebbacbb93f, status: Inactive };]'
Looking in client/vdsClient.py:920 - it should be:
ad08adbc-b3a7-4198-a1cb-4ad70c724454=status,15674257-e14a-4693-9b95-cc4bb92aa765=status,...
Where status is the string vdsm expects. You can check what vdsm expects or what engine is sending.
Nir
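So, keeping the argument order from the guess above and leaving the status strings as placeholders (check what vdsm/engine actually sends before running it), the call would look roughly like:

    vdsClient -s 0 reconstructMaster '00000001-0001-0001-0001-00000000007e' 'Default' 'ba7be27f-aee5-4436-ae9a-0764f551f9a7' 'ad08adbc-b3a7-4198-a1cb-4ad70c724454=<status>,15674257-e14a-4693-9b95-cc4bb92aa765=<status>,ba7be27f-aee5-4436-ae9a-0764f551f9a7=<status>,3233144b-7be1-445f-9ea6-6aebbacbb93f=<status>'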

Hi Nir / Liron,
On 29-04-2015 16:41, Nir Soffer wrote:
[...] You probably have storage issues, revealed by sanlock - it reads and writes to all storage domains every 10 seconds, so flaky storage will cause failures to acquire a host id. Please attach these logs to the bug: Hypervisor: /var/log/sanlock.log, /var/log/messages, /var/log/glusterfs/<glusterhost>:_<volumename>.log. Gluster server: server logs showing what happened when sanlock failed to access the gluster volume.
We have 3 hosts with a 1 GbE LAN connection (idle) so it's probably soft related. On host H5, we are constantly receiving sanlock warnings: May 1 17:26:37 h5 sanlock[637]: 2015-05-01 17:26:37-0300 9118 [643]: s909 add_lockspace fail result -5 May 1 17:26:46 h5 sanlock[637]: 2015-05-01 17:26:46-0300 9128 [12892]: read_sectors delta_leader offset 512 rv -5 /rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids I thought it was normal, but want to be sure. A Gluster Statedump reveals this locks: [xlator.features.locks.vdisks-locks.inode] path=/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids mandatory=0 conn.1.id=<Host H5>-3016-2015/05/01-17:54:57:109200-vdisks-client-0-0-0 conn.1.ref=1 conn.1.bound_xl=/mnt/disk1/gluster-bricks/vdisks conn.2.id=<Host H6>-3369-2015/04/30-05:40:59:928550-vdisks-client-0-0-0 conn.2.ref=1 conn.2.bound_xl=/mnt/disk1/gluster-bricks/vdisks conn.3.id=<Host H4>-31780-2015/04/30-05:57:15:152009-vdisks-client-0-0-0 conn.3.ref=1 conn.3.bound_xl=/mnt/disk1/gluster-bricks/vdisks conn.4.id=<Host H6>-16034-2015/04/30-16:40:26:355759-vdisks-client-0-0-0 conn.4.ref=1 conn.4.bound_xl=/mnt/disk1/gluster-bricks/vdisks The first lease is hold by sanlock: # ps aux | grep 637 sanlock 637 0.0 0.0 545240 19172 ? SLsl 14:54 0:01 sanlock daemon -U sanlock -G sanlock Host is up since 14:54:39, so this lease was taken after the boot. Threads: |-sanlock(637)-+-sanlock-helper(641) | |-{sanlock}(642) | |-{sanlock}(643) | |-{sanlock}(644) | |-{sanlock}(645) | |-{sanlock}(3281) | `-{sanlock}(3428) Strace: h5:~# strace -p 637 Process 637 attached restart_syscall(<... resuming interrupted call ...>) = 0 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout) poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout) poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout) poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout) poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout) poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout) poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}]) accept(7, 0, NULL) = 11 setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN}]) recvfrom(11, "\20 (\4\1\0\0\0\20\0\0\0\0\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32, MSG_WAITALL, NULL, NULL) = 32 futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN|POLLHUP}]) recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0 close(11) = 0 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}]) accept(7, 0, NULL) = 11 setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN}]) recvfrom(11, "\20 (\4\1\0\0\0\2\0\0\0\1\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32, MSG_WAITALL, NULL, NULL) = 32 futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 
1}) = 1 futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout) futex(0x7f0f3fa97640, FUTEX_WAKE_PRIVATE, 1) = 1 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN|POLLHUP}]) recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0 close(11) = 0 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout) poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout) poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}]) accept(7, 0, NULL) = 11 setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 824) = 1 ([{fd=11, revents=POLLIN}]) recvfrom(11, "\20 (\4\1\0\0\0\20\0\0\0\0\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32, MSG_WAITALL, NULL, NULL) = 32 futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 824) = 1 ([{fd=11, revents=POLLIN|POLLHUP}]) recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0 close(11) = 0 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 824) = 0 (Timeout) poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000^CProcess 637 detached
Maybe I'm doing something wrong? Do you have any idea?
The funny thing is that this CentOS 7 host constantly gets into a state where the whole root FS is not accessible any more. All you have is a bash returning I/O errors for any stored command (e.g. "ls"). I thought it was hardware related (2 x 500 GB SSD disks), but maybe we are lucky and found something kernel related.
Best regards,
Christopher.

On 29-04-2015 16:41, Nir Soffer wrote:
[...] You probably have storage issues, revealed by sanlock - it reads and writes to all storage domains every 10 seconds, so flaky storage will cause failures to acquire a host id. Please attach these logs to the bug: Hypervisor: /var/log/sanlock.log, /var/log/messages, /var/log/glusterfs/<glusterhost>:_<volumename>.log. Gluster server: server logs showing what happened when sanlock failed to access the gluster volume.
Do you plan to provide the requested logs? Or maybe explain why they are not needed?
We have 3 hosts with a 1 GbE LAN connection (idle) so it's probably soft related.
On host H5, we are constantly receiving sanlock warnings:
May 1 17:26:37 h5 sanlock[637]: 2015-05-01 17:26:37-0300 9118 [643]: s909 add_lockspace fail result -5
May 1 17:26:46 h5 sanlock[637]: 2015-05-01 17:26:46-0300 9128 [12892]: read_sectors delta_leader offset 512 rv -5 /rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids
David, can you explain what is the meaning of "rv -5"?
I thought it was normal, but want to be sure.
Any error in sanlock is not normal, and getting errors constantly is a bad sign.
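A quick way to check whether plain reads of that ids file even work from the host - roughly what sanlock's lease renewal does, using direct I/O - is something like:

    dd if=/rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids of=/dev/null bs=512 count=1 iflag=direct

If that also returns an I/O error, the problem is below sanlock, in the gluster mount or the brick.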
A Gluster Statedump reveals this locks:
[xlator.features.locks.vdisks-locks.inode] path=/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids mandatory=0 conn.1.id=<Host H5>-3016-2015/05/01-17:54:57:109200-vdisks-client-0-0-0 conn.1.ref=1 conn.1.bound_xl=/mnt/disk1/gluster-bricks/vdisks conn.2.id=<Host H6>-3369-2015/04/30-05:40:59:928550-vdisks-client-0-0-0 conn.2.ref=1 conn.2.bound_xl=/mnt/disk1/gluster-bricks/vdisks conn.3.id=<Host H4>-31780-2015/04/30-05:57:15:152009-vdisks-client-0-0-0 conn.3.ref=1 conn.3.bound_xl=/mnt/disk1/gluster-bricks/vdisks conn.4.id=<Host H6>-16034-2015/04/30-16:40:26:355759-vdisks-client-0-0-0 conn.4.ref=1 conn.4.bound_xl=/mnt/disk1/gluster-bricks/vdisks
All hosts are maintaining their host lease on the dom_md/ids file, so it is expected that gluster detects all hosts accessing the ids file.
Host is up since 14:54:39, so this lease was taken after the boot.
A lease is lost after a reboot, or if the host cannot access the storage for about 80 seconds (depending on the sanlock configuration). sanlock updates the lease every 10 seconds. If a lease is not updated, other hosts consider the lease DEAD. On the machine holding the lease, sanlock will terminate the process holding the lease, or reboot the machine if the process cannot be terminated.
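If it helps while debugging, the lockspace and the on-disk lease state can be inspected directly on the host; something along these lines (the dump path is the ids file from your logs):

    sanlock client status
    sanlock direct dump /rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids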
The funny thing is that this CentOS 7 host constantly gets into a state where the whole root FS is not accessible any more. All you have is a bash returning I/O errors for any stored command (e.g. "ls"). I thought it was hardware related (2 x 500 GB SSD disks), but maybe we are lucky and found something kernel related.
And the local storage on this host is used for a gluster brick - right? If you are using replica 3, I would expect gluster to work reliably even when this host loses access to its storage, but I guess that gluster is not designed for such flaky storage. Did you open a glusterfs bug for this? You should fix this host so it does not lose access to its storage. I don't think a setup where storage is lost constantly is useful for anything.
Nir

Hi Nir, On 02-05-2015 9:27, Nir Soffer wrote:
On 29-04-2015 16:41, Nir Soffer wrote:
[...] You probably have storage issues, revealed by sanlock - it reads and writes to all storage domains every 10 seconds, so flaky storage will cause failures to acquire a host id. Please attach these logs to the bug: Hypervisor: /var/log/sanlock.log, /var/log/messages, /var/log/glusterfs/<glusterhost>:_<volumename>.log. Gluster server: server logs showing what happened when sanlock failed to access the gluster volume. Do you plan to provide the requested logs? Or maybe explain why they are not needed?
I'm sorry. The sanlock issue was happening on a replica-2 volume, not a replica-3 one as we expected (our main volume). Since replica-2 is not supported and time is valuable, we will re-set up and restart our tests. In our lab we did really mad things like killing sanlock or just rebooting the host, mixing replica-3 and replica-2 volumes, etc.
Any error in sanlock is not normal, and getting errors constantly is a bad sign.
Good to know. We will be tracking this. This probably explains many of the issues we had.
And the local storage on this host is used for a gluster brick - right?
Yes. Is this supported? I read on the BZ that there was an issue with that.
Did you open a glusterfs bug for this? Yes, but I will ask to close it since replica-2 is not our focus and my hypothesis that sanlock shouldn't be able to block the gluster FS was wrong: https://bugzilla.redhat.com/show_bug.cgi?id=1217576
BTW, does oVirt work fine with a local replica-1 (no split-brains, no other peers) gluster storage?

May 1 17:26:37 h5 sanlock[637]: 2015-05-01 17:26:37-0300 9118 [643]: s909 add_lockspace fail result -5
May 1 17:26:46 h5 sanlock[637]: 2015-05-01 17:26:46-0300 9128 [12892]: read_sectors delta_leader offset 512 rv -5 /rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids
David, can you explain what is the meaning of "rv -5"?
EIO is returned when sanlock reads the file.
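In other words, the "rv -5" / "result -5" in the sanlock log is -EIO. A quick way to confirm the errno mapping, if needed:

    python -c "import errno; print(errno.EIO)"    # prints 5 (EIO, Input/output error)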
participants (4): Christopher Pereira, David Teigland, Liron Aravot, Nir Soffer