[Users] oVirt 3.3 gluster volume active but unable to activate domain

One engine on F19 and two nodes on F19, all with the oVirt stable repo for F19. The DC is defined as GlusterFS. The volume is OK, but I can't activate the domain. The relevant logs from when I click activate are below.

On engine:

2013-10-03 23:05:10,332 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServicesListVDSCommand] (pool-6-thread-50) START, GlusterServicesListVDSCommand(HostName = f18ovn03, HostId = b67bcfd4-f868-49d5-8704-4936ee922249), log id: 5704c54f
2013-10-03 23:05:12,121 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-83) hostFromVds::selectedVds - f18ovn01, spmStatus Free, storage pool Gluster
2013-10-03 23:05:12,142 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-83) SpmStatus on vds 80188ccc-83b2-4bc8-9385-8d07f7458a3c: Free
2013-10-03 23:05:12,144 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-83) starting spm on vds f18ovn01, storage pool Gluster, prevId 1, LVER 0
2013-10-03 23:05:12,148 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServicesListVDSCommand] (pool-6-thread-46) FINISH, GlusterServicesListVDSCommand, return: [org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@955283ba, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@1ef87397, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@c1b996b6, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@30199726, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@606c4879, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@2b860d38, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@f69fd1f7], log id: 4a1b4d33
2013-10-03 23:05:12,159 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-83) START, SpmStartVDSCommand(HostName = f18ovn01, HostId = 80188ccc-83b2-4bc8-9385-8d07f7458a3c, storagePoolId = eb679feb-4da2-4fd0-a185-abbe459ffa70, prevId=1, prevLVER=0, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 62f11f2d
2013-10-03 23:05:12,169 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-83) spmStart polling started: taskId = ab9f2f84-f89b-44e9-b508-a904420635f4
2013-10-03 23:05:12,232 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServicesListVDSCommand] (pool-6-thread-50) FINISH, GlusterServicesListVDSCommand, return: [org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@b624c19b, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@3fcab178, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@e28bd497, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@50ebd507, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@813e865a, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@4c584b19, org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@17720fd8], log id: 5704c54f
2013-10-03 23:05:12,512 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler_Worker-6) START, GlusterVolumesListVDSCommand(HostName = f18ovn01, HostId = 80188ccc-83b2-4bc8-9385-8d07f7458a3c), log id: 39a3f45d
2013-10-03 23:05:12,595 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler_Worker-6) FINISH, GlusterVolumesListVDSCommand, return: {97873e57-0cc2-4740-ae38-186a8dd94718=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@a82da199, d055b38c-2754-4e53-af5c-69cc0b8bf31c=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@ef0c0180}, log id: 39a3f45d
2013-10-03 23:05:14,182 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-83) Failed in HSMGetTaskStatusVDS method
2013-10-03 23:05:14,184 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-83) Error code AcquireHostIdFailure and error message VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Cannot acquire host id
2013-10-03 23:05:14,186 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-83) spmStart polling ended: taskId = ab9f2f84-f89b-44e9-b508-a904420635f4 task status = finished
2013-10-03 23:05:14,188 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-83) Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Cannot acquire host id
2013-10-03 23:05:14,214 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-83) spmStart polling ended, spm status: Free
2013-10-03 23:05:14,228 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-83) START, HSMClearTaskVDSCommand(HostName = f18ovn01, HostId = 80188ccc-83b2-4bc8-9385-8d07f7458a3c, taskId=ab9f2f84-f89b-44e9-b508-a904420635f4), log id: 6e5df07f
2013-10-03 23:05:14,235 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-83) FINISH, HSMClearTaskVDSCommand, log id: 6e5df07f
2013-10-03 23:05:14,237 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-83) FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@5191523, log id: 62f11f2d
2013-10-03 23:05:14,239 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-83) Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: eb679feb-4da2-4fd0-a185-abbe459ffa70 Type: StoragePool
2013-10-03 23:05:14,258 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-83) IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed
2013-10-03 23:05:14,307 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-83) Irs placed on server 80188ccc-83b2-4bc8-9385-8d07f7458a3c failed. Proceed Failover
2013-10-03 23:05:14,341 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-83) hostFromVds::selectedVds - f18ovn03, spmStatus Free, storage pool Gluster
2013-10-03 23:05:14,345 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-83) SPM Init: could not find reported vds or not up - pool:Gluster vds_spm_id: 1
2013-10-03 23:05:14,360 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-83) SPM selection - vds seems as spm f18ovn01
2013-10-03 23:05:14,375 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler_Worker-83) START, SpmStopVDSCommand(HostName = f18ovn01, HostId = 80188ccc-83b2-4bc8-9385-8d07f7458a3c, storagePoolId = eb679feb-4da2-4fd0-a185-abbe459ffa70), log id: 59e35f64
2013-10-03 23:05:14,384 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-83) Command org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand return value

On first vdsm host:

Thread-22567::DEBUG::2013-10-03 23:05:00,996::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-22567::DEBUG::2013-10-03 23:05:00,996::task::974::TaskManager.Task::(_decref) Task=`b00f597e-e73c-49a1-9bb5-14a5d800c32e`::ref 0 aborting False
4a4a72d1-b1db-4498-aac1-549c986fde24::ERROR::2013-10-03 23:05:01,042::task::850::TaskManager.Task::(_setError) Task=`4a4a72d1-b1db-4498-aac1-549c986fde24`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 272, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291', SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error'))

On second vdsm host:

Thread-15262::DEBUG::2013-10-03 23:05:23,327::task::1168::TaskManager.Task::(prepare) Task=`654b1197-b3a0-4cff-9859-6515c5a908f4`::finished: {'taskStatus': {'code': 0, 'message': 'Task is initializing', 'taskState': 'running', 'taskResult': '', 'taskID': 'f8ada879-f322-4135-9803-b5f8217ac58a'}}
Thread-15262::DEBUG::2013-10-03 23:05:23,327::task::579::TaskManager.Task::(_updateState) Task=`654b1197-b3a0-4cff-9859-6515c5a908f4`::moving from state preparing -> state finished
Thread-15262::DEBUG::2013-10-03 23:05:23,327::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-15262::DEBUG::2013-10-03 23:05:23,328::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-15262::DEBUG::2013-10-03 23:05:23,328::task::974::TaskManager.Task::(_decref) Task=`654b1197-b3a0-4cff-9859-6515c5a908f4`::ref 0 aborting False
f8ada879-f322-4135-9803-b5f8217ac58a::ERROR::2013-10-03 23:05:23,356::task::850::TaskManager.Task::(_setError) Task=`f8ada879-f322-4135-9803-b5f8217ac58a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 272, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291', SanlockException(5, 'Sanlock lockspace add failure', 'Input/output error'))
f8ada879-f322-4135-9803-b5f8217ac58a::DEBUG::2013-10-03 23:05:23,358::task::869::TaskManager.Task::(_run) Task=`f8ada879-f322-4135-9803-b5f8217ac58a`::Task._run: f8ada879-f322-4135-9803-b5f8217ac58a () {} failed - stopping task
f8ada879-f322-4135-9803-b5f8217ac58a::DEBUG::2013-10-03 23:05:23,358::task::1194::TaskManager.Task::(stop) Task=`f8ada879-f322-4135-9803-b5f8217ac58a`::stopping in state running (force False)

Any hint?

Thanks,
Gianluca

In the messages log of the vdsm hosts:

Oct 3 23:05:57 f18ovn03 sanlock[1146]: 2013-10-03 23:05:57+0200 16624 [13543]: read_sectors delta_leader offset 512 rv -5 /rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
Oct 3 23:05:58 f18ovn03 sanlock[1146]: 2013-10-03 23:05:58+0200 16625 [1155]: s2142 add_lockspace fail result -5

Oct 3 23:04:24 f18ovn01 sanlock[1166]: 2013-10-03 23:04:24+0200 16154 [1172]: s2688 add_lockspace fail result -5
Oct 3 23:04:24 f18ovn01 vdsm TaskManager.Task ERROR Task=`bd6b0848-4550-483e-9002-e3051a2e1074`::Unexpected error
Oct 3 23:04:25 f18ovn01 vdsm TaskManager.Task ERROR Task=`f2ac595d-cc9d-4125-a7be-7b8706cc9ee3`::Unexpected error
Oct 3 23:04:25 f18ovn01 vdsm TaskManager.Task ERROR Task=`6c03be35-57f1-405e-b127-bd708defad67`::Unexpected error
Oct 3 23:04:26 f18ovn01 sanlock[1166]: 2013-10-03 23:04:26+0200 16157 [21348]: read_sectors delta_leader offset 0 rv -5 /rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
Oct 3 23:04:27 f18ovn01 sanlock[1166]: 2013-10-03 23:04:27+0200 16158 [1172]: s2689 add_lockspace fail result -5

But the origin is possibly a split brain detected at the gluster level. I see it in rhev-data-center-mnt-glusterSD-f18ovn01.mydomain:gvdata.log this afternoon, around the time I installed the first guest and ran a shutdown and a power on.

See: https://docs.google.com/file/d/0BwoPbcrMv8mvNHNOVlNrOFFabjQ/edit?usp=sharing

Why are the gluster logs two hours behind? UTC? Any way to set them to the current system time?

Gianluca
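The sanlock read_sectors failures (rv -5 is -EIO) mean the ids file cannot even be read; that can be confirmed independently of sanlock with a plain read on the gluster mount (an untested sketch, using the path from the log lines above):

dd if=/rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids of=/dev/null bs=512 count=1
# while the file is in split brain, the read should fail with: Input/output error
sanlock client status
# shows which lockspaces, if any, the host has actually joined

About the timestamps: gluster appears to log in UTC (which matches the two-hour offset against CEST), and as far as I know this gluster version has no option to switch the log clock to local time; a log timestamp can be translated by hand:

date -d '2013-10-03 22:06:33 UTC'
# Fri Oct  4 00:06:33 CEST 2013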

And in fact, after solving the split brain, the gluster domain activated automatically.
From rhev-data-center-mnt-glusterSD-f18ovn01.mydomain:gvdata.log under /var/log/glusterfs I found that the ids file was the one not in sync. As the VM only started on f18ovn03 and I was not able to migrate it to f18ovn01, I decided to delete the file from f18ovn01.
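Before deleting anything, it can help to compare the two brick copies to decide which node holds the good one (a sketch, assuming the brick path below is the same on both nodes; getfattr shows the trusted.afr changelog attributes gluster uses to flag the split brain):

# run on each node against its local brick copy
stat /gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
getfattr -m . -d -e hex /gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids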
BTW: what does dom_md/ids contain?

[2013-10-03 22:06:33.543730] E [afr-self-heal-common.c:197:afr_sh_print_split_brain_log] 0-gvdata-replicate-0: Unable to self-heal contents of '/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids' (possible split-brain). Please delete the file from all but the preferred subvolume.- Pending matrix: [ [ 0 2 ] [ 2 0 ] ]
[2013-10-03 22:06:33.544013] E [afr-self-heal-common.c:2212:afr_self_heal_completion_cbk] 0-gvdata-replicate-0: background data self-heal failed on /d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
[2013-10-03 22:06:33.544522] W [afr-open.c:213:afr_open] 0-gvdata-replicate-0: failed to open as split brain seen, returning EIO
[2013-10-03 22:06:33.544603] W [page.c:991:__ioc_page_error] 0-gvdata-io-cache: page error for page = 0x7f4b80004910 & waitq = 0x7f4b8001da60
[2013-10-03 22:06:33.544635] W [fuse-bridge.c:2049:fuse_readv_cbk] 0-glusterfs-fuse: 132995: READ => -1 (Input/output error)
[2013-10-03 22:06:33.545070] W [client-lk.c:367:delete_granted_locks_owner] 0-gvdata-client-0: fdctx not valid
[2013-10-03 22:06:33.545118] W [client-lk.c:367:delete_granted_locks_owner] 0-gvdata-client-1: fdctx not valid

I found that gluster creates hard links, so you have to delete all copies of the conflicting file from the brick directory of the node you chose to delete from. Thanks very much to this link: http://inuits.eu/blog/fixing-glusterfs-split-brain

So these were my steps.

Locate:

[root@f18ovn01 d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291]# find /gluster/DATA_GLUSTER/brick1/ -samefile /gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids -print
/gluster/DATA_GLUSTER/brick1/.glusterfs/ae/27/ae27eb8d-c653-4cc0-a054-ea376ce8097d
/gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids

and then delete both:

[root@f18ovn01 d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291]# find /gluster/DATA_GLUSTER/brick1/ -samefile /gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids -print -delete
/gluster/DATA_GLUSTER/brick1/.glusterfs/ae/27/ae27eb8d-c653-4cc0-a054-ea376ce8097d
/gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids

And after this step there are no more "E" lines in the gluster log, and the gluster domain was automatically activated by the engine.

Gianluca
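Whether the split brain is really gone can also be verified from gluster itself (a sketch; the volume name gvdata is taken from the mount path above):

gluster volume heal gvdata info split-brain
# should report no entries once fixed
gluster volume heal gvdata
# kicks off a self-heal so the deleted copy is rebuilt from the good brick
gluster volume heal gvdata info
# lists entries still pending heal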

On 10/04/2013 12:54 AM, Gianluca Cecchi wrote:
In the messages log of the vdsm hosts:
Oct 3 23:05:57 f18ovn03 sanlock[1146]: 2013-10-03 23:05:57+0200 16624 [13543]: read_sectors delta_leader offset 512 rv -5 /rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
Oct 3 23:05:58 f18ovn03 sanlock[1146]: 2013-10-03 23:05:58+0200 16625 [1155]: s2142 add_lockspace fail result -5
Oct 3 23:04:24 f18ovn01 sanlock[1166]: 2013-10-03 23:04:24+0200 16154 [1172]: s2688 add_lockspace fail result -5
Oct 3 23:04:24 f18ovn01 vdsm TaskManager.Task ERROR Task=`bd6b0848-4550-483e-9002-e3051a2e1074`::Unexpected error
Oct 3 23:04:25 f18ovn01 vdsm TaskManager.Task ERROR Task=`f2ac595d-cc9d-4125-a7be-7b8706cc9ee3`::Unexpected error
Oct 3 23:04:25 f18ovn01 vdsm TaskManager.Task ERROR Task=`6c03be35-57f1-405e-b127-bd708defad67`::Unexpected error
Oct 3 23:04:26 f18ovn01 sanlock[1166]: 2013-10-03 23:04:26+0200 16157 [21348]: read_sectors delta_leader offset 0 rv -5 /rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
Oct 3 23:04:27 f18ovn01 sanlock[1166]: 2013-10-03 23:04:27+0200 16158 [1172]: s2689 add_lockspace fail result -5
But the origin is possibly a split brain detected at the gluster level. I see it in rhev-data-center-mnt-glusterSD-f18ovn01.mydomain:gvdata.log this afternoon, around the time I installed the first guest and ran a shutdown and a power on.
see: https://docs.google.com/file/d/0BwoPbcrMv8mvNHNOVlNrOFFabjQ/edit?usp=sharing
Why are the gluster logs two hours behind? UTC? Any way to set them to the current system time?
Gianluca
Do you happen to know why it got into a split-brain status?

On Fri, Oct 4, 2013 at 11:30 AM, Itamar Heim wrote:
Do you happen to know why it got into a split-brain status?
I suppose with the first migration attempt. But last night I reproduced it, and I only got the heal for the ids file, which gluster apparently solved by itself. Possibly the bug about IDE can influence the migration problem and the possible gluster split brain?

What does the dom_md/ids file contain, and when does it get updated? During OS install and "yum update" it seems all was OK, with no heal on the disk file itself... So I presume a metadata update not managed correctly could somehow have caused the gluster coherence problem on this "ids" file...

Gianluca
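As far as I understand it, dom_md/ids is the sanlock lockspace of the storage domain: every host that activates the domain writes a 512-byte delta lease into the slot indexed by its host id and then keeps renewing it periodically, which is why the file changes constantly while hosts are up and why an unreadable copy blocks acquireHostId and thus SPM start. Its content can be inspected with sanlock (a sketch, assuming sanlock's direct dump subcommand; the path comes from the logs above):

sanlock direct dump /rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
# prints one row per lease slot (offset, lockspace/domain uuid, host info, timestamp)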