Re: [Users] Data Center stuck between "Non Responsive" and "Contending"

On 1/26/2014 4:00 PM, Itamar Heim wrote:
On 01/26/2014 10:51 PM, Ted Miller wrote:
On 1/26/2014 3:10 PM, Itamar Heim wrote:
On 01/26/2014 10:08 PM, Ted Miller wrote:
My Data Center is down, and won't come back up.
Data Center Status on the GUI flips between "Non Responsive" and "Contending"
Also noted:
Host sometimes seen flipping between "Low" and "Contending" in SPM column.
Storage VM2 "Data (Master)" is in "Cross Data-Center Status" = Unknown
VM2 is "up" under "Volumes" tab
Created another volume for VM storage. It shows up in the "Volumes" tab, but when I try to add a "New Domain" in the Storage tab, it says that "There are No Data Centers to which the Storage Domain can be attached".
Setup:
2 hosts w/ glusterfs storage
1 engine
all 3 computers CentOS 6.5, just updated
ovirt-engine 3.3.0.1-1.el6
ovirt-engine-lib 3.3.2-1.el6
ovirt-host-deploy.noarch 1.1.3-1.el6
glusterfs.x86_64 3.4.2-1.el6
This loop seems to repeat in the ovirt-engine log (grep of the log showing only the DefaultQuartzScheduler_Worker-79 thread):
2014-01-26 14:44:58,416 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-79) Irs placed on server 9a591103-83be-4ca9-b207-06929223b541 failed. Proceed Failover
2014-01-26 14:44:58,511 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-79) hostFromVds::selectedVds - office4a, spmStatus Free, storage pool mill
2014-01-26 14:44:58,550 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-79) SpmStatus on vds 127ed939-34af-41a8-87a0-e2f6174b1877: Free
2014-01-26 14:44:58,571 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-79) starting spm on vds office4a, storage pool mill, prevId 2, LVER 15
2014-01-26 14:44:58,579 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-79) START, SpmStartVDSCommand(HostName = office4a, HostId = 127ed939-34af-41a8-87a0-e2f6174b1877, storagePoolId = 536a864d-83aa-473a-a675-e38aafdd9071, prevId=2, prevLVER=15, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 74c38eb7
2014-01-26 14:44:58,617 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-79) spmStart polling started: taskId = e8986753-fc80-4b11-a11d-6d3470b1728c
2014-01-26 14:45:00,662 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-79) Failed in HSMGetTaskStatusVDS method
2014-01-26 14:45:00,664 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-79) Error code AcquireHostIdFailure and error message VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Cannot acquire host id
2014-01-26 14:45:00,665 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-79) spmStart polling ended: taskId = e8986753-fc80-4b11-a11d-6d3470b1728c task status = finished
2014-01-26 14:45:00,666 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-79) Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Cannot acquire host id
2014-01-26 14:45:00,695 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-79) spmStart polling ended, spm status: Free
2014-01-26 14:45:00,702 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-79) START, HSMClearTaskVDSCommand(HostName = office4a, HostId = 127ed939-34af-41a8-87a0-e2f6174b1877, taskId=e8986753-fc80-4b11-a11d-6d3470b1728c), log id: 336ec5a6
2014-01-26 14:45:00,722 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (DefaultQuartzScheduler_Worker-79) FINISH, HSMClearTaskVDSCommand, log id: 336ec5a6
2014-01-26 14:45:00,724 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-79) FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@13652652, log id: 74c38eb7
2014-01-26 14:45:00,733 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-79) Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 536a864d-83aa-473a-a675-e38aafdd9071 Type: StoragePool
2014-01-26 14:45:00,778 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-79) IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed
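For reference, a single-thread view like the one above can be pulled with a grep along these lines (the path assumes the default engine log location):

# show only the worker thread stuck in the SPM contention loop
grep 'DefaultQuartzScheduler_Worker-79' /var/log/ovirt-engine/engine.log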
Ted Miller Elkhart, IN, USA
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
is this gluster storage (guessing since you mentioned a 'volume')?

yes (mentioned under "setup" above)

does it have a quorum?

Volume Name: VM2
Type: Replicate
Volume ID: 7bea8d3b-ec2a-4939-8da8-a82e6bda841e
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.41.65.2:/bricks/01/VM2
Brick2: 10.41.65.4:/bricks/01/VM2
Brick3: 10.41.65.4:/bricks/101/VM2
Options Reconfigured:
cluster.server-quorum-type: server
storage.owner-gid: 36
storage.owner-uid: 36
auth.allow: *
user.cifs: off
nfs.disa

(there were reports of split brain on the domain metadata before when no quorum exists for gluster)

after full heal:
[root@office4a ~]$ gluster volume heal VM2 info
Gathering Heal info on volume VM2 has been successful
Brick 10.41.65.2:/bricks/01/VM2 Number of entries: 0
Brick 10.41.65.4:/bricks/01/VM2 Number of entries: 0
Brick 10.41.65.4:/bricks/101/VM2 Number of entries: 0
[root@office4a ~]$ gluster volume heal VM2 info split-brain
Gathering Heal info on volume VM2 has been successful
Brick 10.41.65.2:/bricks/01/VM2 Number of entries: 0
Brick 10.41.65.4:/bricks/01/VM2 Number of entries: 0
Brick 10.41.65.4:/bricks/101/VM2 Number of entries: 0
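As a side note on the quorum question, here is a rough sketch of how client-side quorum could be checked and enabled on a replica-3 volume like VM2 (option names should be verified against the installed gluster 3.4 documentation):

# show which quorum options are currently set on the volume
gluster volume info VM2 | grep -i quorum

# require a majority of the 3 bricks for writes (client-side quorum)
gluster volume set VM2 cluster.quorum-type auto

# server-side quorum ratio is a cluster-wide setting
gluster volume set all cluster.server-quorum-ratio 51%

Note that with Brick2 and Brick3 both on 10.41.65.4, losing that one host still drops two of the three replicas, so the brick layout matters as much as the quorum options.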
noticed this in host /var/log/messages (while looking for something else). Loop seems to repeat over and over.
Jan 26 15:35:52 office4a sanlock[3763]: 2014-01-26 15:35:52-0500 14678 [30419]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 26 15:35:53 office4a sanlock[3763]: 2014-01-26 15:35:53-0500 14679 [3771]: s1997 add_lockspace fail result -90
Jan 26 15:35:58 office4a vdsm TaskManager.Task ERROR Task=`89885661-88eb-4ea3-8793-00438735e4ab`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 2111, in getAllTasksStatuses#012 allTasksStatus = sp.getAllTasksStatuses()#012 File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper#012 raise SecureError()#012SecureError
Jan 26 15:35:59 office4a sanlock[3763]: 2014-01-26 15:35:59-0500 14686 [30495]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 26 15:36:00 office4a sanlock[3763]: 2014-01-26 15:36:00-0500 14687 [3772]: s1998 add_lockspace fail result -90
Jan 26 15:36:00 office4a vdsm TaskManager.Task ERROR Task=`8db9ff1a-2894-407a-915a-279f6a7eb205`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/storage/task.py", line 318, in run#012 return self.cmd(*self.argslist, **self.argsdict)#012 File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm#012 self.masterDomain.acquireHostId(self.id)#012 File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId#012 self._clusterLock.acquireHostId(hostId, async)#012 File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId#012 raise se.AcquireHostIdFailure(self._sdUUID, e)#012AcquireHostIdFailure: Cannot acquire host id: ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(90, 'Sanlock lockspace add failure', 'Message too long'))
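In case it helps with debugging, sanlock's own view of its lockspaces can be dumped like this (a sketch; exact output varies by sanlock version):

# summary of lockspaces and resources sanlock currently holds
sanlock client status

# sanlock's internal debug log, including the add_lockspace failures
sanlock client log_dump

The rv -90 above is -EMSGSIZE ("Message too long"), which matches the SanlockException(90, ...) in the vdsm traceback.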
Ted Miller Elkhart, IN, USA
this is the new storage domain? what about the previous volume for the first SD?
The default/default data center/cluster had to be abandoned because of a split-brain that could not be healed. I can't remove the old storage from the database, and can't get the data center up because of the corrupt storage, so it ends up a circular argument. I started over with the same hosts and totally new storage in a new data center. This mill/one data center/cluster was working fine with VM2 storage, then died.

Ted Miller

On 01/27/2014 12:00 AM, Ted Miller wrote:
[Ted's message, quoted in full above, trimmed]
fede - thoughts on above? (vojtech reported something similar, but it sorted out for him after some retries)

On 1/26/2014 5:17 PM, Itamar Heim wrote:
[earlier thread quoted in full; trimmed]
Looking at the problems with Sanlock, I read the page on it, then went poking around.
I find the sanlock.log is full of:

2014-01-26 17:26:30-0500 21316 [3772]: s3071 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0
2014-01-26 17:26:30-0500 21316 [11733]: 0322a407 aio collect 0 0x7f97380008c0:0x7f97380008d0:0x7f9738101000 result 0:0 match len 512
2014-01-26 17:26:30-0500 21316 [11733]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
2014-01-26 17:26:31-0500 21317 [3772]: s3071 add_lockspace fail result -90

Just looking at the log, it looks like sanlock is trying to "read_sectors delta_leader offset 512", probably in this file: /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids

Well, I can see why it fails -- the ids file is empty.

[root@office4a ~]$ ls /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ -l
total 1029
-rw-rw---- 1 vdsm kvm 0 Jan 22 00:44 ids
-rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 inbox
-rw-rw---- 1 vdsm kvm 2097152 Jan 21 18:20 leases
-rw-r--r-- 1 vdsm kvm 491 Jan 21 18:20 metadata
-rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 outbox

I have no idea why it is empty, or what is supposed to be there, but anything trying to read with an offset of 512 is going to kick back an error on a 0-length file. :(

I recall that on the previous ids file, on the domain that went up in smoke (improperly recovered split-brain), the file was about 1 megabyte.

The date on the 'ids' file is probably about when the Data Center went belly-up.

I hope this helps someone figure out what I can do to get out of this endless circle of trying to start the Data Center. Maybe there is a way to force an initialization of the ids file, and then everyone could be happy?

Ted Miller
It shows up in "volumes" tab, <br> but when I try to add "New Domain" in storage tab, says that "There <br> are <br> No Data Centers to which the Storage Domain can be attached" <br> <br> Setup: <br> 2 hosts w/ glusterfs storage <br> 1 engine <br> all 3 computers Centos 6.5, just updated <br> ovirt-engine 3.3.0.1-1.el6 <br> ovirt-engine-lib 3.3.2-1.el6 <br> ovirt-host-deploy.noarch 1.1.3-1.el6 <br> glusterfs.x86_64 3.4.2-1.el6 <br> <br> This loop seems to repeat in the ovirt-engine log (grep of log showing <br> only DefaultQuartzScheduler_Worker-79 thread: <br> <br> 2014-01-26 14:44:58,416 INFO <br> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] <br> (DefaultQuartzScheduler_Worker-79) Irs placed on server <br> 9a591103-83be-4ca9-b207-06929223b541 failed. Proceed Failover <br> 2014-01-26 14:44:58,511 INFO <br> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] <br> (DefaultQuartzScheduler_Worker-79) hostFromVds::selectedVds - <br> office4a, <br> spmStatus Free, storage pool mill <br> 2014-01-26 14:44:58,550 INFO <br> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] <br> (DefaultQuartzScheduler_Worker-79) SpmStatus on vds <br> 127ed939-34af-41a8-87a0-e2f6174b1877: Free <br> 2014-01-26 14:44:58,571 INFO <br> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] <br> (DefaultQuartzScheduler_Worker-79) starting spm on vds office4a, <br> storage <br> pool mill, prevId 2, LVER 15 <br> 2014-01-26 14:44:58,579 INFO <br> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] <br> (DefaultQuartzScheduler_Worker-79) START, <br> SpmStartVDSCommand(HostName = <br> office4a, HostId = 127ed939-34af-41a8-87a0-e2f6174b1877, <br> storagePoolId = <br> 536a864d-83aa-473a-a675-e38aafdd9071, prevId=2, prevLVER=15, <br> storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log <br> id: 74c38eb7 <br> 2014-01-26 14:44:58,617 INFO <br> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] <br> (DefaultQuartzScheduler_Worker-79) spmStart polling started: taskId = <br> e8986753-fc80-4b11-a11d-6d3470b1728c <br> 2014-01-26 14:45:00,662 ERROR <br> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] <br> (DefaultQuartzScheduler_Worker-79) Failed in HSMGetTaskStatusVDS <br> method <br> 2014-01-26 14:45:00,664 ERROR <br> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] <br> (DefaultQuartzScheduler_Worker-79) Error code AcquireHostIdFailure and <br> error message VDSGenericException: VDSErrorException: Failed to <br> HSMGetTaskStatusVDS, error = Cannot acquire host id <br> 2014-01-26 14:45:00,665 INFO <br> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] <br> (DefaultQuartzScheduler_Worker-79) spmStart polling ended: taskId = <br> e8986753-fc80-4b11-a11d-6d3470b1728c task status = finished <br> 2014-01-26 14:45:00,666 ERROR <br> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] <br> (DefaultQuartzScheduler_Worker-79) Start SPM Task failed - result: <br> cleanSuccess, message: VDSGenericException: VDSErrorException: <br> Failed to <br> HSMGetTaskStatusVDS, error = Cannot acquire host id <br> 2014-01-26 14:45:00,695 INFO <br> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] <br> (DefaultQuartzScheduler_Worker-79) spmStart polling ended, spm <br> status: Free <br> 2014-01-26 14:45:00,702 INFO <br> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] <br> (DefaultQuartzScheduler_Worker-79) START, <br> HSMClearTaskVDSCommand(HostName = office4a, HostId = <br> 
127ed939-34af-41a8-87a0-e2f6174b1877, <br> taskId=e8986753-fc80-4b11-a11d-6d3470b1728c), log id: 336ec5a6 <br> 2014-01-26 14:45:00,722 INFO <br> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] <br> (DefaultQuartzScheduler_Worker-79) FINISH, HSMClearTaskVDSCommand, log <br> id: 336ec5a6 <br> 2014-01-26 14:45:00,724 INFO <br> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] <br> (DefaultQuartzScheduler_Worker-79) FINISH, SpmStartVDSCommand, return: <br> org.ovirt.engine.core.common.businessentities.SpmStatusResult@13652652, <br> <br> log id: 74c38eb7 <br> 2014-01-26 14:45:00,733 INFO <br> [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] <br> (DefaultQuartzScheduler_Worker-79) Running command: <br> SetStoragePoolStatusCommand internal: true. Entities affected : ID: <br> 536a864d-83aa-473a-a675-e38aafdd9071 Type: StoragePool <br> 2014-01-26 14:45:00,778 ERROR <br> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] <br> (DefaultQuartzScheduler_Worker-79) <br> IrsBroker::Failed::GetStoragePoolInfoVDS due to: <br> IrsSpmStartFailedException: IRSGenericException: IRSErrorException: <br> SpmStart failed <br> <br> Ted Miller <br> Elkhart, IN, USA <br> <br> <br> <br> _______________________________________________ <br> Users mailing list <br> <a class="moz-txt-link-abbreviated" href="mailto:Users@ovirt.org">Users@ovirt.org</a> <br> <a class="moz-txt-link-freetext" href="http://lists.ovirt.org/mailman/listinfo/users">http://lists.ovirt.org/mailman/listinfo/users</a> <br> <br> </blockquote> <br> is this gluster storage (guessing sunce you mentioned a 'volume') <br> </blockquote> yes (mentioned under "setup" above) <br> <blockquote type="cite">does it have a quorum? <br> </blockquote> Volume Name: VM2 <br> Type: Replicate <br> Volume ID: 7bea8d3b-ec2a-4939-8da8-a82e6bda841e <br> Status: Started <br> Number of Bricks: 1 x 3 = 3 <br> Transport-type: tcp <br> Bricks: <br> Brick1: 10.41.65.2:/bricks/01/VM2 <br> Brick2: 10.41.65.4:/bricks/01/VM2 <br> Brick3: 10.41.65.4:/bricks/101/VM2 <br> Options Reconfigured: <br> cluster.server-quorum-type: server <br> storage.owner-gid: 36 <br> storage.owner-uid: 36 <br> auth.allow: * <br> user.cifs: off <br> nfs.disa <br> <blockquote type="cite">(there were reports of split brain on the domain metadata before when <br> no quorum exist for gluster) <br> </blockquote> after full heal: <br> <br> [root@office4a ~]$ gluster volume heal VM2 info <br> Gathering Heal info on volume VM2 has been successful <br> <br> Brick 10.41.65.2:/bricks/01/VM2 <br> Number of entries: 0 <br> <br> Brick 10.41.65.4:/bricks/01/VM2 <br> Number of entries: 0 <br> <br> Brick 10.41.65.4:/bricks/101/VM2 <br> Number of entries: 0 <br> [root@office4a ~]$ gluster volume heal VM2 info split-brain <br> Gathering Heal info on volume VM2 has been successful <br> <br> Brick 10.41.65.2:/bricks/01/VM2 <br> Number of entries: 0 <br> <br> Brick 10.41.65.4:/bricks/01/VM2 <br> Number of entries: 0 <br> <br> Brick 10.41.65.4:/bricks/101/VM2 <br> Number of entries: 0 <br> <br> noticed this in host /var/log/messages (while looking for something <br> else). Loop seems to repeat over and over. 
<br> <br> Jan 26 15:35:52 office4a sanlock[3763]: 2014-01-26 15:35:52-0500 14678 <br> [30419]: read_sectors delta_leader offset 512 rv -90 <br> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids <br> <br> <br> Jan 26 15:35:53 office4a sanlock[3763]: 2014-01-26 15:35:53-0500 14679 <br> [3771]: s1997 add_lockspace fail result -90 <br> Jan 26 15:35:58 office4a vdsm TaskManager.Task ERROR <br> Task=`89885661-88eb-4ea3-8793-00438735e4ab`::Unexpected <br> error#012Traceback <br> (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", <br> line <br> 857, in _run#012 return fn(*args, **kargs)#012 File <br> "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, <br> **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 2111, in <br> getAllTasksStatuses#012 allTasksStatus = sp.getAllTasksStatuses()#012 <br> File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper#012 <br> raise <br> SecureError()#012SecureError <br> Jan 26 15:35:59 office4a sanlock[3763]: 2014-01-26 15:35:59-0500 14686 <br> [30495]: read_sectors delta_leader offset 512 rv -90 <br> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids <br> <br> <br> Jan 26 15:36:00 office4a sanlock[3763]: 2014-01-26 15:36:00-0500 14687 <br> [3772]: s1998 add_lockspace fail result -90 <br> Jan 26 15:36:00 office4a vdsm TaskManager.Task ERROR <br> Task=`8db9ff1a-2894-407a-915a-279f6a7eb205`::Unexpected <br> error#012Traceback <br> (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", <br> line <br> 857, in _run#012 return fn(*args, **kargs)#012 File <br> "/usr/share/vdsm/storage/task.py", line 318, in run#012 return <br> self.cmd(*self.argslist, **self.argsdict)#012 File <br> "/usr/share/vdsm/storage/sp.py", line 273, in startSpm#012 <br> self.masterDomain.acquireHostId(self.id)#012 File <br> "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId#012 <br> self._clusterLock.acquireHostId(hostId, async)#012 File <br> "/usr/share/vdsm/storage/clusterlock.py", line 189, in <br> acquireHostId#012 raise se.AcquireHostIdFailure(self._sdUUID, <br> e)#012AcquireHostIdFailure: Cannot acquire host id: <br> ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(90, 'Sanlock <br> lockspace add failure', 'Message too long')) <br> </blockquote> </blockquote> </blockquote> <br> fede - thoughts on above? <br> (vojtech reported something similar, but it sorted out for him after some retries) <br> <br> <blockquote type="cite"> <blockquote type="cite"> <blockquote type="cite"> <br> Ted Miller <br> Elkhart, IN, USA <br> <br> </blockquote> <br> this is the new storage domain? what about the previous volume for the <br> first SD? <br> </blockquote> <br> The default/default data center/cluster had to be abandoned because of a <br> split-brain that could not be healed. Can't remove old storage from <br> database, can't get data center up due to corrupt storage, ends up a <br> circular argument. <br> <br> I started over with same hosts, totally new storage in new data center. <br> This mill/one data center/cluster was working fine with VM2 storage, <br> then died. 
<br> <br> Ted Miller </blockquote> </blockquote> Looking at the problems with Sanlock, I read the page on it, then went poking around.<br> <br> I find the sanlock.log is full of:<br> <br> <font face="Courier New, Courier, monospace">2014-01-26 17:26:30-0500 21316 [3772]: s3071 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0<br> 2014-01-26 17:26:30-0500 21316 [11733]: 0322a407 aio collect 0 0x7f97380008c0:0x7f97380008d0:0x7f9738101000 result 0:0 match len 512<br> 2014-01-26 17:26:30-0500 21316 [11733]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids<br> 2014-01-26 17:26:31-0500 21317 [3772]: s3071 add_lockspace fail result -90<br> </font><br> Just looking at the log, it looks like sanlock is trying to "read_sectors delta_leader offset 512", probably in this file: /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids<br> <br> Well, I can see why it fails -- the ids file is empty.<br> <font face="Courier New, Courier, monospace">[root@office4a ~]$ ls /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ -l<br> total 1029<br> -rw-rw---- 1 vdsm kvm 0 Jan 22 00:44 ids<br> -rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 inbox<br> -rw-rw---- 1 vdsm kvm 2097152 Jan 21 18:20 leases<br> -rw-r--r-- 1 vdsm kvm 491 Jan 21 18:20 metadata<br> -rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 outbox</font><br> <br> I have no idea why it is empty, or what is supposed to be there, but anything trying to read with an offset of 512 is going to kick back an error on a 0 length file. :(<br> <br> I recall that on the previous ids file, on the domain that went up in smoke (improperly recovered split-brain), the file was about 1 megabyte.<br> <br> The date on the 'ids' file is probably about when the Data Center went belly-up.<br> <br> I hope this helps someone figure out what I can do to get out of this endless circle of trying to start the Data Center. Maybe there is a way to force an initialization of the ids file, and then everyone could be happy?<br> <br> Ted Miller<br> <br> </body> </html> --------------030406020605090601060306--
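If the problem really is just the zero-length ids file, one possible way out (a sketch only -- untested here, and worth confirming with the sanlock/vdsm folks before touching live storage, especially on a gluster-backed domain) would be to let sanlock rebuild the delta-lease area with its direct init command, using the storage domain UUID as the lockspace name:

# with the data center down and no host contending for SPM, keep a copy of the current (empty) file
cp -a /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids /root/ids.bak

# re-initialize the lockspace; sanlock's lockspace spec is colon-delimited,
# so the colon in the gluster mount path is escaped as '\:'
sanlock direct init -s '0322a407-2b16-40dc-ac67-13d387c6eb4c:0:/rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0'

# put ownership/permissions back to what vdsm expects (matches the ls output above)
chown vdsm:kvm /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
chmod 660 /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids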

On 1/26/2014 6:24 PM, Ted Miller wrote:
[earlier thread quoted in full; trimmed]
Since this has turned into a quest that seems to be focused on sanlock, I have started a new thread under the subject: "sanlock can't read empty 'ids' file".

Ted Miller
<br> <br> Ted Miller </blockquote> </blockquote> Looking at the problems with Sanlock, I read the page on it, then went poking around.<br> <br> I find the sanlock.log is full of:<br> <br> <font face="Courier New, Courier, monospace">2014-01-26 17:26:30-0500 21316 [3772]: s3071 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0<br> 2014-01-26 17:26:30-0500 21316 [11733]: 0322a407 aio collect 0 0x7f97380008c0:0x7f97380008d0:0x7f9738101000 result 0:0 match len 512<br> 2014-01-26 17:26:30-0500 21316 [11733]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids<br> 2014-01-26 17:26:31-0500 21317 [3772]: s3071 add_lockspace fail result -90<br> </font><br> Just looking at the log, it looks like sanlock is trying to "read_sectors delta_leader offset 512", probably in this file: /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids<br> <br> Well, I can see why it fails -- the ids file is empty.<br> <font face="Courier New, Courier, monospace">[root@office4a ~]$ ls /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ -l<br> total 1029<br> -rw-rw---- 1 vdsm kvm 0 Jan 22 00:44 ids<br> -rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 inbox<br> -rw-rw---- 1 vdsm kvm 2097152 Jan 21 18:20 leases<br> -rw-r--r-- 1 vdsm kvm 491 Jan 21 18:20 metadata<br> -rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 outbox</font><br> <br> I have no idea why it is empty, or what is supposed to be there, but anything trying to read with an offset of 512 is going to kick back an error on a 0 length file. :(<br> <br> I recall that on the previous ids file, on the domain that went up in smoke (improperly recovered split-brain), the file was about 1 megabyte.<br> <br> The date on the 'ids' file is probably about when the Data Center went belly-up.<br> <br> I hope this helps someone figure out what I can do to get out of this endless circle of trying to start the Data Center. Maybe there is a way to force an initialization of the ids file, and then everyone could be happy?<br> <br> Ted Miller<br> </blockquote> Since this has turned into a quest that seems to be focused on sanlock, I have started a new thread under the subject: "sanlock can't read empty 'ids' file".<br> Ted Miller<br> <br> </body> </html> --------------080006060704060003070706--
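A minimal way to confirm that symptom, assuming only standard coreutils and the path shown in the ls output above (this is a suggestion, not part of the original report):

$ IDS=/rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
$ stat -c %s "$IDS"                                     # prints 0 here, so any read at offset 512 has to fail
$ dd if="$IDS" bs=512 count=1 2>/dev/null | hexdump -C  # prints nothing for a zero-length file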

----- Original Message -----
From: "Itamar Heim" <iheim@redhat.com> To: "Ted Miller" <tmiller@hcjb.org>, users@ovirt.org, "Federico Simoncelli" <fsimonce@redhat.com> Cc: "Allon Mureinik" <amureini@redhat.com> Sent: Sunday, January 26, 2014 11:17:04 PM Subject: Re: [Users] Data Center stuck between "Non Responsive" and "Contending"
On 01/27/2014 12:00 AM, Ted Miller wrote:
On 1/26/2014 4:00 PM, Itamar Heim wrote:
On 01/26/2014 10:51 PM, Ted Miller wrote:
On 1/26/2014 3:10 PM, Itamar Heim wrote:
On 01/26/2014 10:08 PM, Ted Miller wrote:

is this gluster storage (guessing since you mentioned a 'volume')

yes (mentioned under "setup" above)

does it have a quorum?

Volume Name: VM2
Type: Replicate
Volume ID: 7bea8d3b-ec2a-4939-8da8-a82e6bda841e
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.41.65.2:/bricks/01/VM2
Brick2: 10.41.65.4:/bricks/01/VM2
Brick3: 10.41.65.4:/bricks/101/VM2
Options Reconfigured:
cluster.server-quorum-type: server
storage.owner-gid: 36
storage.owner-uid: 36
auth.allow: *
user.cifs: off
nfs.disa

(there were reports of split brain on the domain metadata before when no quorum exists for gluster)

after full heal:
[root@office4a ~]$ gluster volume heal VM2 info
Gathering Heal info on volume VM2 has been successful

Brick 10.41.65.2:/bricks/01/VM2
Number of entries: 0

Brick 10.41.65.4:/bricks/01/VM2
Number of entries: 0

Brick 10.41.65.4:/bricks/101/VM2
Number of entries: 0

[root@office4a ~]$ gluster volume heal VM2 info split-brain
Gathering Heal info on volume VM2 has been successful

Brick 10.41.65.2:/bricks/01/VM2
Number of entries: 0

Brick 10.41.65.4:/bricks/01/VM2
Number of entries: 0

Brick 10.41.65.4:/bricks/101/VM2
Number of entries: 0
noticed this in host /var/log/messages (while looking for something else). Loop seems to repeat over and over.
Jan 26 15:35:52 office4a sanlock[3763]: 2014-01-26 15:35:52-0500 14678 [30419]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 26 15:35:53 office4a sanlock[3763]: 2014-01-26 15:35:53-0500 14679 [3771]: s1997 add_lockspace fail result -90
Jan 26 15:35:58 office4a vdsm TaskManager.Task ERROR Task=`89885661-88eb-4ea3-8793-00438735e4ab`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 2111, in getAllTasksStatuses#012 allTasksStatus = sp.getAllTasksStatuses()#012 File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper#012 raise SecureError()#012SecureError
Jan 26 15:35:59 office4a sanlock[3763]: 2014-01-26 15:35:59-0500 14686 [30495]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 26 15:36:00 office4a sanlock[3763]: 2014-01-26 15:36:00-0500 14687 [3772]: s1998 add_lockspace fail result -90
Jan 26 15:36:00 office4a vdsm TaskManager.Task ERROR Task=`8db9ff1a-2894-407a-915a-279f6a7eb205`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/storage/task.py", line 318, in run#012 return self.cmd(*self.argslist, **self.argsdict)#012 File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm#012 self.masterDomain.acquireHostId(self.id)#012 File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId#012 self._clusterLock.acquireHostId(hostId, async)#012 File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId#012 raise se.AcquireHostIdFailure(self._sdUUID, e)#012AcquireHostIdFailure: Cannot acquire host id: ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(90, 'Sanlock lockspace add failure', 'Message too long'))
fede - thoughts on above? (vojtech reported something similar, but it sorted out for him after some retries)
Something truncated the ids file, as also reported by:
[root@office4a ~]$ ls /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ -l
total 1029
-rw-rw---- 1 vdsm kvm 0 Jan 22 00:44 ids
-rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 inbox
-rw-rw---- 1 vdsm kvm 2097152 Jan 21 18:20 leases
-rw-r--r-- 1 vdsm kvm 491 Jan 21 18:20 metadata
-rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 outbox
In the past I saw that happening because of a glusterfs bug:
https://bugzilla.redhat.com/show_bug.cgi?id=862975

Anyway in general it seems that glusterfs is not always able to reconcile the ids file (as it's written by all the hosts at the same time).

Maybe someone from gluster can identify easily what happened. Meanwhile if you just want to repair your data-center you could try with:

$ cd /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/
$ touch ids
$ sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576

-- Federico
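If the init succeeds, one possible sanity check (a sketch only; it assumes sanlock's "direct dump" sub-command is available and behaves as remembered) would be to confirm the file is no longer empty and that sanlock can read its own leader records back:

$ ls -l ids               # should no longer be 0 bytes
$ sanlock direct dump ids # should print the freshly initialized delta-lease records rather than a read error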

I'm adding Vijay to see if he can help here.

Dafna

On 01/27/2014 08:47 AM, Federico Simoncelli wrote:
[snip]

Maybe someone from gluster can identify easily what happened.
-- Dafna Ron

On 1/27/2014 3:47 AM, Federico Simoncelli wrote:
[snip]

Maybe someone from gluster can identify easily what happened. Meanwhile if you just want to repair your data-center you could try with:

$ cd /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/
$ touch ids
$ sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576

Federico,
I won't be able to do anything to the ovirt setup for another 5 hours or so (it is a trial system I am working on at home, I am at work), but I will try your repair script and report back.

In bugzilla 862975 they suggested turning off write-behind caching and "eager locking" on the gluster volume to avoid/reduce the problems that come from many different computers all writing to the same file(s) on a very frequent basis. If I interpret the comment in the bug correctly, it did seem to help in that situation. My situation is a little different. My gluster setup is replicate only, replica 3 (though there are only two hosts). I was not stress-testing it, I was just using it, trying to figure out how I can import some old VMWare VMs without an ESXi server to run them on.

I am guessing that what makes cluster storage have the (Master) designation is that this is the one that actually contains the sanlocks? If so, would it make sense to set up a gluster volume to be (Master), but not use it for VM storage, just for storing the sanlock info? Separate gluster volume(s) could then have the VMs on it(them), and would not need the optimizations turned off.

If it is any help, I put some excerpts from my sanlock.log files in a separate thread with the subject: "sanlock can't read empty 'ids' file"

Thank you for your prompt response.

Ted Miller
Elkhart, IN
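For reference, the write-behind and eager-locking settings discussed a few paragraphs above map onto gluster volume options. A minimal sketch, assuming the option names performance.write-behind and cluster.eager-lock apply to this gluster release:

$ gluster volume set VM2 performance.write-behind off
$ gluster volume set VM2 cluster.eager-lock off
$ gluster volume info VM2   # the changed options should appear under "Options Reconfigured"

Whether that is advisable for a VM store is a separate question; the trade-off the bug describes is giving up some write performance to reduce concurrent-write trouble on shared files like dom_md/ids.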

----- Original Message -----
From: "Ted Miller" <tmiller@hcjb.org> To: "Federico Simoncelli" <fsimonce@redhat.com>, "Itamar Heim" <iheim@redhat.com> Cc: users@ovirt.org Sent: Monday, January 27, 2014 7:16:14 PM Subject: Re: [Users] Data Center stuck between "Non Responsive" and "Contending"
On 1/27/2014 3:47 AM, Federico Simoncelli wrote:
Maybe someone from gluster can identify easily what happened. Meanwhile if you just want to repair your data-center you could try with:
$ cd /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/
$ touch ids
$ sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576

Federico,
I won't be able to do anything to the ovirt setup for another 5 hours or so (it is a trial system I am working on at home, I am at work), but I will try your repair script and report back.
In bugzilla 862975 they suggested turning off write-behind caching and "eager locking" on the gluster volume to avoid/reduce the problems that come from many different computers all writing to the same file(s) on a very frequent basis. If I interpret the comment in the bug correctly, it did seem to help in that situation. My situation is a little different. My gluster setup is replicate only, replica 3 (though there are only two hosts). I was not stress-testing it, I was just using it, trying to figure out how I can import some old VMWare VMs without an ESXi server to run them on.
Have you done anything similar to what is described here in comment 21?
https://bugzilla.redhat.com/show_bug.cgi?id=859589#c21

When did you realize that you weren't able to use the data-center anymore? Can you describe exactly what you did and what happened, for example:

1. I created the data center (up and running)
2. I tried to import some VMs from VMWare
3. During the import (or after it) the data-center went into the contending state
...

Did something special happen? I don't know, power loss, split-brain? For example, an excessive load on one of the servers could also have triggered a timeout somewhere (forcing the data-center to go back into the contending state).

Could you check if any host was fenced? (Forcibly rebooted)
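One way to check the fencing question, assuming the default log locations (a sketch, not something prescribed in the thread):

$ grep -iE 'fence|reboot' /var/log/ovirt-engine/engine.log   # on the engine
$ grep -E 'sanlock|wdmd' /var/log/messages                   # on each host; sanlock/watchdog kills show up here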
I am guessing that what makes cluster storage have the (Master) designation is that this is the one that actually contains the sanlocks? If so, would it make sense to set up a gluster volume to be (Master), but not use it for VM storage, just for storing the sanlock info? Separate gluster volume(s) could then have the VMs on it(them), and would not need the optimizations turned off.
Any domain must be able to become the master at any time. Without a master the data center is unusable (at the present time), that's why we migrate (or reconstruct) it on another domain when necessary.

-- Federico

Federico, thank you for your help so far. Lots more information below.

On 1/27/2014 4:46 PM, Federico Simoncelli wrote:
----- Original Message -----
From: "Ted Miller" <tmiller@hcjb.org>
On 1/27/2014 3:47 AM, Federico Simoncelli wrote:
Maybe someone from gluster can identify easily what happened. Meanwhile if you just want to repair your data-center you could try with:

$ cd /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/
$ touch ids
$ sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576
I tried your suggestion, and it helped, but it was not enough.

[root@office4a ~]$ cd /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/
[root@office4a dom_md]$ touch ids
[root@office4a dom_md]$ sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576
init done 0

Let me explain a little.

When the problem originally happened, the sanlock.log started having -223 error messages. 10 seconds later the log switched from -223 messages to -90 messages. Running your little script changed the error from -90 back to -223.

I hope you can send me another script that will get rid of the -223 messages.

Here is the sanlock.log as I ran your script:

2014-01-27 19:40:41-0500 39281 [3803]: s13 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0
2014-01-27 19:40:41-0500 39281 [22751]: 0322a407 aio collect 0 0x7f54240008c0:0x7f54240008d0:0x7f5424101000 result 0:0 match len 512
2014-01-27 19:40:41-0500 39281 [22751]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
2014-01-27 19:40:42-0500 39282 [3803]: s13 add_lockspace fail result -90
2014-01-27 19:40:47-0500 39287 [3803]: s14 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0
2014-01-27 19:40:47-0500 39287 [22795]: 0322a407 aio collect 0 0x7f54240008c0:0x7f54240008d0:0x7f5424101000 result 0:0 match len 512
2014-01-27 19:40:47-0500 39287 [22795]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
2014-01-27 19:40:48-0500 39288 [3803]: s14 add_lockspace fail result -90
2014-01-27 19:40:56-0500 39296 [3802]: s15 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0
2014-01-27 19:40:56-0500 39296 [22866]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
2014-01-27 19:40:56-0500 39296 [22866]: leader1 delta_acquire_begin error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
2014-01-27 19:40:56-0500 39296 [22866]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
2014-01-27 19:40:56-0500 39296 [22866]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
2014-01-27 19:40:56-0500 39296 [22866]: leader4 sn rn ts 0 cs 0
2014-01-27 19:40:57-0500 39297 [3802]: s15 add_lockspace fail result -223
2014-01-27 19:40:57-0500 39297 [3802]: s16 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0
2014-01-27 19:40:57-0500 39297 [22870]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
2014-01-27 19:40:57-0500 39297 [22870]: leader1 delta_acquire_begin error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
2014-01-27 19:40:57-0500 39297 [22870]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
2014-01-27 19:40:57-0500 39297 [22870]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
2014-01-27 19:40:57-0500 39297 [22870]: leader4 sn rn ts 0 cs 0
2014-01-27 19:40:58-0500 39298 [3802]: s16 add_lockspace fail result -223
2014-01-27 19:41:07-0500 39307 [3802]: s17 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0

Unfortunately, I think the error looks about the same to vdsm, because /var/log/messages shows the same two lines in the calling scripts on the callback lists (66 & 425, if I remember right).

When I get up in the morning, I will be looking for another magic potion from your pen. :)
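Since the error is now "wrong magic" rather than a short read, one thing that could be worth checking (a sketch; it assumes the domain directory is laid out the same way under each brick root as it is on the gluster mount) is whether the three replicas of the ids file actually agree after the init:

# on 10.41.65.2
$ md5sum /bricks/01/VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
# on 10.41.65.4 (both bricks)
$ md5sum /bricks/01/VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids /bricks/101/VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids

If the sums differ, the replicas disagree and would need healing before sanlock can trust what it reads through the mount.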
Federico,
I won't be able to do anything to the ovirt setup for another 5 hours or so (it is a trial system I am working on at home, I am at work), but I will try your repair script and report back.

In bugzilla 862975 they suggested turning off write-behind caching and "eager locking" on the gluster volume to avoid/reduce the problems that come from many different computers all writing to the same file(s) on a very frequent basis. If I interpret the comment in the bug correctly, it did seem to help in that situation. My situation is a little different. My gluster setup is replicate only, replica 3 (though there are only two hosts). I was not stress-testing it, I was just using it, trying to figure out how I can import some old VMWare VMs without an ESXi server to run them on.

Have you done anything similar to what is described here in comment 21?
https://bugzilla.redhat.com/show_bug.cgi?id=859589#c21

I have no idea, because that bug is so dangerous that the general public is not allowed to even see it. I get a "You are not authorized to access ...." message. :) However, see just below.
When did you realize that you weren't able to use the data-center anymore? Can you describe exactly what you did and what happened, for example:

1. I created the data center (up and running)
2. I tried to import some VMs from VMWare
3. During the import (or after it) the data-center went into the contending state
...

Did something special happen?
It took me a few hours of going back through and matching times on logs to figure it out, as it has been a week since it happened. When I saw some things in the logs, I finally remembered the circumstances. I will summarize it for you, because I think this is what matters. If you want actual logs, tell me which logs and for which events, and I will package them up for you.

The host suspended all VMs because gluster hit the 90% full limit. It took me a while to figure out what was going on. I expanded the logical volume, but the Data Center would not come back up.

The GUI events log says:

Invalid status on Data Center mill. Setting Data Center status to Non Responsive (On host office4a, Error: Network error during communications with the Host.).

The host /var/log/messages says:

Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: leader1 delta_renew error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: s123 delta_renew verify_leader error -223
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: s123 renewal error -223 delta_length 0 last_success 521369
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [3362]: s123 check_our_lease corrupt -223
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [3362]: s123 kill 32599 sig 15 count 1
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader1 delta_renew error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: s123 delta_renew verify_leader error -223
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: s123 renewal error -223 delta_length 0 last_success 521369
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader1 delta_renew error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: s123 delta_renew verify_leader error -223
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: s123 renewal error -223 delta_length 0 last_success 521369
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [3362]: s123 kill 32599 sig 15 count 2
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: leader1 delta_renew error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: s123 delta_renew verify_leader error -223
[snip many repeats]
Jan 22 00:44:57 office4a sanlock[3362]: 2014-01-22 00:44:57-0500 521398 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:57 office4a sanlock[3362]: 2014-01-22 00:44:57-0500 521398 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:57 office4a sanlock[3362]: 2014-01-22 00:44:57-0500 521398 [30933]: s123 delta_renew verify_leader error -223
Jan 22 00:44:57 office4a sanlock[3362]: 2014-01-22 00:44:57-0500 521398 [30933]: s123 renewal error -223 delta_length 0 last_success 521369
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [3362]: s123 kill 32599 sig 15 count 10
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [3362]: dead 32599 ci 4 count 10
Jan 22 00:44:58 office4a respawn: slave '/usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid' died, respawning slave
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [3362]: s123 all pids clear
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: leader1 delta_renew error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: s123 delta_renew verify_leader error -223
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: s123 renewal error -223 delta_length 0 last_success 521369
Jan 22 00:44:59 office4a vdsm vds WARNING Unable to load the json rpc server module. Please make sure it is installed.
Jan 22 00:44:59 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel0', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '1'}}' found
Jan 22 00:44:59 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel1', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '2'}}' found
Jan 22 00:44:59 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'spicevmc', 'alias': 'channel2', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '3'}}' found
Jan 22 00:45:00 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::_readPauseCode unsupported by libvirt vm
Jan 22 00:45:00 office4a kernel: ata1: hard resetting link
Jan 22 00:45:00 office4a kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jan 22 00:45:00 office4a kernel: ata1.00: configured for UDMA/133
Jan 22 00:45:00 office4a kernel: ata1: EH complete
Jan 22 00:45:01 office4a kernel: ata2: hard resetting link
Jan 22 00:45:01 office4a kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 22 00:45:01 office4a kernel: ata2.00: configured for UDMA/133
Jan 22 00:45:01 office4a kernel: ata2: EH complete
Jan 22 00:45:02 office4a kernel: ata3: hard resetting link
Jan 22 00:45:02 office4a kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 22 00:45:02 office4a kernel: ata3.00: configured for UDMA/133
Jan 22 00:45:02 office4a kernel: ata3: EH complete
Jan 22 00:45:03 office4a kernel: ata4: hard resetting link
Jan 22 00:45:03 office4a kernel: ata4: SATA link down (SStatus 0 SControl 300)
Jan 22 00:45:03 office4a kernel: ata4: EH complete
Jan 22 00:45:04 office4a kernel: ata5: soft resetting link
Jan 22 00:45:04 office4a kernel: ata5.00: configured for UDMA/33
Jan 22 00:45:04 office4a kernel: ata5: EH complete
Jan 22 00:45:05 office4a kernel: ata6: soft resetting link
Jan 22 00:45:05 office4a kernel: ata6: EH complete
Jan 22 00:45:08 office4a kernel: ata1: hard resetting link
Jan 22 00:45:08 office4a kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jan 22 00:45:08 office4a kernel: ata1.00: configured for UDMA/133
Jan 22 00:45:08 office4a kernel: ata1: EH complete
Jan 22 00:45:09 office4a kernel: ata2: hard resetting link
Jan 22 00:45:09 office4a kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 22 00:45:09 office4a kernel: ata2.00: configured for UDMA/133
Jan 22 00:45:09 office4a kernel: ata2: EH complete
Jan 22 00:45:10 office4a kernel: ata3: hard resetting link
Jan 22 00:45:10 office4a kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 22 00:45:10 office4a kernel: ata3.00: configured for UDMA/133
Jan 22 00:45:10 office4a kernel: ata3: EH complete
Jan 22 00:45:11 office4a kernel: ata4: hard resetting link
Jan 22 00:45:11 office4a kernel: ata4: SATA link down (SStatus 0 SControl 300)
Jan 22 00:45:11 office4a kernel: ata4: EH complete
Jan 22 00:45:12 office4a kernel: ata5: soft resetting link
Jan 22 00:45:12 office4a kernel: ata5.00: configured for UDMA/33
Jan 22 00:45:12 office4a kernel: ata5: EH complete
Jan 22 00:45:13 office4a kernel: ata6: soft resetting link
Jan 22 00:45:13 office4a kernel: ata6: EH complete
Jan 22 00:45:15 office4a sanlock[3362]: 2014-01-22 00:45:15-0500 521416 [17150]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:16 office4a sanlock[3362]: 2014-01-22 00:45:16-0500 521417 [3367]: s124 add_lockspace fail result -90
Jan 22 00:45:20 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel0', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '1'}}' found
Jan 22 00:45:20 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel1', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '2'}}' found
Jan 22 00:45:20 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'spicevmc', 'alias': 'channel2', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '3'}}' found
Jan 22 00:45:21 office4a sanlock[3362]: 2014-01-22 00:45:21-0500 521422 [17199]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:22 office4a sanlock[3362]: 2014-01-22 00:45:22-0500 521423 [3368]: s125 add_lockspace fail result -90
Jan 22 00:45:22 office4a vdsm TaskManager.Task ERROR Task=`2fefe6f9-4dac-45f2-b237-0104ce890146`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/storage/task.py", line 318, in run#012 return self.cmd(*self.argslist, **self.argsdict)#012 File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm#012 self.masterDomain.acquireHostId(self.id)#012 File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId#012 self._clusterLock.acquireHostId(hostId, async)#012 File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId#012 raise se.AcquireHostIdFailure(self._sdUUID, e)#012AcquireHostIdFailure: Cannot acquire host id: ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(90, 'Sanlock lockspace add failure', 'Message too long'))
Jan 22 00:45:25 office4a sanlock[3362]: 2014-01-22 00:45:25-0500 521426 [17243]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:26 office4a sanlock[3362]: 2014-01-22 00:45:26-0500 521427 [3368]: s126 add_lockspace fail result -90
Jan 22 00:45:33 office4a vdsm TaskManager.Task ERROR Task=`8f8931ba-a9f6-4225-a951-2d608b3b1d70`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 2110, in getAllTasksStatuses#012 allTasksStatus = sp.getAllTasksStatuses()#012 File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper#012 raise SecureError()#012SecureError
Jan 22 00:45:36 office4a sanlock[3362]: 2014-01-22 00:45:36-0500 521437 [17334]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:37 office4a sanlock[3362]: 2014-01-22 00:45:37-0500 521438 [3368]: s127 add_lockspace fail result -90
Jan 22 00:45:37 office4a vdsm TaskManager.Task ERROR Task=`6b7da394-63b6-4e52-ac73-ab98f3f1e60f`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/storage/task.py", line 318, in run#012 return self.cmd(*self.argslist, **self.argsdict)#012 File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm#012 self.masterDomain.acquireHostId(self.id)#012 File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId#012 self._clusterLock.acquireHostId(hostId, async)#012 File "/usr/share/vdsm/storage/clusterlock.py", line 185, in acquireHostId#012 raise se.AcquireHostIdFailure(self._sdUUID, e)#012AcquireHostIdFailure: Cannot acquire host id: ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(115, 'Sanlock lockspace add failure', 'Operation now in progress'))
Jan 22 00:45:46 office4a sanlock[3362]: 2014-01-22 00:45:46-0500 521447 [17410]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:47 office4a sanlock[3362]: 2014-01-22 00:45:47-0500 521448 [3367]: s128 add_lockspace fail result -90
Jan 22 00:45:48 office4a vdsm TaskManager.Task ERROR Task=`c2486380-a0c8-45d1-a63f-e0e660f1e118`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 2110, in getAllTasksStatuses#012 allTasksStatus = sp.getAllTasksStatuses()#012 File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper#012 raise SecureError()#012SecureError
Jan 22 00:45:50 office4a sanlock[3362]: 2014-01-22 00:45:50-0500 521451 [17457]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:51 office4a sanlock[3362]: 2014-01-22 00:45:51-0500 521452 [3368]: s129 add_lockspace fail result -90
I am guessing that what makes cluster storage have the (Master) designation is that this is the one that actually contains the sanlocks? If so, would it make sense to set up a gluster volume to be (Master), but not use it for VM storage, just for storing the sanlock info? Separate gluster volume(s) could then have the VMs on it(them), and would not need the optimizations turned off.

Any domain must be able to become the master at any time. Without a master the data center is unusable (at the present time), that's why we migrate (or reconstruct) it on another domain when necessary.

Ted Miller
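As an aside on the trigger (the volume filling up to the 90% limit described earlier), a sketch of how one might keep an eye on brick usage, assuming this gluster release supports the detailed status output:

$ gluster volume status VM2 detail   # per-brick free/total disk space, as I recall
$ df -h /bricks/*                    # plain filesystem view on each host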
--------------020403010102000502070801 Content-Type: text/html; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit <html> <head> <meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type"> </head> <body text="#000000" bgcolor="#FFFFFF"> Federico, thank you for your help so far. Lots of more information below.<br> <br> On 1/27/2014 4:46 PM, Federico Simoncelli wrote:<br> <blockquote cite="mid:1916841992.8207492.1390859170488.JavaMail.root@redhat.com" type="cite"> <pre wrap="">----- Original Message ----- </pre> <blockquote type="cite"> <pre wrap="">From: "Ted Miller" <a class="moz-txt-link-rfc2396E" href="mailto:tmiller@hcjb.org"><tmiller@hcjb.org></a> On 1/27/2014 3:47 AM, Federico Simoncelli wrote: </pre> <blockquote type="cite"> <pre wrap="">Maybe someone from gluster can identify easily what happened. Meanwhile if you just want to repair your data-center you could try with: $ cd /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ $ touch ids $ sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576</pre> </blockquote> </blockquote> </blockquote> <br> I tried your suggestion, and it helped, but it was not enough. <br> <blockquote> <pre>[root@office4a ~]$ cd <span class="moz-txt-slash"><span class="moz-txt-tag">/</span>rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md<span class="moz-txt-tag">/</span></span> </pre> <pre>[root@office4a dom_md]$ touch ids </pre> <pre>[root@office4a dom_md]$ sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576 </pre> <pre>init done 0 </pre> </blockquote> Let me explain a little.<br> <br> When the problem originally happened, the sanlock.log started having -223 error messages. 10 seconds later the log switched from -223 messages to -90 messages. 
Running your little script changed the error from -90 back to -223.<br> <br> I hope you can send me another script that will get rid of the -223 messages.<br> <br> Here is the sanlock.log as I ran your script:<br> <br> <blockquote> <pre>2014-01-27 19:40:41-0500 39281 [3803]: s13 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0 </pre> <pre>2014-01-27 19:40:41-0500 39281 [22751]: 0322a407 aio collect 0 0x7f54240008c0:0x7f54240008d0:0x7f5424101000 result 0:0 match len 512 </pre> <pre>2014-01-27 19:40:41-0500 39281 [22751]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids </pre> <pre>2014-01-27 19:40:42-0500 39282 [3803]: s13 add_lockspace fail result -90 </pre> <pre>2014-01-27 19:40:47-0500 39287 [3803]: s14 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0 </pre> <pre>2014-01-27 19:40:47-0500 39287 [22795]: 0322a407 aio collect 0 0x7f54240008c0:0x7f54240008d0:0x7f5424101000 result 0:0 match len 512 </pre> <pre>2014-01-27 19:40:47-0500 39287 [22795]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids </pre> <pre>2014-01-27 19:40:48-0500 39288 [3803]: s14 add_lockspace fail result -90 </pre> <pre>2014-01-27 19:40:56-0500 39296 [3802]: s15 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0 </pre> <pre>2014-01-27 19:40:56-0500 39296 [22866]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids </pre> <pre>2014-01-27 19:40:56-0500 39296 [22866]: leader1 delta_acquire_begin error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2 </pre> <pre>2014-01-27 19:40:56-0500 39296 [22866]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0 </pre> <pre>2014-01-27 19:40:56-0500 39296 [22866]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0 </pre> <pre>2014-01-27 19:40:56-0500 39296 [22866]: leader4 sn rn ts 0 cs 0 </pre> <pre>2014-01-27 19:40:57-0500 39297 [3802]: s15 add_lockspace fail result -223 </pre> <pre>2014-01-27 19:40:57-0500 39297 [3802]: s16 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0 </pre> <pre>2014-01-27 19:40:57-0500 39297 [22870]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids </pre> <pre>2014-01-27 19:40:57-0500 39297 [22870]: leader1 delta_acquire_begin error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2 </pre> <pre>2014-01-27 19:40:57-0500 39297 [22870]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0 </pre> <pre>2014-01-27 19:40:57-0500 39297 [22870]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0 </pre> <pre>2014-01-27 19:40:57-0500 39297 [22870]: leader4 sn rn ts 0 cs 0 </pre> <pre>2014-01-27 19:40:58-0500 39298 [3802]: s16 add_lockspace fail result -223 </pre> <pre>2014-01-27 19:41:07-0500 39307 [3802]: s17 lockspace 
0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0 </pre> </blockquote> Unfortunately, I think the error looks about the same to vdsm, because /var/log/messages shows the same two lines in the calling scripts on the callback lists (66 & 425, if I remember right).<br> <br> When I get up in the morning, I will be looking for another magic potion from your pen. :)<br> <br> <blockquote cite="mid:1916841992.8207492.1390859170488.JavaMail.root@redhat.com" type="cite"> <blockquote type="cite"> <pre wrap="">Federico, I won't be able to do anything to the ovirt setup for another 5 hours or so (it is a trial system I am working on at home, I am at work), but I will try your repair script and report back. In bugzilla 862975 they suggested turning off write-behind caching and "eager locking" on the gluster volume to avoid/reduce the problems that come from many different computers all writing to the same file(s) on a very frequent basis. If I interpret the comment in the bug correctly, it did seem to help in that situation. My situation is a little different. My gluster setup is replicate only, replica 3 (though there are only two hosts). I was not stress-testing it, I was just using it, trying to figure out how I can import some old VMWare VMs without an ESXi server to run them on. </pre> </blockquote> <pre wrap=""> Have you done anything similar to what is described here in comment 21? <a class="moz-txt-link-freetext" href="https://bugzilla.redhat.com/show_bug.cgi?id=859589#c21">https://bugzilla.redhat.com/show_bug.cgi?id=859589#c21</a></pre> </blockquote> <br> I have no idea, because that bug is so dangerous that the general public is not allowed to even see it. I get an "You are not authorized to access ...." message. :) However, see just below.<br> <br> <blockquote cite="mid:1916841992.8207492.1390859170488.JavaMail.root@redhat.com" type="cite"> <pre wrap="">When did you realize that you weren't able to use the data-center anymore? Can you describe exactly what you did and what happened, for example: 1. I created the data center (up and running) 2. I tried to import some VMs from VMWare 3. During the import (or after it) the data-center went in the contending state ... Did something special happened? </pre> </blockquote> <br> It took me a few hours of going back through and matching times on logs to figure it out, as it has been a week since it happened. When I saw some things in the logs, I finally remembered the circumstances. I will summarize it for you, because I think this is what matters. If you want actual logs, tell me which logs and for which events, and I will package them up for you.<br> <br> The host suspended all VMs because gluster hit the 90% full limit. It took me a while to figure out what was going on. I expanded the logical volume, but the Data Center would not come back up.<br> <br> The GUI events log says: <br> <blockquote> <pre>Invalid status on Data Center mill. 
> Have you done anything similar to what is described here in comment 21?
> https://bugzilla.redhat.com/show_bug.cgi?id=859589#c21

I have no idea, because that bug is so dangerous that the general public is not allowed to even see it. I get a "You are not authorized to access ...." message. :) However, see just below.

> When did you realize that you weren't able to use the data-center anymore?
> Can you describe exactly what you did and what happened, for example:
>
> 1. I created the data center (up and running)
> 2. I tried to import some VMs from VMware
> 3. During the import (or after it) the data-center went into the contending state
> ...
>
> Did something special happen?

It took me a few hours of going back through and matching times on logs to figure it out, as it has been a week since it happened. When I saw some things in the logs, I finally remembered the circumstances. I will summarize it for you, because I think this is what matters. If you want actual logs, tell me which logs and for which events, and I will package them up for you.

The host suspended all VMs because gluster hit the 90% full limit. It took me a while to figure out what was going on. I expanded the logical volume, but the Data Center would not come back up.
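For completeness, "expanded the logical volume" amounted to something like the following (device and mount point names here are placeholders, not my real ones; I also assume the 90% figure is gluster's default 10% reserve, cluster.min-free-disk, but I have not verified that):

    # see how full the filesystem under the gluster brick is
    df -h /bricks/vm2

    # grow the LV backing the brick and resize the filesystem in one step
    # (-r runs the matching filesystem resize after extending the LV)
    lvextend -r -L +50G /dev/vg_bricks/lv_vm2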
The GUI events log says:

Invalid status on Data Center mill. Setting Data Center status to Non Responsive (On host office4a, Error: Network error during communications with the Host.).

The host /var/log/messages says:

Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: leader1 delta_renew error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: s123 delta_renew verify_leader error -223
Jan 22 00:44:48 office4a sanlock[3362]: 2014-01-22 00:44:48-0500 521389 [30933]: s123 renewal error -223 delta_length 0 last_success 521369
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [3362]: s123 check_our_lease corrupt -223
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [3362]: s123 kill 32599 sig 15 count 1
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader1 delta_renew error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: s123 delta_renew verify_leader error -223
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: s123 renewal error -223 delta_length 0 last_success 521369
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader1 delta_renew error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: s123 delta_renew verify_leader error -223
Jan 22 00:44:49 office4a sanlock[3362]: 2014-01-22 00:44:49-0500 521390 [30933]: s123 renewal error -223 delta_length 0 last_success 521369
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [3362]: s123 kill 32599 sig 15 count 2
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: leader1 delta_renew error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:50 office4a sanlock[3362]: 2014-01-22 00:44:50-0500 521391 [30933]: s123 delta_renew verify_leader error -223
[snip many repeats]
Jan 22 00:44:57 office4a sanlock[3362]: 2014-01-22 00:44:57-0500 521398 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:57 office4a sanlock[3362]: 2014-01-22 00:44:57-0500 521398 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:57 office4a sanlock[3362]: 2014-01-22 00:44:57-0500 521398 [30933]: s123 delta_renew verify_leader error -223
Jan 22 00:44:57 office4a sanlock[3362]: 2014-01-22 00:44:57-0500 521398 [30933]: s123 renewal error -223 delta_length 0 last_success 521369
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [3362]: s123 kill 32599 sig 15 count 10
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [3362]: dead 32599 ci 4 count 10
Jan 22 00:44:58 office4a respawn: slave '/usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid' died, respawning slave
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [3362]: s123 all pids clear
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: verify_leader 2 wrong magic 0 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: leader1 delta_renew error -223 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c host_id 2
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: leader2 path /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids offset 0
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: leader4 sn rn ts 0 cs 0
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: s123 delta_renew verify_leader error -223
Jan 22 00:44:58 office4a sanlock[3362]: 2014-01-22 00:44:58-0500 521399 [30933]: s123 renewal error -223 delta_length 0 last_success 521369
Jan 22 00:44:59 office4a vdsm vds WARNING Unable to load the json rpc server module. Please make sure it is installed.
Jan 22 00:44:59 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel0', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '1'}}' found
Jan 22 00:44:59 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel1', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '2'}}' found
Jan 22 00:44:59 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'spicevmc', 'alias': 'channel2', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '3'}}' found
Jan 22 00:45:00 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::_readPauseCode unsupported by libvirt vm
Jan 22 00:45:00 office4a kernel: ata1: hard resetting link
Jan 22 00:45:00 office4a kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jan 22 00:45:00 office4a kernel: ata1.00: configured for UDMA/133
Jan 22 00:45:00 office4a kernel: ata1: EH complete
Jan 22 00:45:01 office4a kernel: ata2: hard resetting link
Jan 22 00:45:01 office4a kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 22 00:45:01 office4a kernel: ata2.00: configured for UDMA/133
Jan 22 00:45:01 office4a kernel: ata2: EH complete
Jan 22 00:45:02 office4a kernel: ata3: hard resetting link
Jan 22 00:45:02 office4a kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 22 00:45:02 office4a kernel: ata3.00: configured for UDMA/133
Jan 22 00:45:02 office4a kernel: ata3: EH complete
Jan 22 00:45:03 office4a kernel: ata4: hard resetting link
Jan 22 00:45:03 office4a kernel: ata4: SATA link down (SStatus 0 SControl 300)
Jan 22 00:45:03 office4a kernel: ata4: EH complete
Jan 22 00:45:04 office4a kernel: ata5: soft resetting link
Jan 22 00:45:04 office4a kernel: ata5.00: configured for UDMA/33
Jan 22 00:45:04 office4a kernel: ata5: EH complete
Jan 22 00:45:05 office4a kernel: ata6: soft resetting link
Jan 22 00:45:05 office4a kernel: ata6: EH complete
Jan 22 00:45:08 office4a kernel: ata1: hard resetting link
Jan 22 00:45:08 office4a kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jan 22 00:45:08 office4a kernel: ata1.00: configured for UDMA/133
Jan 22 00:45:08 office4a kernel: ata1: EH complete
Jan 22 00:45:09 office4a kernel: ata2: hard resetting link
Jan 22 00:45:09 office4a kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 22 00:45:09 office4a kernel: ata2.00: configured for UDMA/133
Jan 22 00:45:09 office4a kernel: ata2: EH complete
Jan 22 00:45:10 office4a kernel: ata3: hard resetting link
Jan 22 00:45:10 office4a kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 22 00:45:10 office4a kernel: ata3.00: configured for UDMA/133
Jan 22 00:45:10 office4a kernel: ata3: EH complete
Jan 22 00:45:11 office4a kernel: ata4: hard resetting link
Jan 22 00:45:11 office4a kernel: ata4: SATA link down (SStatus 0 SControl 300)
Jan 22 00:45:11 office4a kernel: ata4: EH complete
Jan 22 00:45:12 office4a kernel: ata5: soft resetting link
Jan 22 00:45:12 office4a kernel: ata5.00: configured for UDMA/33
Jan 22 00:45:12 office4a kernel: ata5: EH complete
Jan 22 00:45:13 office4a kernel: ata6: soft resetting link
Jan 22 00:45:13 office4a kernel: ata6: EH complete
Jan 22 00:45:15 office4a sanlock[3362]: 2014-01-22 00:45:15-0500 521416 [17150]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:16 office4a sanlock[3362]: 2014-01-22 00:45:16-0500 521417 [3367]: s124 add_lockspace fail result -90
Jan 22 00:45:20 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel0', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '1'}}' found
Jan 22 00:45:20 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel1', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '2'}}' found
Jan 22 00:45:20 office4a vdsm vm.Vm WARNING vmId=`d2983edf-7a76-438f-9ea3-6e6dd3d6f54b`::Unknown type found, device: '{'device': 'spicevmc', 'alias': 'channel2', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '3'}}' found
Jan 22 00:45:21 office4a sanlock[3362]: 2014-01-22 00:45:21-0500 521422 [17199]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:22 office4a sanlock[3362]: 2014-01-22 00:45:22-0500 521423 [3368]: s125 add_lockspace fail result -90
Jan 22 00:45:22 office4a vdsm TaskManager.Task ERROR Task=`2fefe6f9-4dac-45f2-b237-0104ce890146`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/storage/task.py", line 318, in run#012 return self.cmd(*self.argslist, **self.argsdict)#012 File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm#012 self.masterDomain.acquireHostId(self.id)#012 File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId#012 self._clusterLock.acquireHostId(hostId, async)#012 File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId#012 raise se.AcquireHostIdFailure(self._sdUUID, e)#012AcquireHostIdFailure: Cannot acquire host id: ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(90, 'Sanlock lockspace add failure', 'Message too long'))
Jan 22 00:45:25 office4a sanlock[3362]: 2014-01-22 00:45:25-0500 521426 [17243]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:26 office4a sanlock[3362]: 2014-01-22 00:45:26-0500 521427 [3368]: s126 add_lockspace fail result -90
Jan 22 00:45:33 office4a vdsm TaskManager.Task ERROR Task=`8f8931ba-a9f6-4225-a951-2d608b3b1d70`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 2110, in getAllTasksStatuses#012 allTasksStatus = sp.getAllTasksStatuses()#012 File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper#012 raise SecureError()#012SecureError
Jan 22 00:45:36 office4a sanlock[3362]: 2014-01-22 00:45:36-0500 521437 [17334]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:37 office4a sanlock[3362]: 2014-01-22 00:45:37-0500 521438 [3368]: s127 add_lockspace fail result -90
Jan 22 00:45:37 office4a vdsm TaskManager.Task ERROR Task=`6b7da394-63b6-4e52-ac73-ab98f3f1e60f`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/storage/task.py", line 318, in run#012 return self.cmd(*self.argslist, **self.argsdict)#012 File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm#012 self.masterDomain.acquireHostId(self.id)#012 File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId#012 self._clusterLock.acquireHostId(hostId, async)#012 File "/usr/share/vdsm/storage/clusterlock.py", line 185, in acquireHostId#012 raise se.AcquireHostIdFailure(self._sdUUID, e)#012AcquireHostIdFailure: Cannot acquire host id: ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(115, 'Sanlock lockspace add failure', 'Operation now in progress'))
Jan 22 00:45:46 office4a sanlock[3362]: 2014-01-22 00:45:46-0500 521447 [17410]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:47 office4a sanlock[3362]: 2014-01-22 00:45:47-0500 521448 [3367]: s128 add_lockspace fail result -90
Jan 22 00:45:48 office4a vdsm TaskManager.Task ERROR Task=`c2486380-a0c8-45d1-a63f-e0e660f1e118`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 857, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 2110, in getAllTasksStatuses#012 allTasksStatus = sp.getAllTasksStatuses()#012 File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper#012 raise SecureError()#012SecureError
Jan 22 00:45:50 office4a sanlock[3362]: 2014-01-22 00:45:50-0500 521451 [17457]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Jan 22 00:45:51 office4a sanlock[3362]: 2014-01-22 00:45:51-0500 521452 [3368]: s129 add_lockspace fail result -90
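Since every one of those -90 ("Message too long") failures comes from reading the ids file, the checks I would run against that file are roughly these (fuse-mount path copied from the log; the brick path is a placeholder for wherever the brick really lives):

    # size and timestamps of the ids file through the gluster mount; sanlock
    # needs to read whole 512-byte sectors from it, so a truncated or empty
    # file would explain the short-read (-90) errors
    ls -l /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids

    # compare with the copy stored on each replica brick (placeholder path)
    ls -l /bricks/vm2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids

    # check for files pending self-heal or in split-brain on the volume
    gluster volume heal VM2 info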
>> I am guessing that what makes cluster storage have the (Master) designation is that this is the one that actually contains the sanlocks? If so, would it make sense to set up a gluster volume to be (Master), but not use it for VM storage, just for storing the sanlock info? Separate gluster volume(s) could then have the VMs on it (them), and would not need the optimizations turned off.

> Any domain must be able to become the master at any time. Without a master the data center is unusable (at the present time); that's why we migrate (or reconstruct) it onto another domain when necessary.

Ted Miller