On 01/26/2014 10:51 PM, Ted Miller wrote:
>
> On 1/26/2014 3:10 PM, Itamar Heim wrote:
>> On 01/26/2014 10:08 PM, Ted Miller wrote:
>>> My Data Center is down, and won't come back up.
>>>
>>> Data Center Status on the GUI flips between "Non Responsive" and
>>> "Contending"
>>>
>>> Also noted:
>>>      Host sometimes seen flipping between "Low" and "Contending" in SPM column.
>>>      Storage VM2 "Data (Master)" is in "Cross Data-Center Status" = Unknown
>>>      VM2 is "up" under "Volumes" tab
>>>
>>> Created another volume for VM storage. It shows up in the "Volumes" tab,
>>> but when I try to add a "New Domain" in the Storage tab, it says that "There
>>> are No Data Centers to which the Storage Domain can be attached".
>>>
>>> Setup:
>>> 2 hosts w/ glusterfs storage
>>> 1 engine
>>> all 3 computers Centos 6.5, just updated
>>> ovirt-engine 3.3.0.1-1.el6
>>> ovirt-engine-lib 3.3.2-1.el6
>>> ovirt-host-deploy.noarch 1.1.3-1.el6
>>> glusterfs.x86_64 3.4.2-1.el6
>>>
>>> This loop seems to repeat in the ovirt-engine log (grep of the log showing
>>> only the DefaultQuartzScheduler_Worker-79 thread):
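>>> (Roughly: grep 'DefaultQuartzScheduler_Worker-79' /var/log/ovirt-engine/engine.log,
>>> assuming the default engine log location; adjust the path if yours differs.)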
>>>
>>> 2014-01-26 14:44:58,416 INFO
>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>> (DefaultQuartzScheduler_Worker-79) Irs placed on server
>>> 9a591103-83be-4ca9-b207-06929223b541 failed. Proceed Failover
>>> 2014-01-26 14:44:58,511 INFO
>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>> (DefaultQuartzScheduler_Worker-79) hostFromVds::selectedVds - office4a,
>>> spmStatus Free, storage pool mill
>>> 2014-01-26 14:44:58,550 INFO
>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>> (DefaultQuartzScheduler_Worker-79) SpmStatus on vds
>>> 127ed939-34af-41a8-87a0-e2f6174b1877: Free
>>> 2014-01-26 14:44:58,571 INFO
>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>> (DefaultQuartzScheduler_Worker-79) starting spm on vds office4a, storage
>>> pool mill, prevId 2, LVER 15
>>> 2014-01-26 14:44:58,579 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>> (DefaultQuartzScheduler_Worker-79) START, SpmStartVDSCommand(HostName =
>>> office4a, HostId = 127ed939-34af-41a8-87a0-e2f6174b1877, storagePoolId =
>>> 536a864d-83aa-473a-a675-e38aafdd9071, prevId=2, prevLVER=15,
>>> storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log
>>> id: 74c38eb7
>>> 2014-01-26 14:44:58,617 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>> (DefaultQuartzScheduler_Worker-79) spmStart polling started: taskId =
>>> e8986753-fc80-4b11-a11d-6d3470b1728c
>>> 2014-01-26 14:45:00,662 ERROR
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
>>> (DefaultQuartzScheduler_Worker-79) Failed in HSMGetTaskStatusVDS method
>>> 2014-01-26 14:45:00,664 ERROR
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
>>> (DefaultQuartzScheduler_Worker-79) Error code AcquireHostIdFailure and
>>> error message VDSGenericException: VDSErrorException: Failed to
>>> HSMGetTaskStatusVDS, error = Cannot acquire host id
>>> 2014-01-26 14:45:00,665 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>> (DefaultQuartzScheduler_Worker-79) spmStart polling ended: taskId =
>>> e8986753-fc80-4b11-a11d-6d3470b1728c task status = finished
>>> 2014-01-26 14:45:00,666 ERROR
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>> (DefaultQuartzScheduler_Worker-79) Start SPM Task failed - result:
>>> cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to
>>> HSMGetTaskStatusVDS, error = Cannot acquire host id
>>> 2014-01-26 14:45:00,695 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>> (DefaultQuartzScheduler_Worker-79) spmStart polling ended, spm
>>> status: Free
>>> 2014-01-26 14:45:00,702 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
>>> (DefaultQuartzScheduler_Worker-79) START,
>>> HSMClearTaskVDSCommand(HostName = office4a, HostId =
>>> 127ed939-34af-41a8-87a0-e2f6174b1877,
>>> taskId=e8986753-fc80-4b11-a11d-6d3470b1728c), log id: 336ec5a6
>>> 2014-01-26 14:45:00,722 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
>>> (DefaultQuartzScheduler_Worker-79) FINISH, HSMClearTaskVDSCommand, log
>>> id: 336ec5a6
>>> 2014-01-26 14:45:00,724 INFO
>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>> (DefaultQuartzScheduler_Worker-79) FINISH, SpmStartVDSCommand, return:
>>> org.ovirt.engine.core.common.businessentities.SpmStatusResult@13652652,
>>> log id: 74c38eb7
>>> 2014-01-26 14:45:00,733 INFO
>>> [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand]
>>> (DefaultQuartzScheduler_Worker-79) Running command:
>>> SetStoragePoolStatusCommand internal: true. Entities affected : ID:
>>> 536a864d-83aa-473a-a675-e38aafdd9071 Type: StoragePool
>>> 2014-01-26 14:45:00,778 ERROR
>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>> (DefaultQuartzScheduler_Worker-79)
>>> IrsBroker::Failed::GetStoragePoolInfoVDS due to:
>>> IrsSpmStartFailedException: IRSGenericException: IRSErrorException:
>>> SpmStart failed
>>>
>>> Ted Miller
>>> Elkhart, IN, USA
>>>
>>>
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users@ovirt.org
>>> http://lists.ovirt.org/mailman/listinfo/users
>>>
>>
>> is this gluster storage (guessing since you mentioned a 'volume')?
> yes (mentioned under "setup" above)
>> does it have a quorum?
> Volume Name: VM2
> Type: Replicate
> Volume ID: 7bea8d3b-ec2a-4939-8da8-a82e6bda841e
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: 10.41.65.2:/bricks/01/VM2
> Brick2: 10.41.65.4:/bricks/01/VM2
> Brick3: 10.41.65.4:/bricks/101/VM2
> Options Reconfigured:
> cluster.server-quorum-type: server
> storage.owner-gid: 36
> storage.owner-uid: 36
> auth.allow: *
> user.cifs: off
> nfs.disa
>> (there were reports of split brain on the domain metadata before when
>> no quorum exists for gluster)
> after full heal:
>
> [root@office4a ~]$ gluster volume heal VM2 info
> Gathering Heal info on volume VM2 has been successful
>
> Brick 10.41.65.2:/bricks/01/VM2
> Number of entries: 0
>
> Brick 10.41.65.4:/bricks/01/VM2
> Number of entries: 0
>
> Brick 10.41.65.4:/bricks/101/VM2
> Number of entries: 0
> [root@office4a ~]$ gluster volume heal VM2 info split-brain
> Gathering Heal info on volume VM2 has been successful
>
> Brick 10.41.65.2:/bricks/01/VM2
> Number of entries: 0
>
> Brick 10.41.65.4:/bricks/01/VM2
> Number of entries: 0
>
> Brick 10.41.65.4:/bricks/101/VM2
> Number of entries: 0
>
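> Since cluster.server-quorum-type is set to server, quorum here depends on the
> gluster peers staying connected. To double-check that (just the commands I would
> run to confirm quorum is met; output not pasted here):
>
>     gluster peer status
>     gluster volume status VM2
>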
> noticed this in host /var/log/messages (while looking for something
> else). Loop seems to repeat over and over.
>
> Jan 26 15:35:52 office4a sanlock[3763]: 2014-01-26 15:35:52-0500 14678
> [30419]: read_sectors delta_leader offset 512 rv -90
>
> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>
> Jan 26 15:35:53 office4a sanlock[3763]: 2014-01-26 15:35:53-0500 14679
> [3771]: s1997 add_lockspace fail result -90
> Jan 26 15:35:58 office4a vdsm TaskManager.Task ERROR
> Task=`89885661-88eb-4ea3-8793-00438735e4ab`::Unexpected error#012Traceback
> (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line
> 857, in _run#012 return fn(*args, **kargs)#012 File
> "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args,
> **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 2111, in
> getAllTasksStatuses#012 allTasksStatus = sp.getAllTasksStatuses()#012
> File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper#012 raise
> SecureError()#012SecureError
> Jan 26 15:35:59 office4a sanlock[3763]: 2014-01-26 15:35:59-0500 14686
> [30495]: read_sectors delta_leader offset 512 rv -90
>
> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>
> Jan 26 15:36:00 office4a sanlock[3763]: 2014-01-26 15:36:00-0500 14687
> [3772]: s1998 add_lockspace fail result -90
> Jan 26 15:36:00 office4a vdsm TaskManager.Task ERROR
> Task=`8db9ff1a-2894-407a-915a-279f6a7eb205`::Unexpected error#012Traceback
> (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line
> 857, in _run#012 return fn(*args, **kargs)#012 File
> "/usr/share/vdsm/storage/task.py", line 318, in run#012 return
> self.cmd(*self.argslist, **self.argsdict)#012 File
> "/usr/share/vdsm/storage/sp.py", line 273, in startSpm#012
> self.masterDomain.acquireHostId(self.id)#012 File
> "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId#012
> self._clusterLock.acquireHostId(hostId, async)#012 File
> "/usr/share/vdsm/storage/clusterlock.py", line 189, in
> acquireHostId#012 raise se.AcquireHostIdFailure(self._sdUUID,
> e)#012AcquireHostIdFailure: Cannot acquire host id:
> ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(90, 'Sanlock
> lockspace add failure', 'Message too long'))
>
> Ted Miller
> Elkhart, IN, USA
>
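The sanlock lines above look like the root cause: read_sectors delta_leader rv -90
and the SanlockException(90, ... 'Message too long') both suggest sanlock is getting a
short read of the delta-lease leader (offset 512) from the ids file on the gluster
mount, which is why acquireHostId and therefore SpmStart keep failing. I may be
wrong, but inspecting that file directly should tell; the path is taken from your
log, and 'sanlock direct dump' is plain sanlock, nothing oVirt-specific:

    ls -l /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
    sanlock direct dump /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
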
this is the new storage domain? what about the previous volume for the first SD?

The default/default data center/cluster had to be abandoned because of a
split-brain that could not be healed. I can't remove the old storage from the
database and can't get the data center up due to the corrupt storage, so it ends
up a circular argument.

I started over with the same hosts and totally new storage in a new data center.
This mill/one data center/cluster was working fine with the VM2 storage, then died.
Ted Miller