[Users] Data Center stuck between "Non Responsive" and "Contending"

Itamar Heim iheim at redhat.com
Sun Jan 26 17:17:04 EST 2014


On 01/27/2014 12:00 AM, Ted Miller wrote:
>
> On 1/26/2014 4:00 PM, Itamar Heim wrote:
>> On 01/26/2014 10:51 PM, Ted Miller wrote:
>>>
>>> On 1/26/2014 3:10 PM, Itamar Heim wrote:
>>>> On 01/26/2014 10:08 PM, Ted Miller wrote:
>>>>> My Data Center is down, and won't come back up.
>>>>>
>>>>> Data Center Status on the GUI flips between "Non Responsive" and
>>>>> "Contending"
>>>>>
>>>>> Also noted:
>>>>> Host sometimes seen flipping between "Low" and "Contending" in SPM
>>>>> column.
>>>>> Storage VM2 "Data (Master)" is in "Cross Data-Center Status" = Unknown
>>>>> VM2 is "up" under "Volumes" tab
>>>>>
>>>>> Created another volume for VM storage.  It shows up in the "Volumes" tab,
>>>>> but when I try to add a "New Domain" in the Storage tab, it says "There
>>>>> are No Data Centers to which the Storage Domain can be attached".
>>>>>
>>>>> Setup:
>>>>> 2 hosts w/ glusterfs storage
>>>>> 1 engine
>>>>> all 3 computers Centos 6.5, just updated
>>>>> ovirt-engine                       3.3.0.1-1.el6
>>>>> ovirt-engine-lib                 3.3.2-1.el6
>>>>> ovirt-host-deploy.noarch  1.1.3-1.el6
>>>>> glusterfs.x86_64               3.4.2-1.el6
>>>>>
>>>>> This loop seems to repeat in the ovirt-engine log (grep of the log,
>>>>> showing only the DefaultQuartzScheduler_Worker-79 thread):
>>>>>
>>>>> 2014-01-26 14:44:58,416 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) Irs placed on server
>>>>> 9a591103-83be-4ca9-b207-06929223b541 failed. Proceed Failover
>>>>> 2014-01-26 14:44:58,511 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) hostFromVds::selectedVds -
>>>>> office4a,
>>>>> spmStatus Free, storage pool mill
>>>>> 2014-01-26 14:44:58,550 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) SpmStatus on vds
>>>>> 127ed939-34af-41a8-87a0-e2f6174b1877: Free
>>>>> 2014-01-26 14:44:58,571 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) starting spm on vds office4a,
>>>>> storage
>>>>> pool mill, prevId 2, LVER 15
>>>>> 2014-01-26 14:44:58,579 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) START,
>>>>> SpmStartVDSCommand(HostName =
>>>>> office4a, HostId = 127ed939-34af-41a8-87a0-e2f6174b1877,
>>>>> storagePoolId =
>>>>> 536a864d-83aa-473a-a675-e38aafdd9071, prevId=2, prevLVER=15,
>>>>> storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log
>>>>> id: 74c38eb7
>>>>> 2014-01-26 14:44:58,617 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) spmStart polling started: taskId =
>>>>> e8986753-fc80-4b11-a11d-6d3470b1728c
>>>>> 2014-01-26 14:45:00,662 ERROR
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) Failed in HSMGetTaskStatusVDS
>>>>> method
>>>>> 2014-01-26 14:45:00,664 ERROR
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) Error code AcquireHostIdFailure and
>>>>> error message VDSGenericException: VDSErrorException: Failed to
>>>>> HSMGetTaskStatusVDS, error = Cannot acquire host id
>>>>> 2014-01-26 14:45:00,665 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) spmStart polling ended: taskId =
>>>>> e8986753-fc80-4b11-a11d-6d3470b1728c task status = finished
>>>>> 2014-01-26 14:45:00,666 ERROR
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) Start SPM Task failed - result:
>>>>> cleanSuccess, message: VDSGenericException: VDSErrorException:
>>>>> Failed to
>>>>> HSMGetTaskStatusVDS, error = Cannot acquire host id
>>>>> 2014-01-26 14:45:00,695 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) spmStart polling ended, spm
>>>>> status: Free
>>>>> 2014-01-26 14:45:00,702 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) START,
>>>>> HSMClearTaskVDSCommand(HostName = office4a, HostId =
>>>>> 127ed939-34af-41a8-87a0-e2f6174b1877,
>>>>> taskId=e8986753-fc80-4b11-a11d-6d3470b1728c), log id: 336ec5a6
>>>>> 2014-01-26 14:45:00,722 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) FINISH, HSMClearTaskVDSCommand, log
>>>>> id: 336ec5a6
>>>>> 2014-01-26 14:45:00,724 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) FINISH, SpmStartVDSCommand, return:
>>>>> org.ovirt.engine.core.common.businessentities.SpmStatusResult@13652652,
>>>>> log id: 74c38eb7
>>>>> 2014-01-26 14:45:00,733 INFO
>>>>> [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand]
>>>>> (DefaultQuartzScheduler_Worker-79) Running command:
>>>>> SetStoragePoolStatusCommand internal: true. Entities affected : ID:
>>>>> 536a864d-83aa-473a-a675-e38aafdd9071 Type: StoragePool
>>>>> 2014-01-26 14:45:00,778 ERROR
>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>> (DefaultQuartzScheduler_Worker-79)
>>>>> IrsBroker::Failed::GetStoragePoolInfoVDS due to:
>>>>> IrsSpmStartFailedException: IRSGenericException: IRSErrorException:
>>>>> SpmStart failed
>>>>>
>>>>> Ted Miller
>>>>> Elkhart, IN, USA
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list
>>>>> Users at ovirt.org
>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>
>>>>
>>>> is this gluster storage? (guessing, since you mentioned a 'volume')
>>> yes (mentioned under "setup" above)
>>>> does it have a quorum?
>>> Volume Name: VM2
>>> Type: Replicate
>>> Volume ID: 7bea8d3b-ec2a-4939-8da8-a82e6bda841e
>>> Status: Started
>>> Number of Bricks: 1 x 3 = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: 10.41.65.2:/bricks/01/VM2
>>> Brick2: 10.41.65.4:/bricks/01/VM2
>>> Brick3: 10.41.65.4:/bricks/101/VM2
>>> Options Reconfigured:
>>> cluster.server-quorum-type: server
>>> storage.owner-gid: 36
>>> storage.owner-uid: 36
>>> auth.allow: *
>>> user.cifs: off
>>> nfs.disa
>>>> (there were reports of split brain on the domain metadata before when
>>>> no quorum exist for gluster)
>>> after full heal:
>>>
>>> [root@office4a ~]$ gluster volume heal VM2 info
>>> Gathering Heal info on volume VM2 has been successful
>>>
>>> Brick 10.41.65.2:/bricks/01/VM2
>>> Number of entries: 0
>>>
>>> Brick 10.41.65.4:/bricks/01/VM2
>>> Number of entries: 0
>>>
>>> Brick 10.41.65.4:/bricks/101/VM2
>>> Number of entries: 0
>>> [root@office4a ~]$ gluster volume heal VM2 info split-brain
>>> Gathering Heal info on volume VM2 has been successful
>>>
>>> Brick 10.41.65.2:/bricks/01/VM2
>>> Number of entries: 0
>>>
>>> Brick 10.41.65.4:/bricks/01/VM2
>>> Number of entries: 0
>>>
>>> Brick 10.41.65.4:/bricks/101/VM2
>>> Number of entries: 0
>>>
>>> noticed this in host /var/log/messages (while looking for something
>>> else).  Loop seems to repeat over and over.
>>>
>>> Jan 26 15:35:52 office4a sanlock[3763]: 2014-01-26 15:35:52-0500 14678
>>> [30419]: read_sectors delta_leader offset 512 rv -90
>>> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>>>
>>>
>>> Jan 26 15:35:53 office4a sanlock[3763]: 2014-01-26 15:35:53-0500 14679
>>> [3771]: s1997 add_lockspace fail result -90
>>> Jan 26 15:35:58 office4a vdsm TaskManager.Task ERROR
>>> Task=`89885661-88eb-4ea3-8793-00438735e4ab`::Unexpected error
>>> Traceback (most recent call last):
>>>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>>>     return fn(*args, **kargs)
>>>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>>>     res = f(*args, **kwargs)
>>>   File "/usr/share/vdsm/storage/hsm.py", line 2111, in getAllTasksStatuses
>>>     allTasksStatus = sp.getAllTasksStatuses()
>>>   File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper
>>>     raise SecureError()
>>> SecureError
>>> Jan 26 15:35:59 office4a sanlock[3763]: 2014-01-26 15:35:59-0500 14686
>>> [30495]: read_sectors delta_leader offset 512 rv -90
>>> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>>>
>>>
>>> Jan 26 15:36:00 office4a sanlock[3763]: 2014-01-26 15:36:00-0500 14687
>>> [3772]: s1998 add_lockspace fail result -90
>>> Jan 26 15:36:00 office4a vdsm TaskManager.Task ERROR
>>> Task=`8db9ff1a-2894-407a-915a-279f6a7eb205`::Unexpected error
>>> Traceback (most recent call last):
>>>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>>>     return fn(*args, **kargs)
>>>   File "/usr/share/vdsm/storage/task.py", line 318, in run
>>>     return self.cmd(*self.argslist, **self.argsdict)
>>>   File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm
>>>     self.masterDomain.acquireHostId(self.id)
>>>   File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
>>>     self._clusterLock.acquireHostId(hostId, async)
>>>   File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
>>>     raise se.AcquireHostIdFailure(self._sdUUID, e)
>>> AcquireHostIdFailure: Cannot acquire host id:
>>> ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(90, 'Sanlock
>>> lockspace add failure', 'Message too long'))

fede - thoughts on above?
(vojtech reported something similar, but it sorted itself out for him after
some retries)
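
Incidentally, the -90 in the sanlock lines quoted above is a negated errno
value; on Linux, errno 90 is EMSGSIZE ("Message too long"), which matches the
strerror text carried inside the SanlockException in the vdsm traceback. A
quick way to decode it, sketched in Python:

```python
import errno
import os

# sanlock logs failures as negated errno values, so rv -90 means errno 90
rv = -90
print(errno.errorcode[-rv])  # 'EMSGSIZE' on Linux
print(os.strerror(-rv))      # 'Message too long' on Linux
```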

>>>
>>> Ted Miller
>>> Elkhart, IN, USA
>>>
>>
>> this is the new storage domain? what about the previous volume for the
>> first SD?
>
> The default/default data center/cluster had to be abandoned because of a
> split-brain that could not be healed.  I can't remove the old storage from
> the database, and can't get the data center up due to the corrupt storage,
> so it ends up a circular argument.
>
> I started over with same hosts, totally new storage in new data center.
> This mill/one data center/cluster was working fine with VM2 storage,
> then died.
>
> Ted Miller
>
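
For anyone tracing the AcquireHostIdFailure path in the vdsm traceback earlier
in the thread: here is a minimal, self-contained sketch of the wrap-and-reraise
pattern it shows. These are stand-in classes, not vdsm's or sanlock's actual
implementations; the names and values just mirror what appears in the log.

```python
# Illustrative sketch only: stand-ins for the sanlock binding's exception
# (args are errno, message, strerror) and vdsm's storage exception.

class SanlockException(Exception):
    """Stand-in for the sanlock python binding's exception."""

class AcquireHostIdFailure(Exception):
    """Stand-in for vdsm's AcquireHostIdFailure storage error."""

def acquire_host_id(sd_uuid, host_id, add_lockspace):
    try:
        add_lockspace(sd_uuid, host_id)
    except SanlockException as e:
        # Re-raise with the storage-domain UUID attached, producing the
        # "Cannot acquire host id: ('<uuid>', SanlockException(90, ...))"
        # shape seen in the quoted /var/log/messages output.
        raise AcquireHostIdFailure(sd_uuid, e)

def failing_add_lockspace(sd_uuid, host_id):
    # Simulates the failure in the log: -90 is EMSGSIZE, "Message too long"
    raise SanlockException(90, 'Sanlock lockspace add failure',
                           'Message too long')

try:
    acquire_host_id('0322a407-2b16-40dc-ac67-13d387c6eb4c', 2,
                    failing_add_lockspace)
except AcquireHostIdFailure as e:
    print('Cannot acquire host id:', e.args)
```

The point of the pattern is that the low-level lock error keeps its errno while
gaining the storage-domain UUID, so the engine log and the host's syslog can be
correlated by that UUID.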


