[Users] Data Center stuck between "Non Responsive" and "Contending"
Ted Miller
tmiller at hcjb.org
Mon Jan 27 04:04:58 UTC 2014
On 1/26/2014 6:24 PM, Ted Miller wrote:
> On 1/26/2014 5:17 PM, Itamar Heim wrote:
>> On 01/27/2014 12:00 AM, Ted Miller wrote:
>>>
>>> On 1/26/2014 4:00 PM, Itamar Heim wrote:
>>>> On 01/26/2014 10:51 PM, Ted Miller wrote:
>>>>>
>>>>> On 1/26/2014 3:10 PM, Itamar Heim wrote:
>>>>>> On 01/26/2014 10:08 PM, Ted Miller wrote:
>>>>>>> My Data Center is down, and won't come back up.
>>>>>>>
>>>>>>> Data Center Status on the GUI flips between "Non Responsive" and
>>>>>>> "Contending"
>>>>>>>
>>>>>>> Also noted:
>>>>>>> Host sometimes seen flipping between "Low" and "Contending" in SPM
>>>>>>> column.
>>>>>>> Storage VM2 "Data (Master)" is in "Cross Data-Center Status" = Unknown
>>>>>>> VM2 is "up" under "Volumes" tab
>>>>>>>
>>>>>>> Created another volume for VM storage. It shows up in "volumes" tab,
>>>>>>> but when I try to add "New Domain" in storage tab, says that "There
>>>>>>> are
>>>>>>> No Data Centers to which the Storage Domain can be attached"
>>>>>>>
>>>>>>> Setup:
>>>>>>> 2 hosts w/ glusterfs storage
>>>>>>> 1 engine
>>>>>>> all 3 computers Centos 6.5, just updated
>>>>>>> ovirt-engine 3.3.0.1-1.el6
>>>>>>> ovirt-engine-lib 3.3.2-1.el6
>>>>>>> ovirt-host-deploy.noarch 1.1.3-1.el6
>>>>>>> glusterfs.x86_64 3.4.2-1.el6
>>>>>>>
>>>>>>> This loop seems to repeat in the ovirt-engine log (grep of the log,
>>>>>>> showing only the DefaultQuartzScheduler_Worker-79 thread):
>>>>>>>
>>>>>>> 2014-01-26 14:44:58,416 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) Irs placed on server
>>>>>>> 9a591103-83be-4ca9-b207-06929223b541 failed. Proceed Failover
>>>>>>> 2014-01-26 14:44:58,511 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) hostFromVds::selectedVds -
>>>>>>> office4a,
>>>>>>> spmStatus Free, storage pool mill
>>>>>>> 2014-01-26 14:44:58,550 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) SpmStatus on vds
>>>>>>> 127ed939-34af-41a8-87a0-e2f6174b1877: Free
>>>>>>> 2014-01-26 14:44:58,571 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) starting spm on vds office4a,
>>>>>>> storage
>>>>>>> pool mill, prevId 2, LVER 15
>>>>>>> 2014-01-26 14:44:58,579 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) START,
>>>>>>> SpmStartVDSCommand(HostName =
>>>>>>> office4a, HostId = 127ed939-34af-41a8-87a0-e2f6174b1877,
>>>>>>> storagePoolId =
>>>>>>> 536a864d-83aa-473a-a675-e38aafdd9071, prevId=2, prevLVER=15,
>>>>>>> storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log
>>>>>>> id: 74c38eb7
>>>>>>> 2014-01-26 14:44:58,617 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) spmStart polling started: taskId =
>>>>>>> e8986753-fc80-4b11-a11d-6d3470b1728c
>>>>>>> 2014-01-26 14:45:00,662 ERROR
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) Failed in HSMGetTaskStatusVDS
>>>>>>> method
>>>>>>> 2014-01-26 14:45:00,664 ERROR
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) Error code AcquireHostIdFailure and
>>>>>>> error message VDSGenericException: VDSErrorException: Failed to
>>>>>>> HSMGetTaskStatusVDS, error = Cannot acquire host id
>>>>>>> 2014-01-26 14:45:00,665 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) spmStart polling ended: taskId =
>>>>>>> e8986753-fc80-4b11-a11d-6d3470b1728c task status = finished
>>>>>>> 2014-01-26 14:45:00,666 ERROR
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) Start SPM Task failed - result:
>>>>>>> cleanSuccess, message: VDSGenericException: VDSErrorException:
>>>>>>> Failed to
>>>>>>> HSMGetTaskStatusVDS, error = Cannot acquire host id
>>>>>>> 2014-01-26 14:45:00,695 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) spmStart polling ended, spm
>>>>>>> status: Free
>>>>>>> 2014-01-26 14:45:00,702 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) START,
>>>>>>> HSMClearTaskVDSCommand(HostName = office4a, HostId =
>>>>>>> 127ed939-34af-41a8-87a0-e2f6174b1877,
>>>>>>> taskId=e8986753-fc80-4b11-a11d-6d3470b1728c), log id: 336ec5a6
>>>>>>> 2014-01-26 14:45:00,722 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) FINISH, HSMClearTaskVDSCommand, log
>>>>>>> id: 336ec5a6
>>>>>>> 2014-01-26 14:45:00,724 INFO
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) FINISH, SpmStartVDSCommand, return:
>>>>>>> org.ovirt.engine.core.common.businessentities.SpmStatusResult@13652652,
>>>>>>>
>>>>>>> log id: 74c38eb7
>>>>>>> 2014-01-26 14:45:00,733 INFO
>>>>>>> [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79) Running command:
>>>>>>> SetStoragePoolStatusCommand internal: true. Entities affected : ID:
>>>>>>> 536a864d-83aa-473a-a675-e38aafdd9071 Type: StoragePool
>>>>>>> 2014-01-26 14:45:00,778 ERROR
>>>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>>>> (DefaultQuartzScheduler_Worker-79)
>>>>>>> IrsBroker::Failed::GetStoragePoolInfoVDS due to:
>>>>>>> IrsSpmStartFailedException: IRSGenericException: IRSErrorException:
>>>>>>> SpmStart failed
>>>>>>>
>>>>>>> Ted Miller
>>>>>>> Elkhart, IN, USA
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Users mailing list
>>>>>>> Users at ovirt.org
>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>
>>>>>>
>>>>>> is this gluster storage? (guessing, since you mentioned a 'volume')
>>>>> yes (mentioned under "setup" above)
>>>>>> does it have a quorum?
>>>>> Volume Name: VM2
>>>>> Type: Replicate
>>>>> Volume ID: 7bea8d3b-ec2a-4939-8da8-a82e6bda841e
>>>>> Status: Started
>>>>> Number of Bricks: 1 x 3 = 3
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: 10.41.65.2:/bricks/01/VM2
>>>>> Brick2: 10.41.65.4:/bricks/01/VM2
>>>>> Brick3: 10.41.65.4:/bricks/101/VM2
>>>>> Options Reconfigured:
>>>>> cluster.server-quorum-type: server
>>>>> storage.owner-gid: 36
>>>>> storage.owner-uid: 36
>>>>> auth.allow: *
>>>>> user.cifs: off
>>>>> nfs.disa
>>>>>> (there were reports of split brain on the domain metadata before, when
>>>>>> no quorum exists for gluster)
>>>>> after full heal:
>>>>>
>>>>> [root at office4a ~]$ gluster volume heal VM2 info
>>>>> Gathering Heal info on volume VM2 has been successful
>>>>>
>>>>> Brick 10.41.65.2:/bricks/01/VM2
>>>>> Number of entries: 0
>>>>>
>>>>> Brick 10.41.65.4:/bricks/01/VM2
>>>>> Number of entries: 0
>>>>>
>>>>> Brick 10.41.65.4:/bricks/101/VM2
>>>>> Number of entries: 0
>>>>> [root at office4a ~]$ gluster volume heal VM2 info split-brain
>>>>> Gathering Heal info on volume VM2 has been successful
>>>>>
>>>>> Brick 10.41.65.2:/bricks/01/VM2
>>>>> Number of entries: 0
>>>>>
>>>>> Brick 10.41.65.4:/bricks/01/VM2
>>>>> Number of entries: 0
>>>>>
>>>>> Brick 10.41.65.4:/bricks/101/VM2
>>>>> Number of entries: 0
>>>>>
>>>>> Noticed this in the host's /var/log/messages (while looking for
>>>>> something else). The loop seems to repeat over and over.
>>>>>
>>>>> Jan 26 15:35:52 office4a sanlock[3763]: 2014-01-26 15:35:52-0500 14678
>>>>> [30419]: read_sectors delta_leader offset 512 rv -90
>>>>> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>>>>>
>>>>>
>>>>>
>>>>> Jan 26 15:35:53 office4a sanlock[3763]: 2014-01-26 15:35:53-0500 14679
>>>>> [3771]: s1997 add_lockspace fail result -90
>>>>> Jan 26 15:35:58 office4a vdsm TaskManager.Task ERROR
>>>>> Task=`89885661-88eb-4ea3-8793-00438735e4ab`::Unexpected
>>>>> error#012Traceback
>>>>> (most recent call last):#012 File "/usr/share/vdsm/storage/task.py",
>>>>> line
>>>>> 857, in _run#012 return fn(*args, **kargs)#012 File
>>>>> "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args,
>>>>> **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 2111, in
>>>>> getAllTasksStatuses#012 allTasksStatus = sp.getAllTasksStatuses()#012
>>>>> File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper#012
>>>>> raise
>>>>> SecureError()#012SecureError
>>>>> Jan 26 15:35:59 office4a sanlock[3763]: 2014-01-26 15:35:59-0500 14686
>>>>> [30495]: read_sectors delta_leader offset 512 rv -90
>>>>> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>>>>>
>>>>>
>>>>>
>>>>> Jan 26 15:36:00 office4a sanlock[3763]: 2014-01-26 15:36:00-0500 14687
>>>>> [3772]: s1998 add_lockspace fail result -90
>>>>> Jan 26 15:36:00 office4a vdsm TaskManager.Task ERROR
>>>>> Task=`8db9ff1a-2894-407a-915a-279f6a7eb205`::Unexpected
>>>>> error#012Traceback
>>>>> (most recent call last):#012 File "/usr/share/vdsm/storage/task.py",
>>>>> line
>>>>> 857, in _run#012 return fn(*args, **kargs)#012 File
>>>>> "/usr/share/vdsm/storage/task.py", line 318, in run#012 return
>>>>> self.cmd(*self.argslist, **self.argsdict)#012 File
>>>>> "/usr/share/vdsm/storage/sp.py", line 273, in startSpm#012
>>>>> self.masterDomain.acquireHostId(self.id)#012 File
>>>>> "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId#012
>>>>> self._clusterLock.acquireHostId(hostId, async)#012 File
>>>>> "/usr/share/vdsm/storage/clusterlock.py", line 189, in
>>>>> acquireHostId#012 raise se.AcquireHostIdFailure(self._sdUUID,
>>>>> e)#012AcquireHostIdFailure: Cannot acquire host id:
>>>>> ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(90, 'Sanlock
>>>>> lockspace add failure', 'Message too long'))
>>
>> fede - thoughts on above?
>> (Vojtech reported something similar, but it sorted itself out for him
>> after some retries)
>>
>>>>>
>>>>> Ted Miller
>>>>> Elkhart, IN, USA
>>>>>
>>>>
>>>> this is the new storage domain? what about the previous volume for the
>>>> first SD?
>>>
>>> The default/default data center/cluster had to be abandoned because of a
>>> split-brain that could not be healed. I can't remove the old storage from
>>> the database, and can't get the data center up because of the corrupt
>>> storage, so it ends up a circular argument.
>>>
>>> I started over with same hosts, totally new storage in new data center.
>>> This mill/one data center/cluster was working fine with VM2 storage,
>>> then died.
>>>
>>> Ted Miller
> Looking at the problems with Sanlock, I read the page on it, then went
> poking around.
>
> I find the sanlock.log is full of:
>
> 2014-01-26 17:26:30-0500 21316 [3772]: s3071 lockspace
> 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0
> 2014-01-26 17:26:30-0500 21316 [11733]: 0322a407 aio collect 0
> 0x7f97380008c0:0x7f97380008d0:0x7f9738101000 result 0:0 match len 512
> 2014-01-26 17:26:30-0500 21316 [11733]: read_sectors delta_leader offset
> 512 rv -90
> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
> 2014-01-26 17:26:31-0500 21317 [3772]: s3071 add_lockspace fail result -90
>
> Just looking at the log, it looks like sanlock is trying to "read_sectors
> delta_leader offset 512", probably in this file:
> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>
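[Editor's aside: the `rv -90` in those sanlock lines appears to correspond to errno 90 on Linux, which is the same "Message too long" string that surfaces later in the vdsm SanlockException. A quick Python check, just to illustrate that assumption:]

```python
import errno
import os

# On Linux, errno 90 is EMSGSIZE, whose message text matches the
# "Message too long" seen in the SanlockException in the vdsm traceback.
print(errno.EMSGSIZE)    # 90 on Linux
print(os.strerror(90))   # "Message too long" on Linux
```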
> Well, I can see why it fails -- the ids file is empty.
> [root at office4a ~]$ ls -l
> /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/
> total 1029
> -rw-rw---- 1 vdsm kvm 0 Jan 22 00:44 ids
> -rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 inbox
> -rw-rw---- 1 vdsm kvm 2097152 Jan 21 18:20 leases
> -rw-r--r-- 1 vdsm kvm 491 Jan 21 18:20 metadata
> -rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 outbox
>
> I have no idea why it is empty, or what is supposed to be there, but
> anything trying to read at an offset of 512 is going to kick back an
> error on a 0-length file. :(
>
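[Editor's aside: a sketch of that last point, not of how sanlock actually does its I/O. A 512-byte read at offset 512 from a zero-length file does not fail outright; it just comes back short, which is the condition sanlock's read_sectors then reports as -90:]

```python
import os
import tempfile

# Create an empty file, like the 0-byte dom_md/ids above.
fd, path = tempfile.mkstemp()
try:
    # Try to read one 512-byte sector starting at offset 512.
    data = os.pread(fd, 512, 512)
    # The read returns 0 bytes: a short read, which sanlock's
    # delta-leader check rejects rather than treating as valid data.
    print(len(data))  # 0
finally:
    os.close(fd)
    os.remove(path)
```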
> I recall that on the previous ids file, on the domain that went up in smoke
> (improperly recovered split-brain), the file was about 1 megabyte.
>
> The date on the 'ids' file is probably about when the Data Center went
> belly-up.
>
> I hope this helps someone figure out what I can do to get out of this
> endless circle of trying to start the Data Center. Maybe there is a way to
> force an initialization of the ids file, and then everyone could be happy?
>
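[Editor's aside, for anyone who lands here from the archives: sanlock has a `sanlock direct init -s` mode for writing a lockspace area, and reinitializing the empty 'ids' file that way is one possible way out. Treat the exact invocation below as an untested suggestion, to be run only with the storage domain detached and sanlock/vdsm stopped. The helper is just a sketch that inspects the file and prints the candidate command instead of running anything:]

```python
import os
import sys

def suggest_ids_repair(sd_uuid, ids_path):
    """If the sanlock 'ids' file is empty, print a candidate repair
    command (sanlock direct init); never run anything destructive."""
    size = os.path.getsize(ids_path)
    if size == 0:
        # Lockspace spec format is <name>:<host_id>:<path>:<offset>;
        # host_id 0 asks sanlock to initialize the whole lockspace.
        print("ids file is empty; candidate repair (UNTESTED, run only "
              "with the domain detached and sanlock/vdsm stopped):")
        print("  sanlock direct init -s %s:0:%s:0" % (sd_uuid, ids_path))
    else:
        print("ids file is %d bytes; leave it alone" % size)

if __name__ == "__main__" and len(sys.argv) == 3:
    suggest_ids_repair(sys.argv[1], sys.argv[2])
```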
> Ted Miller
Since this has turned into a quest that seems to be focused on sanlock, I
have started a new thread under the subject: "sanlock can't read empty 'ids'
file".
Ted Miller