[Users] Data Center stuck between "Non Responsive" and "Contending"

Dafna Ron dron at redhat.com
Mon Jan 27 09:12:26 UTC 2014


I'm adding Vijay to see if he can help here.

Dafna


On 01/27/2014 08:47 AM, Federico Simoncelli wrote:
> ----- Original Message -----
>> From: "Itamar Heim" <iheim at redhat.com>
>> To: "Ted Miller" <tmiller at hcjb.org>, users at ovirt.org, "Federico Simoncelli" <fsimonce at redhat.com>
>> Cc: "Allon Mureinik" <amureini at redhat.com>
>> Sent: Sunday, January 26, 2014 11:17:04 PM
>> Subject: Re: [Users] Data Center stuck between "Non Responsive" and "Contending"
>>
>> On 01/27/2014 12:00 AM, Ted Miller wrote:
>>> On 1/26/2014 4:00 PM, Itamar Heim wrote:
>>>> On 01/26/2014 10:51 PM, Ted Miller wrote:
>>>>> On 1/26/2014 3:10 PM, Itamar Heim wrote:
>>>>>> On 01/26/2014 10:08 PM, Ted Miller wrote:
>>>>>> is this gluster storage (guessing, since you mentioned a 'volume')?
>>>>> yes (mentioned under "setup" above)
>>>>>> does it have a quorum?
>>>>> Volume Name: VM2
>>>>> Type: Replicate
>>>>> Volume ID: 7bea8d3b-ec2a-4939-8da8-a82e6bda841e
>>>>> Status: Started
>>>>> Number of Bricks: 1 x 3 = 3
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: 10.41.65.2:/bricks/01/VM2
>>>>> Brick2: 10.41.65.4:/bricks/01/VM2
>>>>> Brick3: 10.41.65.4:/bricks/101/VM2
>>>>> Options Reconfigured:
>>>>> cluster.server-quorum-type: server
>>>>> storage.owner-gid: 36
>>>>> storage.owner-uid: 36
>>>>> auth.allow: *
>>>>> user.cifs: off
>>>>> nfs.disa
>>>>>> (there were reports of split brain on the domain metadata before, when
>>>>>> no quorum exists for gluster)
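
For reference, and going beyond what was posted in the thread: on a replica 3
volume used for VM storage, client-side quorum can be enforced in addition to
the server-side quorum shown in the volume options above. A minimal sketch,
assuming the volume name VM2 from that output:

  $ gluster volume set VM2 cluster.quorum-type auto

With cluster.quorum-type set to auto, clients only allow writes while a
majority of the bricks in the replica set are reachable, which narrows the
window for split brain on files such as dom_md/ids.
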
>>>>> after full heal:
>>>>>
>>>>> [root@office4a ~]$ gluster volume heal VM2 info
>>>>> Gathering Heal info on volume VM2 has been successful
>>>>>
>>>>> Brick 10.41.65.2:/bricks/01/VM2
>>>>> Number of entries: 0
>>>>>
>>>>> Brick 10.41.65.4:/bricks/01/VM2
>>>>> Number of entries: 0
>>>>>
>>>>> Brick 10.41.65.4:/bricks/101/VM2
>>>>> Number of entries: 0
>>>>> [root@office4a ~]$ gluster volume heal VM2 info split-brain
>>>>> Gathering Heal info on volume VM2 has been successful
>>>>>
>>>>> Brick 10.41.65.2:/bricks/01/VM2
>>>>> Number of entries: 0
>>>>>
>>>>> Brick 10.41.65.4:/bricks/01/VM2
>>>>> Number of entries: 0
>>>>>
>>>>> Brick 10.41.65.4:/bricks/101/VM2
>>>>> Number of entries: 0
>>>>>
>>>>> I noticed this in the host's /var/log/messages (while looking for something
>>>>> else). The loop seems to repeat over and over.
>>>>>
>>>>> Jan 26 15:35:52 office4a sanlock[3763]: 2014-01-26 15:35:52-0500 14678 [30419]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>>>>> Jan 26 15:35:53 office4a sanlock[3763]: 2014-01-26 15:35:53-0500 14679 [3771]: s1997 add_lockspace fail result -90
>>>>> Jan 26 15:35:58 office4a vdsm TaskManager.Task ERROR Task=`89885661-88eb-4ea3-8793-00438735e4ab`::Unexpected error
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>>>>>     return fn(*args, **kargs)
>>>>>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>>>>>     res = f(*args, **kwargs)
>>>>>   File "/usr/share/vdsm/storage/hsm.py", line 2111, in getAllTasksStatuses
>>>>>     allTasksStatus = sp.getAllTasksStatuses()
>>>>>   File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper
>>>>>     raise SecureError()
>>>>> SecureError
>>>>> Jan 26 15:35:59 office4a sanlock[3763]: 2014-01-26 15:35:59-0500 14686 [30495]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>>>>> Jan 26 15:36:00 office4a sanlock[3763]: 2014-01-26 15:36:00-0500 14687 [3772]: s1998 add_lockspace fail result -90
>>>>> Jan 26 15:36:00 office4a vdsm TaskManager.Task ERROR Task=`8db9ff1a-2894-407a-915a-279f6a7eb205`::Unexpected error
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/share/vdsm/storage/task.py", line 857, in _run
>>>>>     return fn(*args, **kargs)
>>>>>   File "/usr/share/vdsm/storage/task.py", line 318, in run
>>>>>     return self.cmd(*self.argslist, **self.argsdict)
>>>>>   File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm
>>>>>     self.masterDomain.acquireHostId(self.id)
>>>>>   File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
>>>>>     self._clusterLock.acquireHostId(hostId, async)
>>>>>   File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
>>>>>     raise se.AcquireHostIdFailure(self._sdUUID, e)
>>>>> AcquireHostIdFailure: Cannot acquire host id: ('0322a407-2b16-40dc-ac67-13d387c6eb4c', SanlockException(90, 'Sanlock lockspace add failure', 'Message too long'))
>> Fede - thoughts on the above?
>> (Vojtech reported something similar, but it sorted itself out for him after
>> some retries)
> Something truncated the ids file, as also reported by:
>
>> [root@office4a ~]$ ls -l /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/
>> total 1029
>> -rw-rw---- 1 vdsm kvm 0 Jan 22 00:44 ids
>> -rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 inbox
>> -rw-rw---- 1 vdsm kvm 2097152 Jan 21 18:20 leases
>> -rw-r--r-- 1 vdsm kvm 491 Jan 21 18:20 metadata
>> -rw-rw---- 1 vdsm kvm 0 Jan 16 18:50 outbox
> In the past I saw that happening because of a glusterfs bug:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=862975
>
> Anyway, in general it seems that glusterfs is not always able to reconcile
> the ids file (as it is written by all the hosts at the same time).
>
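
As background on why every host writes this file concurrently: sanlock keeps
one delta-lease slot per host id in the ids file, and each host periodically
rewrites only its own slot. A rough sketch of the access pattern (illustrative
only; HOST_ID, the input file name, and the 512-byte sector size are
assumptions, not taken from the thread):

  # hypothetical: the host with id HOST_ID renews its own 512-byte slot in place
  $ dd if=/tmp/renewed_sector of=ids bs=512 seek=$((HOST_ID - 1)) count=1 conv=notrunc

So the file is never rewritten as a whole by a single writer; gluster has to
replicate many small in-place writes arriving from different clients at the
same time.
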
> Maybe someone from gluster can easily identify what happened. Meanwhile, if
> you just want to repair your data center, you could try the following:
>
>   $ cd /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/
>   $ touch ids
>   $ sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576
>
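
If the init succeeds, a quick sanity check (a sketch, not from the original
thread) is to confirm that the ids file is no longer zero bytes and is still
owned by vdsm:kvm (uid/gid 36), and optionally to inspect the initialized
delta-lease area:

  $ ls -l ids
  $ sanlock direct dump ids
  $ chown 36:36 ids    # only needed if the ownership was lost along the way

After that the hosts should be able to add the lockspace again (the
"add_lockspace fail result -90" messages should stop) and the SPM contention
can be retried.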

-- 
Dafna Ron


