On 1/26/2014 6:24 PM, Ted Miller wrote:
On 1/26/2014 5:17 PM, Itamar Heim wrote:
> On 01/27/2014 12:00 AM, Ted Miller wrote:
>>
>> On 1/26/2014 4:00 PM, Itamar Heim wrote:
>>> On 01/26/2014 10:51 PM, Ted Miller wrote:
>>>>
>>>> On 1/26/2014 3:10 PM, Itamar Heim wrote:
>>>>> On 01/26/2014 10:08 PM, Ted Miller wrote:
>>>>>> My Data Center is down, and won't come back up.
>>>>>>
>>>>>> Data Center Status on the GUI flips between "Non Responsive" and
>>>>>> "Contending"
>>>>>>
>>>>>> Also noted:
>>>>>> Host sometimes seen flipping between "Low" and "Contending" in SPM
>>>>>> column.
>>>>>> Storage VM2 "Data (Master)" is in "Cross Data-Center Status" = Unknown
>>>>>> VM2 is "up" under "Volumes" tab
>>>>>>
>>>>>> Created another volume for VM storage. It shows up in "volumes" tab,
>>>>>> but when I try to add "New Domain" in storage tab, says that "There are
>>>>>> No Data Centers to which the Storage Domain can be attached"
>>>>>>
>>>>>> Setup:
>>>>>> 2 hosts w/ glusterfs storage
>>>>>> 1 engine
>>>>>> all 3 computers CentOS 6.5, just updated
>>>>>> ovirt-engine 3.3.0.1-1.el6
>>>>>> ovirt-engine-lib 3.3.2-1.el6
>>>>>> ovirt-host-deploy.noarch 1.1.3-1.el6
>>>>>> glusterfs.x86_64 3.4.2-1.el6
>>>>>>
>>>>>> This loop seems to repeat in the ovirt-engine log (grep of the log
>>>>>> showing only the DefaultQuartzScheduler_Worker-79 thread):
>>>>>>
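For reference, an excerpt like the one below can be pulled with something
along these lines (assuming the default engine log location):

  grep 'DefaultQuartzScheduler_Worker-79' /var/log/ovirt-engine/engine.log
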
>>>>>> 2014-01-26 14:44:58,416 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) Irs placed on server
>>>>>> 9a591103-83be-4ca9-b207-06929223b541 failed. Proceed Failover
>>>>>> 2014-01-26 14:44:58,511 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) hostFromVds::selectedVds - office4a,
>>>>>> spmStatus Free, storage pool mill
>>>>>> 2014-01-26 14:44:58,550 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) SpmStatus on vds
>>>>>> 127ed939-34af-41a8-87a0-e2f6174b1877: Free
>>>>>> 2014-01-26 14:44:58,571 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) starting spm on vds office4a, storage
>>>>>> pool mill, prevId 2, LVER 15
>>>>>> 2014-01-26 14:44:58,579 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) START, SpmStartVDSCommand(HostName =
>>>>>> office4a, HostId = 127ed939-34af-41a8-87a0-e2f6174b1877, storagePoolId =
>>>>>> 536a864d-83aa-473a-a675-e38aafdd9071, prevId=2, prevLVER=15,
>>>>>> storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log
>>>>>> id: 74c38eb7
>>>>>> 2014-01-26 14:44:58,617 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) spmStart polling started: taskId =
>>>>>> e8986753-fc80-4b11-a11d-6d3470b1728c
>>>>>> 2014-01-26 14:45:00,662 ERROR
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) Failed in HSMGetTaskStatusVDS method
>>>>>> 2014-01-26 14:45:00,664 ERROR
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) Error code AcquireHostIdFailure and
>>>>>> error message VDSGenericException: VDSErrorException: Failed to
>>>>>> HSMGetTaskStatusVDS, error = Cannot acquire host id
>>>>>> 2014-01-26 14:45:00,665 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) spmStart polling ended: taskId =
>>>>>> e8986753-fc80-4b11-a11d-6d3470b1728c task status = finished
>>>>>> 2014-01-26 14:45:00,666 ERROR
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) Start SPM Task failed - result:
>>>>>> cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to
>>>>>> HSMGetTaskStatusVDS, error = Cannot acquire host id
>>>>>> 2014-01-26 14:45:00,695 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) spmStart polling ended, spm status: Free
>>>>>> 2014-01-26 14:45:00,702 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) START,
>>>>>> HSMClearTaskVDSCommand(HostName = office4a, HostId =
>>>>>> 127ed939-34af-41a8-87a0-e2f6174b1877,
>>>>>> taskId=e8986753-fc80-4b11-a11d-6d3470b1728c), log id: 336ec5a6
>>>>>> 2014-01-26 14:45:00,722 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) FINISH, HSMClearTaskVDSCommand, log
>>>>>> id: 336ec5a6
>>>>>> 2014-01-26 14:45:00,724 INFO
>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) FINISH, SpmStartVDSCommand, return:
>>>>>> org.ovirt.engine.core.common.businessentities.SpmStatusResult@13652652,
>>>>>> log id: 74c38eb7
>>>>>> 2014-01-26 14:45:00,733 INFO
>>>>>> [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79) Running command:
>>>>>> SetStoragePoolStatusCommand internal: true. Entities affected : ID:
>>>>>> 536a864d-83aa-473a-a675-e38aafdd9071 Type: StoragePool
>>>>>> 2014-01-26 14:45:00,778 ERROR
>>>>>> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
>>>>>> (DefaultQuartzScheduler_Worker-79)
>>>>>> IrsBroker::Failed::GetStoragePoolInfoVDS due to:
>>>>>> IrsSpmStartFailedException: IRSGenericException: IRSErrorException:
>>>>>> SpmStart failed
>>>>>>
>>>>>> Ted Miller
>>>>>> Elkhart, IN, USA
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Users mailing list
>>>>>> Users@ovirt.org
>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>
>>>>>
>>>>> is this gluster storage (guessing since you mentioned a 'volume')?
>>>> yes (mentioned under "setup" above)
>>>>> does it have a quorum?
>>>> Volume Name: VM2
>>>> Type: Replicate
>>>> Volume ID: 7bea8d3b-ec2a-4939-8da8-a82e6bda841e
>>>> Status: Started
>>>> Number of Bricks: 1 x 3 = 3
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: 10.41.65.2:/bricks/01/VM2
>>>> Brick2: 10.41.65.4:/bricks/01/VM2
>>>> Brick3: 10.41.65.4:/bricks/101/VM2
>>>> Options Reconfigured:
>>>> cluster.server-quorum-type: server
>>>> storage.owner-gid: 36
>>>> storage.owner-uid: 36
>>>> auth.allow: *
>>>> user.cifs: off
>>>> nfs.disa
>>>>> (there were reports of split brain on the domain metadata before when
>>>>> no quorum exists for gluster)
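Only server-side quorum is set on VM2 above. If client-side quorum is what
matters here, my understanding (not tested on this setup, so treat it as a
guess at the right knob rather than a verified fix) is that it can be
enabled on the replica-3 volume and then confirmed with:

  gluster volume set VM2 cluster.quorum-type auto
  gluster volume info VM2
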
>>>> after full heal:
>>>>
>>>> [root@office4a ~]$ gluster volume heal VM2 info
>>>> Gathering Heal info on volume VM2 has been successful
>>>>
>>>> Brick 10.41.65.2:/bricks/01/VM2
>>>> Number of entries: 0
>>>>
>>>> Brick 10.41.65.4:/bricks/01/VM2
>>>> Number of entries: 0
>>>>
>>>> Brick 10.41.65.4:/bricks/101/VM2
>>>> Number of entries: 0
>>>> [root@office4a ~]$ gluster volume heal VM2 info split-brain
>>>> Gathering Heal info on volume VM2 has been successful
>>>>
>>>> Brick 10.41.65.2:/bricks/01/VM2
>>>> Number of entries: 0
>>>>
>>>> Brick 10.41.65.4:/bricks/01/VM2
>>>> Number of entries: 0
>>>>
>>>> Brick 10.41.65.4:/bricks/101/VM2
>>>> Number of entries: 0
>>>>
>>>> noticed this in host /var/log/messages (while looking for something
>>>> else). Loop seems to repeat over and over.
>>>>
>>>> Jan 26 15:35:52 office4a sanlock[3763]: 2014-01-26 15:35:52-0500 14678
>>>> [30419]: read_sectors delta_leader offset 512 rv -90
>>>> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>>>> Jan 26 15:35:53 office4a sanlock[3763]: 2014-01-26 15:35:53-0500 14679
>>>> [3771]: s1997 add_lockspace fail result -90
>>>> Jan 26 15:35:58 office4a vdsm TaskManager.Task ERROR
>>>> Task=`89885661-88eb-4ea3-8793-00438735e4ab`::Unexpected error#012Traceback
>>>> (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line
>>>> 857, in _run#012    return fn(*args, **kargs)#012  File
>>>> "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012    res = f(*args,
>>>> **kwargs)#012  File "/usr/share/vdsm/storage/hsm.py", line 2111, in
>>>> getAllTasksStatuses#012    allTasksStatus = sp.getAllTasksStatuses()#012
>>>> File "/usr/share/vdsm/storage/securable.py", line 66, in wrapper#012
>>>> raise SecureError()#012SecureError
>>>> Jan 26 15:35:59 office4a sanlock[3763]: 2014-01-26 15:35:59-0500 14686
>>>> [30495]: read_sectors delta_leader offset 512 rv -90
>>>> /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
>>>> Jan 26 15:36:00 office4a sanlock[3763]: 2014-01-26 15:36:00-0500 14687
>>>> [3772]: s1998 add_lockspace fail result -90
>>>> Jan 26 15:36:00 office4a vdsm TaskManager.Task ERROR
>>>> Task=`8db9ff1a-2894-407a-915a-279f6a7eb205`::Unexpected error#012Traceback
>>>> (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line
>>>> 857, in _run#012    return fn(*args, **kargs)#012  File
>>>> "/usr/share/vdsm/storage/task.py", line 318, in run#012    return
>>>> self.cmd(*self.argslist, **self.argsdict)#012  File
>>>> "/usr/share/vdsm/storage/sp.py", line 273, in startSpm#012
>>>> self.masterDomain.acquireHostId(self.id)#012  File
>>>> "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId#012
>>>> self._clusterLock.acquireHostId(hostId, async)#012  File
>>>> "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId#012
>>>> raise se.AcquireHostIdFailure(self._sdUUID, e)#012AcquireHostIdFailure:
>>>> Cannot acquire host id: ('0322a407-2b16-40dc-ac67-13d387c6eb4c',
>>>> SanlockException(90, 'Sanlock lockspace add failure', 'Message too long'))
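If it helps narrow this down, the same lockspace add that vdsm is attempting
can apparently be tried by hand with the sanlock client (going by the
sanlock man page -- I have not verified the exact syntax; host_id 2 matches
the prevId in the engine log above):

  sanlock client add_lockspace -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0
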
>
> fede - thoughts on above?
> (vojtech reported something similar, but it sorted out for him after some
> retries)
>
>>>>
>>>> Ted Miller
>>>> Elkhart, IN, USA
>>>>
>>>
>>> this is the new storage domain? what about the previous volume for the
>>> first SD?
>>
>> The default/default data center/cluster had to be abandoned because of a
>> split-brain that could not be healed. I can't remove the old storage from
>> the database, and I can't get the data center up due to the corrupt
>> storage, so it ends up being a circular problem.
>>
>> I started over with the same hosts and totally new storage in a new data
>> center. This mill/one data center/cluster was working fine with the VM2
>> storage, then it died.
>>
>> Ted Miller
Looking at the problems with sanlock, I read the page on it, then went
poking around.

I found that the sanlock.log is full of:
2014-01-26 17:26:30-0500 21316 [3772]: s3071 lockspace 0322a407-2b16-40dc-ac67-13d387c6eb4c:2:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0
2014-01-26 17:26:30-0500 21316 [11733]: 0322a407 aio collect 0 0x7f97380008c0:0x7f97380008d0:0x7f9738101000 result 0:0 match len 512
2014-01-26 17:26:30-0500 21316 [11733]: read_sectors delta_leader offset 512 rv -90 /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
2014-01-26 17:26:31-0500 21317 [3772]: s3071 add_lockspace fail result -90
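
As a sanity check on the -90: errno 90 on Linux is EMSGSIZE, which lines up
with the "Message too long" in the vdsm traceback above. It can be confirmed
with a one-liner, which should print "EMSGSIZE Message too long":

  python -c 'import errno, os; print errno.errorcode[90], os.strerror(90)'
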
Just looking at the log, it looks like sanlock is trying to "read_sectors
delta_leader offset 512", probably in this file:
/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids
Well, I can see why it fails -- the ids file is empty.
[root@office4a ~]$ ls /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ -l
total 1029
-rw-rw---- 1 vdsm kvm       0 Jan 22 00:44 ids
-rw-rw---- 1 vdsm kvm       0 Jan 16 18:50 inbox
-rw-rw---- 1 vdsm kvm 2097152 Jan 21 18:20 leases
-rw-r--r-- 1 vdsm kvm     491 Jan 21 18:20 metadata
-rw-rw---- 1 vdsm kvm       0 Jan 16 18:50 outbox
I have no idea why it is empty, or what is supposed to be there, but
anything trying to read with an offset of 512 is going to kick back an
error on a 0 length file. :(
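For comparison, the sanlock man page describes a "direct dump" that prints
the leader records it finds at a given path; I have not used it before, but
on a healthy domain's ids file it should show one delta-lease leader record
per host, and on this empty file presumably nothing:

  sanlock direct dump /rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids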
I recall that on the previous ids file, on the domain that went up in smoke
(improperly recovered split-brain), the file was about 1 megabyte.
The date on the 'ids' file is probably about when the Data Center went
belly-up.
I hope this helps someone figure out what I can do to get out of this
endless circle of trying to start the Data Center. Maybe there is a way to
force an initialization of the ids file, and then everyone could be happy?
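One possibility I have not dared to try yet: the sanlock man page describes
a "direct init -s" that writes a fresh lockspace structure at the given
path, which sounds like exactly the "force an initialization" I am asking
about. My guess at what it would look like here (host_id given as 0 for
init, with everything using the domain stopped first) is:

  sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:/rhev/data-center/mnt/glusterSD/10.41.65.2:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/ids:0

but I would want someone who knows sanlock to confirm that is safe before
running it against the master domain.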
Ted Miller
Since this has turned into a quest that seems to be focused on sanlock, I
have started a new thread under the subject: "sanlock can't read empty
'ids' file".

Ted Miller