[Users] latest vdsm cannot read ib device speeds causing storage attach fail

Dead Horse deadhorseconsulting at gmail.com
Wed Jan 23 22:44:29 UTC 2013


I narrowed down the commit where the originally reported issue crept in:
commit fc3a44f71d2ef202cff18d7203b9e4165b546621. Building and testing with
this commit or any subsequent commit reproduces the original issue.
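
For anyone trying to reproduce the original report, the failing call (see the
netinfo traceback quoted further down) boils down to a bare sysfs read that
raises IOError on IPoIB interfaces instead of returning a number. A minimal
standalone check, assuming a Python 2 host with an interface named ib0:

    # Minimal reproduction of the netinfo speed() failure. Assumes
    # Python 2 (as on el6) and an InfiniBand interface named 'ib0'.
    dev = 'ib0'
    try:
        # Same read vdsm's netinfo.py performs (line 155 in the traceback):
        s = int(open('/sys/class/net/%s/speed' % dev).read())
        print '%s speed: %d Mb/s' % (dev, s)
    except IOError as e:
        # ib devices return EINVAL (Errno 22) for this sysfs attribute.
        print 'cannot read %s speed: %s' % (dev, e)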

- DHC


On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
<deadhorseconsulting at gmail.com> wrote:

> Indeed, reverting to an older vdsm clears up the above issue. However,
> the issue I now see is:
> Thread-18::ERROR::2013-01-23
> 15:50:42,885::task::833::TaskManager.Task::(_setError)
> Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error
>
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 840, in _run
>     return fn(*args, **kargs)
>   File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/storage/hsm.py", line 923, in connectStoragePool
>     masterVersion, options)
>   File "/usr/share/vdsm/storage/hsm.py", line 970, in _connectStoragePool
>     res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
>   File "/usr/share/vdsm/storage/sp.py", line 643, in connect
>     self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
>   File "/usr/share/vdsm/storage/sp.py", line 1167, in __rebuild
>     self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
> masterVersion=masterVersion)
>   File "/usr/share/vdsm/storage/sp.py", line 1506, in getMasterDomain
>     raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
> StoragePoolMasterNotFound: Cannot find master domain:
> 'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,
> msdUUID=67534cca-1327-462a-b455-a04464084b31'
> Thread-18::DEBUG::2013-01-23
> 15:50:42,887::task::852::TaskManager.Task::(_run)
> Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:
> 08709e68-bcbc-40d8-843a-d69d4df40ac6
> ('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,
> 'f90a0d1c-06ca-11e2-a05b-00151712f280',
> '67534cca-1327-462a-b455-a04464084b31', 433) {} failed - stopping task
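>
> For context, the failure mode in that traceback is easy to model: connect()
> rebuilds the pool, asks for the master domain by msdUUID, and raises
> StoragePoolMasterNotFound when it cannot produce it. A self-contained sketch
> of that shape (hypothetical names, with a plain dict standing in for vdsm's
> domain cache; this is not the actual sp.py code):
>
>     class StoragePoolMasterNotFound(Exception):
>         def __init__(self, spUUID, msdUUID):
>             Exception.__init__(self, "Cannot find master domain: "
>                                "'spUUID=%s, msdUUID=%s'" % (spUUID, msdUUID))
>
>     class StoragePool(object):
>         def __init__(self, spUUID, domains):
>             self.spUUID = spUUID
>             self.domains = domains  # stand-in for vdsm's domain cache
>
>         def getMasterDomain(self, msdUUID, masterVersion):
>             # Look up the master domain by UUID; the traceback above shows
>             # sp.py raising exactly this error when the lookup fails.
>             domain = self.domains.get(msdUUID)
>             if domain is None:
>                 raise StoragePoolMasterNotFound(self.spUUID, msdUUID)
>             return domain
>
> Feeding it an msdUUID the host cannot resolve reproduces the exact message
> quoted above.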
>
> This is with vdsm built from
> commit 25a2d8572ad32352227c98a86631300fbd6523c1
> - DHC
>
>
> On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse
> <deadhorseconsulting at gmail.com> wrote:
>
>> VDSM was built from:
>> commit 166138e37e75767b32227746bb671b1dab9cdd5e
>>
>> Attached is the full vdsm log
>>
>> I should also note that from the engine's perspective the master
>> storage domain is locked and the others are unknown.
>>
>>
>> On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg <danken at redhat.com> wrote:
>>
>>> On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:
>>> > Any ideas on this one? (from VDSM log):
>>> > Thread-25::DEBUG::2013-01-22 15:35:29,065::BindingXMLRPC::914::vds::(wrapper)
>>> > client [3.57.111.30]::call getCapabilities with () {}
>>> > Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed)
>>> > cannot read ib0 speed
>>> > Traceback (most recent call last):
>>> >   File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py", line 155,
>>> in
>>> > speed
>>> >     s = int(file('/sys/class/net/%s/speed' % dev).read())
>>> > IOError: [Errno 22] Invalid argument
>>> >
>>> > Causes VDSM to fail to attach storage
>>>
>>> I doubt that this is the cause of the failure, as vdsm has always
>>> reported "0" for ib devices, and still does.
>>>
>>> Does a former version work with your Engine?
>>> Could you share more of your vdsm.log? I suppose the culprit lies in
>>> one of the storage-related commands, not in statistics retrieval.
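>>>
>>> (Illustration: the error above is logged from netinfo.py line 159 while
>>> the read itself is at line 155, which suggests the read is already wrapped
>>> in a try/except that logs the failure and falls back to a default. Roughly
>>> this pattern; a sketch, not the verbatim netinfo.py source:)
>>>
>>>     import logging
>>>
>>>     def speed(dev):
>>>         # Read the link speed (Mb/s) from sysfs. On ib devices the read
>>>         # raises IOError (EINVAL), so log it and report 0, matching the
>>>         # "0" vdsm has always returned for ib devices.
>>>         try:
>>>             return int(open('/sys/class/net/%s/speed' % dev).read())
>>>         except IOError:
>>>             logging.exception('cannot read %s speed', dev)
>>>             return 0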
>>>
>>> >
>>> > Engine side sees:
>>> > ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper]
>>> > (QuartzScheduler_Worker-96) [553ef26e] The connection with details
>>> > 192.168.0.1:/ovirt/ds failed because of error code 100 and error
>>> > message is: general exception
>>> > 2013-01-22 15:35:30,160 INFO
>>> > [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
>>> > (QuartzScheduler_Worker-96) [1ab78378] Running command:
>>> > SetNonOperationalVdsCommand internal: true. Entities affected :  ID:
>>> > 8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS
>>> > 2013-01-22 15:35:30,200 INFO
>>> > [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
>>> > (QuartzScheduler_Worker-96) [1ab78378] START,
>>> > SetVdsStatusVDSCommand(HostName = kezan, HostId =
>>> > 8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational,
>>> > nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd
>>> > 2013-01-22 15:35:30,211 INFO
>>> > [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
>>> > (QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand,
>>> > log id: 4af5c4cd
>>> > 2013-01-22 15:35:30,242 ERROR
>>> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>> > (QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log
>>> > values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value:
>>> > storagepoolname
>>> >
>>> > Engine = latest master
>>> > VDSM = latest master
>>>
>>> Since "latest master" is an unstable reference by definition, I'm sure
>>> that History would thank you if you post the exact version (git hash?)
>>> of the code.
>>>
>>> > node = el6
>>>
>>>
>>
>