This test harness consists of two servers tied to NFS storage via IB (NFS
mounts are via IPoIB; NFS over RDMA is disabled). All storage domains are
NFS. The issue occurs on both servers when attempting to bring them out of
maintenance mode, with the end result being non-operational due to storage
attach failure.
The current issue is that, with the older working commit, the master
storage domain is "stuck" in the "locked" state, and I see a secondary
issue wherein VDSM cannot seem to find or contact the master storage
domain even though it is there. I can mount the master storage domain
manually, and all content appears to be accounted for on either host.
Here are the current contents of the master storage domain metadata:
CLASS=Data
DESCRIPTION=orgrimmar
IOOPTIMEOUTSEC=1
LEASERETRIES=3
LEASETIMESEC=5
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=417
POOL_DESCRIPTION=Azeroth
POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active
POOL_SPM_ID=1
POOL_SPM_LVER=6
POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280
REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar
ROLE=Master
SDUUID=67534cca-1327-462a-b455-a04464084b31
TYPE=NFS
VERSION=3
_SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721
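For quick inspection of these values on a node, here is a minimal sketch
(assuming the domain is mounted under /mnt as in the mount/ls output
further down; adjust the path to wherever it is actually mounted) that
parses the key=value metadata into a dict:

# Minimal sketch: parse the dom_md/metadata key=value file shown above.
# The mount point is an example; substitute the actual path on the node.
META = "/mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/metadata"

def read_metadata(path):
    md = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and "=" in line:
                key, value = line.split("=", 1)
                md[key] = value
    return md

if __name__ == "__main__":
    md = read_metadata(META)
    print("%s %s %s" % (md.get("SDUUID"), md.get("ROLE"), md.get("MASTER_VERSION")))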
So now, with the older working commit, I get the
"StoragePoolMasterNotFound: Cannot find master domain" error (prior
details above, from when I worked backwards to that commit).
This is odd as the nodes can definitely reach the master storage domain:
showmount from one of the el6.3 nodes:
[root@kezan ~]# showmount -e 192.168.0.1
Export list for 192.168.0.1:
/ovirt/orgrimmar 192.168.0.0/16
mount/ls from one of the nodes:
[root@kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt
[root@kezan ~]# ls -al /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/
total 1100
drwxr-xr-x 2 vdsm kvm 4096 Jan 24 11:44 .
drwxr-xr-x 5 vdsm kvm 4096 Oct 19 16:16 ..
-rw-rw---- 1 vdsm kvm 1048576 Jan 19 22:09 ids
-rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 inbox
-rw-rw---- 1 vdsm kvm 2097152 Jan 10 13:33 leases
-rw-r--r-- 1 vdsm kvm 903 Jan 10 13:39 metadata
-rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 outbox
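To check whether plain reachability is the whole story, here is a small
hedged timing probe (paths as mounted above) that measures how long
listing dom_md and reading the metadata file take; a long delay here
would point at the mount-validation timeout mentioned in the reply below
rather than at an unreachable domain:

# Hedged probe: time access to dom_md on the manually mounted domain.
# Paths match the mount/ls above; this is only a rough check for slow
# NFS access, not a VDSM validation routine.
import os
import time

DOM_MD = "/mnt/67534cca-1327-462a-b455-a04464084b31/dom_md"

def timed(label, fn):
    start = time.time()
    fn()
    print("%-15s %.3fs" % (label, time.time() - start))

if __name__ == "__main__":
    timed("listdir", lambda: os.listdir(DOM_MD))
    timed("read metadata", lambda: open(os.path.join(DOM_MD, "metadata")).read())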
- DHC
On Thu, Jan 24, 2013 at 7:51 AM, ybronhei <ybronhei(a)redhat.com> wrote:
On 01/24/2013 12:44 AM, Dead Horse wrote:
> I narrowed down on the commit where the originally reported issue crept
> in:
> commit fc3a44f71d2ef202cff18d7203b9e4165b546621. Building and testing with
> this commit or subsequent commits yields the original issue.
>
Interesting.. it might be related to this commit, and we're trying to
reproduce it.
Did you try to remove that code and run again? Does it work without the
addition of zombieReaper?
Does the connectivity to the storage work well? When you run 'ls' on the
mounted folder, do you see the files without a long delay? It might be
related to too long a timeout when validating access to this mount..
We are working on that.. any additional info can help.
Thanks.
> - DHC
>
>
> On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
> <deadhorseconsulting(a)gmail.com> wrote:
>
> Indeed reverting back to an older vdsm clears up the above issue. However,
>> now the issue I see is:
>> Thread-18::ERROR::2013-01-23
>> 15:50:42,885::task::833::TaskManager.Task::(_setError)
>> Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error
>>
>> Traceback (most recent call last):
>>   File "/usr/share/vdsm/storage/task.py", line 840, in _run
>>     return fn(*args, **kargs)
>>   File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
>>     res = f(*args, **kwargs)
>>   File "/usr/share/vdsm/storage/hsm.py", line 923, in connectStoragePool
>>     masterVersion, options)
>>   File "/usr/share/vdsm/storage/hsm.py", line 970, in _connectStoragePool
>>     res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
>>   File "/usr/share/vdsm/storage/sp.py", line 643, in connect
>>     self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
>>   File "/usr/share/vdsm/storage/sp.py", line 1167, in __rebuild
>>     self.masterDomain = self.getMasterDomain(msdUUID=msdUUID,
>>         masterVersion=masterVersion)
>>   File "/usr/share/vdsm/storage/sp.py", line 1506, in getMasterDomain
>>     raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
>> StoragePoolMasterNotFound: Cannot find master domain:
>> 'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,
>> msdUUID=67534cca-1327-462a-b455-a04464084b31'
>> Thread-18::DEBUG::2013-01-23
>> 15:50:42,887::task::852::TaskManager.Task::(_run)
>> Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:
>> 08709e68-bcbc-40d8-843a-d69d4df40ac6
>> ('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,
>> 'f90a0d1c-06ca-11e2-a05b-00151712f280',
>> '67534cca-1327-462a-b455-a04464084b31', 433) {} failed - stopping task
>>
>> This is with vdsm built from
>> commit 25a2d8572ad32352227c98a86631300fbd6523c1
>> - DHC
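For what it's worth, a hedged sanity check (illustration only, not VDSM
code): the Task._run line in the traceback above shows the msdUUID and
masterVersion passed to connectStoragePool, and they can be compared
against what the on-disk metadata of the mounted domain reports:

# Hedged sanity check (not VDSM code): compare the msdUUID/masterVersion
# from the Task._run log line above with the SDUUID/MASTER_VERSION stored
# in the domain's on-disk metadata. The mount point is an example.
MSD_UUID = "67534cca-1327-462a-b455-a04464084b31"
EXPECTED_VERSION = "433"  # masterVersion argument from the log above
META = "/mnt/%s/dom_md/metadata" % MSD_UUID

with open(META) as f:
    for line in f:
        if line.startswith(("SDUUID=", "MASTER_VERSION=")):
            print("on disk : " + line.strip())
print("expected: SDUUID=%s MASTER_VERSION=%s" % (MSD_UUID, EXPECTED_VERSION))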
>>
>>
>> On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse <
>> deadhorseconsulting(a)gmail.com> wrote:
>>
>> VDSM was built from:
>>> commit 166138e37e75767b32227746bb671b1dab9cdd5e
>>>
>>> Attached is the full vdsm log
>>>
>>> I should also note that the engine sees the master
>>> storage domain as locked and the others as unknown.
>>>
>>>
>>> On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg <danken(a)redhat.com> wrote:
>>>
>>> On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:
>>>>
>>>>> Any ideas on this one? (from VDSM log):
>>>>> Thread-25::DEBUG::2013-01-22
>>>>> 15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client
>>>>> [3.57.111.30]::call getCapabilities with () {}
>>>>> Thread-25::ERROR::2013-01-22
>>>>> 15:35:29,113::netinfo::159::root::(speed)
>>>>> cannot read ib0 speed
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py", line 155,
>>>>> in speed
>>>>>     s = int(file('/sys/class/net/%s/speed' % dev).read())
>>>>> IOError: [Errno 22] Invalid argument
>>>>>
>>>>> Causes VDSM to fail to attach storage
>>>>>
>>>>
>>>> I doubt that this is the cause of the failure, as vdsm has always
>>>> reported "0" for ib devices, and still is.
>>>>
It happens only when you call getCapabilities.. so it isn't related to
the flow, and it can't affect the storage.
Dan: I guess this is not the issue, but why the IOError?
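As an illustration of that IOError (a sketch only, not VDSM's actual
netinfo code): the sysfs read in the traceback above can fail with
Errno 22 on IPoIB devices, and a guarded read that falls back to 0 would
match the "0 for ib devices" behaviour Dan describes above:

# Sketch only (not VDSM's actual netinfo code): guard the sysfs speed read
# that fails in the traceback above and fall back to 0, matching the
# "0 for ib devices" behaviour described above.
def speed(dev):
    try:
        with open('/sys/class/net/%s/speed' % dev) as f:
            return int(f.read())
    except (IOError, OSError, ValueError):
        return 0

if __name__ == "__main__":
    print(speed('ib0'))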
>
>>>> Does a former version work with your Engine?
>>>> Could you share more of your vdsm.log? I suppose the culprit lies in
>>>> one of the storage-related commands, not in statistics retrieval.
>>>>
>>>>
>>>>> Engine side sees:
>>>>> ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper]
>>>>> (QuartzScheduler_Worker-96) [553ef26e] The connection with details
>>>>> 192.168.0.1:/ovirt/ds failed because of error code 100 and error
>>>>> message is: general exception
>>>>> 2013-01-22 15:35:30,160 INFO
>>>>> [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
>>>>> (QuartzScheduler_Worker-96) [1ab78378] Running command:
>>>>> SetNonOperationalVdsCommand internal: true. Entities affected : ID:
>>>>> 8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS
>>>>> 2013-01-22 15:35:30,200 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
>>>>> (QuartzScheduler_Worker-96) [1ab78378] START,
>>>>> SetVdsStatusVDSCommand(HostName = kezan, HostId =
>>>>> 8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational,
>>>>> nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd
>>>>> 2013-01-22 15:35:30,211 INFO
>>>>> [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
>>>>> (QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand,
>>>>> log id: 4af5c4cd
>>>>> 2013-01-22 15:35:30,242 ERROR
>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>> (QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log
>>>>> values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value:
>>>>> storagepoolname
>>>>>
>>>>> Engine = latest master
>>>>> VDSM = latest master
>>>>>
>>>>
>>>> Since "latest master" is an unstable reference by definition,
I'm sure
>>>> that History would thank you if you post the exact version (git hash?)
>>>> of the code.
>>>>
>>>> node = el6
> _______________________________________________
> Users mailing list
> Users(a)ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
--
Yaniv Bronhaim.
RedHat, Israel
09-7692289
054-7744187