<div dir="ltr"><div><div><div><div>This test harness setup here consists of two servers tied to NFS storage via IB (NFS mounts are via IPoIB, NFS over RDMA is disabled) . All storage domains are NFS. The issue does occur with both servers on when attempting to bring them out of maintenance mode with the end result being non-operational due to storage attach fail.<br>
<br></div>The current issue is that even with a working older commit the master storage domain is "stuck" in the "locked" state, and there is a secondary issue wherein VDSM cannot seem to find or contact the master storage domain even though it is there. I can mount the master storage domain manually, and all content appears to be accounted for on either host. <br>
<br></div>Here are the current contents of the master storage domain metadata:<br>CLASS=Data<br>DESCRIPTION=orgrimmar<br>IOOPTIMEOUTSEC=1<br>LEASERETRIES=3<br>LEASETIMESEC=5<br>LOCKPOLICY=<br>LOCKRENEWALINTERVALSEC=5<br>MASTER_VERSION=417<br>
POOL_DESCRIPTION=Azeroth<br>POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active<br>
POOL_SPM_ID=1<br>POOL_SPM_LVER=6<br>POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280<br>REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar<br>ROLE=Master<br>SDUUID=67534cca-1327-462a-b455-a04464084b31<br>TYPE=NFS<br>VERSION=3<br>_SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721<br>
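<br>(As a sanity check against metadata corruption, the _SHA_CKSUM above can be re-derived and compared, with the domain mounted manually as shown further below. A rough sketch, assuming (and this is only my assumption about how vdsm builds it) that the checksum is a plain SHA-1 over the other metadata lines:<br>
import hashlib<br>
md = open('/mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/metadata').read().splitlines()<br>
csum = hashlib.sha1()<br>
for line in md:<br>
    if not line.startswith('_SHA_CKSUM'): csum.update(line)  # skip the checksum line itself<br>
print csum.hexdigest()<br>
If the digest does not match the stored value, that would point at metadata corruption rather than a connectivity problem.)<br>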
<br>So now, with the older working commit, I get the "StoragePoolMasterNotFound: Cannot find master domain" error (prior details above from when I worked backwards to that commit).<br><br>This is odd, as the nodes can definitely reach the master storage domain:<br><br></div>showmount from one of the el6.3 nodes:<br>[root@kezan ~]# showmount -e 192.168.0.1<br>
Export list for 192.168.0.1:<br>/ovirt/orgrimmar 192.168.0.0/16<br><br></div>mount/ls from one of the nodes:<br><div><div>[root@kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt<br>
[root@kezan ~]# ls -al /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/<br>total 1100<br>drwxr-xr-x 2 vdsm kvm 4096 Jan 24 11:44 .<br>drwxr-xr-x 5 vdsm kvm 4096 Oct 19 16:16 ..<br>-rw-rw---- 1 vdsm kvm 1048576 Jan 19 22:09 ids<br>
-rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 inbox<br>-rw-rw---- 1 vdsm kvm 2097152 Jan 10 13:33 leases<br>-rw-r--r-- 1 vdsm kvm 903 Jan 10 13:39 metadata<br>-rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 outbox<br><br><br>
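(It might also be worth comparing the above against what vdsm sees under its own mount root rather than /mnt. The path below is an assumption on my part, following the usual /rhev/data-center/mnt layout with '/' in the export path replaced by '_':)<br>
[root@kezan ~]# ls -al /rhev/data-center/mnt/192.168.0.1:_ovirt_orgrimmar/67534cca-1327-462a-b455-a04464084b31/dom_md/<br>
<br>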
</div><div>- DHC<br></div><div><br></div></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Jan 24, 2013 at 7:51 AM, ybronhei <span dir="ltr"><<a href="mailto:ybronhei@redhat.com" target="_blank">ybronhei@redhat.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On 01/24/2013 12:44 AM, Dead Horse wrote:<br>
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">
I narrowed it down to the commit where the originally reported issue crept in:<br></div>
commit fc3a44f71d2ef202cff18d7203b9e4165b546621; building and testing with<div class="im"><br>
this commit or subsequent commits yields the original issue.<br>
</div></blockquote>
Interesting; it might be related to this commit, and we're trying to reproduce it.<br>
<br>
Did you try to remove that code and run again? Does it work without the addition of zombieReaper?<br>
Does the connectivity to the storage work well? When you run 'ls' on the mounted folder, do you see the files without a long delay? It might be related to too long a timeout when validating access to this mount.<br>
We are working on that; any additional info can help.<br>
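For example, timing a bounded listing of the mounted domains from the host would show whether access validation is just slow. This assumes the standard /rhev/data-center/mnt mount root, and the 10s value is only illustrative:<br>
[root@kezan ~]# time timeout 10 ls -al /rhev/data-center/mnt/*/<br>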
<br>
Thanks.<div><div class="h5"><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
- DHC<br>
<br>
<br>
On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse<br>
<<a href="mailto:deadhorseconsulting@gmail.com" target="_blank">deadhorseconsulting@gmail.com</a><u></u>>wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Indeed, reverting back to an older vdsm clears up the above issue. However,<br>
the issue I see now is:<br>
Thread-18::ERROR::2013-01-23<br>
15:50:42,885::task::833::TaskManager.Task::(_setError)<br>
Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error<br>
<br>
Traceback (most recent call last):<br>
File "/usr/share/vdsm/storage/task.<u></u>py", line 840, in _run<br>
return fn(*args, **kargs)<br>
File "/usr/share/vdsm/logUtils.py", line 42, in wrapper<br>
res = f(*args, **kwargs)<br>
File "/usr/share/vdsm/storage/hsm.<u></u>py", line 923, in connectStoragePool<br>
masterVersion, options)<br>
File "/usr/share/vdsm/storage/hsm.<u></u>py", line 970, in _connectStoragePool<br>
res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)<br>
File "/usr/share/vdsm/storage/sp.<u></u>py", line 643, in connect<br>
self.__rebuild(msdUUID=<u></u>msdUUID, masterVersion=masterVersion)<br>
File "/usr/share/vdsm/storage/sp.<u></u>py", line 1167, in __rebuild<br>
self.masterDomain = self.getMasterDomain(msdUUID=<u></u>msdUUID,<br>
masterVersion=masterVersion)<br>
File "/usr/share/vdsm/storage/sp.<u></u>py", line 1506, in getMasterDomain<br>
raise se.StoragePoolMasterNotFound(<u></u>self.spUUID, msdUUID)<br>
StoragePoolMasterNotFound: Cannot find master domain:<br>
'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,<br>
msdUUID=67534cca-1327-462a-b455-a04464084b31'<br>
Thread-18::DEBUG::2013-01-23<br>
15:50:42,887::task::852::TaskManager.Task::(_run)<br>
Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:<br>
08709e68-bcbc-40d8-843a-d69d4df40ac6<br>
('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,<br>
'f90a0d1c-06ca-11e2-a05b-00151712f280',<br>
'67534cca-1327-462a-b455-a04464084b31', 433) {} failed - stopping task<br>
<br>
This is with vdsm built from<br>
commit 25a2d8572ad32352227c98a86631300fbd6523c1<br>
- DHC<br>
<br>
<br>
On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse <<br>
<a href="mailto:deadhorseconsulting@gmail.com" target="_blank">deadhorseconsulting@gmail.com</a>> wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
VDSM was built from:<br>
commit 166138e37e75767b32227746bb671b1dab9cdd5e<br>
<br>
Attached is the full vdsm log<br>
<br>
I should also note that from the engine's perspective the master<br>
storage domain shows as locked and the others as unknown.<br>
<br>
<br>
On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg <<a href="mailto:danken@redhat.com" target="_blank">danken@redhat.com</a>> wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Any ideas on this one? (from VDSM log):<br>
Thread-25::DEBUG::2013-01-22<br>
15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client [3.57.111.30]::call<br>
getCapabilities with () {}<br>
Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed)<br>
cannot read ib0 speed<br>
Traceback (most recent call last):<br>
File "/usr/lib64/python2.6/site-<u></u>packages/vdsm/netinfo.py", line 155,<br>
</blockquote>
in<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
speed<br>
s = int(file('/sys/class/net/%s/<u></u>speed' % dev).read())<br>
IOError: [Errno 22] Invalid argument<br>
<br>
Causes VDSM to fail to attach storage<br>
</blockquote>
<br>
I doubt that this is the cause of the failure, as vdsm has always<br>
reported "0" for ib devices, and still is.<br>
</blockquote></blockquote></blockquote></blockquote></div></div>
It happens only when you call getCapabilities, so it is not related to the flow and it cannot affect the storage.<br>
Dan: I guess this is not the issue, but why the IOError?<br>
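For context, the failing read boils down to roughly the sketch below. My understanding, from the log and from what Dan says above, is that vdsm catches the IOError and simply reports 0; the EINVAL itself presumably comes from the kernel not exposing a meaningful 'speed' value for IPoIB interfaces, though that last part is a guess:<br>
try:<br>
    speed = int(file('/sys/class/net/ib0/speed').read())<br>
except IOError:<br>
    speed = 0  # reported as 0, matching what Dan describes above<br>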
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Does a former version work with your Engine?<br>
Could you share more of your vdsm.log? I suppose the culprit lies in<br>
one of the storage-related commands, not in statistics retrieval.<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Engine side sees:<br>
ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper]<br>
(QuartzScheduler_Worker-96) [553ef26e] The connection with details<br>
192.168.0.1:/ovirt/ds failed because of error code 100 and error message<br>
is: general exception<br>
2013-01-22 15:35:30,160 INFO<br>
[org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]<br>
(QuartzScheduler_Worker-96) [1ab78378] Running command:<br>
SetNonOperationalVdsCommand internal: true. Entities affected : ID:<br>
8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS<br>
2013-01-22 15:35:30,200 INFO<br>
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]<br>
(QuartzScheduler_Worker-96) [1ab78378] START,<br>
SetVdsStatusVDSCommand(HostName = kezan, HostId =<br>
8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational,<br>
nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd<br>
2013-01-22 15:35:30,211 INFO<br>
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]<br>
(QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand, log<br>
id: 4af5c4cd<br>
2013-01-22 15:35:30,242 ERROR<br>
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]<br>
(QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log<br>
values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value:<br>
storagepoolname<br>
<br>
Engine = latest master<br>
VDSM = latest master<br>
</blockquote>
<br>
Since "latest master" is an unstable reference by definition, I'm sure<br>
that History would thank you if you post the exact version (git hash?)<br>
of the code.<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
node = el6<br>
</blockquote>
<br>
<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
<br>
<br></div></div>
_______________________________________________<br>
Users mailing list<br>
<a href="mailto:Users@ovirt.org" target="_blank">Users@ovirt.org</a><br>
<a href="http://lists.ovirt.org/mailman/listinfo/users" target="_blank">http://lists.ovirt.org/<u></u>mailman/listinfo/users</a><br>
<br><span class="HOEnZb"><font color="#888888">
</font></span></blockquote><span class="HOEnZb"><font color="#888888">
<br>
<br>
-- <br>
Yaniv Bronhaim.<br>
RedHat, Israel<br>
09-7692289<br>
054-7744187<br>
</font></span></blockquote></div><br></div>