<div dir="ltr"><div><div><div><div>This test harness setup here consists of two servers tied to NFS storage via IB (NFS mounts are via IPoIB, NFS over RDMA is disabled) . All storage domains are NFS. The issue does occur with both servers on when attempting to bring them out of maintenance mode with the end result being non-operational due to storage attach fail.<br>
<br></div>The current issue is that even with a working older commit the master storage domain is "stuck" in the "locked" state, and there is a secondary issue wherein VDSM cannot seem to find or contact the master storage domain even though it is there. I can mount the master storage domain manually, and all content appears to be accounted for on either host. <br>
<br></div>Here are the current contents of the master storage domain metadata:<br>CLASS=Data<br>DESCRIPTION=orgrimmar<br>IOOPTIMEOUTSEC=1<br>LEASERETRIES=3<br>LEASETIMESEC=5<br>LOCKPOLICY=<br>LOCKRENEWALINTERVALSEC=5<br>MASTER_VERSION=417<br>
POOL_DESCRIPTION=Azeroth<br>POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active<br>
POOL_SPM_ID=1<br>POOL_SPM_LVER=6<br>POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280<br>REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar<br>ROLE=Master<br>SDUUID=67534cca-1327-462a-b455-a04464084b31<br>TYPE=NFS<br>VERSION=3<br>_SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721<br>
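<br>(As a sanity check against metadata corruption, the _SHA_CKSUM above can be re-derived and compared, with the domain mounted manually as shown further below. A rough sketch, assuming (and this is only my assumption about how vdsm builds it) that the checksum is a plain SHA-1 over the other metadata lines:<br>
import hashlib<br>
md = open('/mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/metadata').read().splitlines()<br>
csum = hashlib.sha1()<br>
for line in md:<br>
    if not line.startswith('_SHA_CKSUM'): csum.update(line)  # skip the checksum line itself<br>
print csum.hexdigest()<br>
If the digest does not match the stored value, that would point at metadata corruption rather than a connectivity problem.)<br>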
<br>So now, with the older working commit, I get the "StoragePoolMasterNotFound: Cannot find master domain" error (prior details above from when I worked backwards to that commit).<br><br>This is odd, as the nodes can definitely reach the master storage domain:<br><br></div>showmount from one of the el6.3 nodes:<br>[root@kezan ~]# showmount -e 192.168.0.1<br>
Export list for 192.168.0.1:<br>/ovirt/orgrimmar 192.168.0.0/16<br><br></div>mount/ls from one of the nodes:<br><div><div>[root@kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt<br>
[root@kezan ~]# ls -al /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/<br>total 1100<br>drwxr-xr-x 2 vdsm kvm 4096 Jan 24 11:44 .<br>drwxr-xr-x 5 vdsm kvm 4096 Oct 19 16:16 ..<br>-rw-rw---- 1 vdsm kvm 1048576 Jan 19 22:09 ids<br>
-rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 inbox<br>-rw-rw---- 1 vdsm kvm 2097152 Jan 10 13:33 leases<br>-rw-r--r-- 1 vdsm kvm 903 Jan 10 13:39 metadata<br>-rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 outbox<br><br><br>
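(It might also be worth comparing the above against what vdsm sees under its own mount root rather than /mnt. The path below is an assumption on my part, following the usual /rhev/data-center/mnt layout with '/' in the export path replaced by '_':)<br>
[root@kezan ~]# ls -al /rhev/data-center/mnt/192.168.0.1:_ovirt_orgrimmar/67534cca-1327-462a-b455-a04464084b31/dom_md/<br>
<br>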
</div><div>- DHC<br></div><div><br></div></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Jan 24, 2013 at 7:51 AM, ybronhei <span dir="ltr"><<a href="mailto:ybronhei@redhat.com" target="_blank">ybronhei@redhat.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On 01/24/2013 12:44 AM, Dead Horse wrote:<br>
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">
I narrowed it down to the commit where the originally reported issue crept in:<br></div>
commit fc3a44f71d2ef202cff18d7203b9e4165b546621; building and testing with<div class="im"><br>
this commit or subsequent commits yields the original issue.<br>
</div></blockquote>
Interesting; it might be related to this commit, and we're trying to reproduce it.<br>
<br>
Did you try to remove that code and run again? Does it work without the addition of zombieReaper?<br>
Does the connectivity to the storage work well? When you run 'ls' on the mounted folder, do you see the files without a long delay? It might be related to too long a timeout when validating access to this mount.<br>
We are working on that; any additional info can help.<br>
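For example, timing a bounded listing of the mounted domains from the host would show whether access validation is just slow. This assumes the standard /rhev/data-center/mnt mount root, and the 10s value is only illustrative:<br>
[root@kezan ~]# time timeout 10 ls -al /rhev/data-center/mnt/*/<br>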
<br>
Thanks.<div><div class="h5"><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
- DHC<br>
<br>
<br>
On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse<br>
<<a href="mailto:deadhorseconsulting@gmail.com" target="_blank">deadhorseconsulting@gmail.com</a><u></u>>wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Indeed, reverting back to an older vdsm clears up the above issue. However,<br>
the issue I see now is:<br>
Thread-18::ERROR::2013-01-23<br>
15:50:42,885::task::833::TaskManager.Task::(_setError)<br>
Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error<br>
<br>
Traceback (most recent call last):<br>
File "/usr/share/vdsm/storage/task.<u></u>py", line 840, in _run<br>
return fn(*args, **kargs)<br>
File "/usr/share/vdsm/logUtils.py", line 42, in wrapper<br>
res = f(*args, **kwargs)<br>
File "/usr/share/vdsm/storage/hsm.<u></u>py", line 923, in connectStoragePool<br>
masterVersion, options)<br>
File "/usr/share/vdsm/storage/hsm.<u></u>py", line 970, in _connectStoragePool<br>
res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)<br>
File "/usr/share/vdsm/storage/sp.<u></u>py", line 643, in connect<br>
self.__rebuild(msdUUID=<u></u>msdUUID, masterVersion=masterVersion)<br>
File "/usr/share/vdsm/storage/sp.<u></u>py", line 1167, in __rebuild<br>
self.masterDomain = self.getMasterDomain(msdUUID=<u></u>msdUUID,<br>
masterVersion=masterVersion)<br>
File "/usr/share/vdsm/storage/sp.<u></u>py", line 1506, in getMasterDomain<br>
raise se.StoragePoolMasterNotFound(<u></u>self.spUUID, msdUUID)<br>
StoragePoolMasterNotFound: Cannot find master domain:<br>
'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,<br>
msdUUID=67534cca-1327-462a-b455-a04464084b31'<br>
Thread-18::DEBUG::2013-01-23<br>
15:50:42,887::task::852::TaskManager.Task::(_run)<br>
Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:<br>
08709e68-bcbc-40d8-843a-d69d4df40ac6<br>
('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,<br>
'f90a0d1c-06ca-11e2-a05b-00151712f280',<br>
'67534cca-1327-462a-b455-a04464084b31', 433) {} failed - stopping task<br>
<br>
This is with vdsm built from<br>
commit 25a2d8572ad32352227c98a86631300fbd6523c1<br>
- DHC<br>
<br>
<br>
On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse <<br>
<a href="mailto:deadhorseconsulting@gmail.com" target="_blank">deadhorseconsulting@gmail.com</a>> wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
VDSM was built from:<br>
commit 166138e37e75767b32227746bb671b1dab9cdd5e<br>
<br>
Attached is the full vdsm log<br>
<br>
I should also note that from the engine's perspective the master<br>
storage domain shows as locked and the others as unknown.<br>
<br>
<br>
On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg <<a href="mailto:danken@redhat.com" target="_blank">danken@redhat.com</a>> wrote:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Any ideas on this one? (from VDSM log):<br>
Thread-25::DEBUG::2013-01-22<br>
15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client [3.57.111.30]::call<br>
getCapabilities with () {}<br>
Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed)<br>
cannot read ib0 speed<br>
Traceback (most recent call last):<br>
File "/usr/lib64/python2.6/site-<u></u>packages/vdsm/netinfo.py", line 155,<br>
</blockquote>
in<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
speed<br>
s = int(file('/sys/class/net/%s/<u></u>speed' % dev).read())<br>
IOError: [Errno 22] Invalid argument<br>
<br>
Causes VDSM to fail to attach storage<br>
</blockquote>
<br>
I doubt that this is the cause of the failure, as vdsm has always<br>
reported "0" for ib devices, and still is.<br>
</blockquote></blockquote></blockquote></blockquote></div></div>
It happens only when you call getCapabilities, so it is not related to the flow and it cannot affect the storage.<br>
Dan: I guess this is not the issue, but why the IOError?<br>
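For context, the failing read boils down to roughly the sketch below. My understanding, from the log and from what Dan says above, is that vdsm catches the IOError and simply reports 0; the EINVAL itself presumably comes from the kernel not exposing a meaningful 'speed' value for IPoIB interfaces, though that last part is a guess:<br>
try:<br>
    speed = int(file('/sys/class/net/ib0/speed').read())<br>
except IOError:<br>
    speed = 0  # reported as 0, matching what Dan describes above<br>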
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Does a former version work with your Engine?<br>
Could you share more of your vdsm.log? I suppose the culprit lies in<br>
one of the storage-related commands, not in statistics retrieval.<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Engine side sees:<br>
ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper]<br>
(QuartzScheduler_Worker-96) [553ef26e] The connection with details<br>
192.168.0.1:/ovirt/ds failed because of error code 100 and error message<br>
is: general exception<br>
2013-01-22 15:35:30,160 INFO<br>
[org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]<br>
(QuartzScheduler_Worker-96) [1ab78378] Running command:<br>
SetNonOperationalVdsCommand internal: true. Entities affected : ID:<br>
8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS<br>
2013-01-22 15:35:30,200 INFO<br>
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]<br>
(QuartzScheduler_Worker-96) [1ab78378] START,<br>
SetVdsStatusVDSCommand(HostName = kezan, HostId =<br>
8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational,<br>
nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd<br>
2013-01-22 15:35:30,211 INFO<br>
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]<br>
(QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand, log<br>
id: 4af5c4cd<br>
2013-01-22 15:35:30,242 ERROR<br>
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]<br>
(QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log<br>
values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value:<br>
storagepoolname<br>
<br>
Engine = latest master<br>
VDSM = latest master<br>
</blockquote>
<br>
Since "latest master" is an unstable reference by definition, I'm sure<br>
that History would thank you if you post the exact version (git hash?)<br>
of the code.<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
node = el6<br>
</blockquote>
<br>
<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
<br>
<br></div></div>
_______________________________________________<br>
Users mailing list<br>
<a href="mailto:Users@ovirt.org" target="_blank">Users@ovirt.org</a><br>
<a href="http://lists.ovirt.org/mailman/listinfo/users" target="_blank">http://lists.ovirt.org/<u></u>mailman/listinfo/users</a><br>
<br><span class="HOEnZb"><font color="#888888">
</font></span></blockquote><span class="HOEnZb"><font color="#888888">
<br>
<br>
-- <br>
Yaniv Bronhaim.<br>
RedHat, Israel<br>
09-7692289<br>
054-7744187<br>
</font></span></blockquote></div><br></div>