<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jan 23, 2018 at 6:47 PM, Sergey Kulikov <span dir="ltr">&lt;<a href="mailto:serg_k@msm.ru" target="_blank">serg_k@msm.ru</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

Or maybe somebody can point me to the right place for submitting this?<br>

Thanks. :)<br>

<br></blockquote><div>CentOS has a bug tracker[1], but I think it&#39;s worth first checking whether this is reproducible on another OS, Fedora for example.</div><div>Y.</div><div><br></div><div>[1] <a href="https://bugs.centos.org/main_page.php">https://bugs.centos.org/main_page.php</a> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

---<br>

<br>

<br>

<br>

 Monday, January 22, 2018, 14:10:53:<br>

<div class="gmail-HOEnZb"><div class="gmail-h5"><br>

&gt; This is a test environment running CentOS 7.4, oVirt 4.2.0, kernel 3.10.0-693.11.6.el7.x86_64 (3.10.0-693.11.1 and 3.10.0-693 have the same bugs)<br>

&gt;<br>

&gt;<br>

&gt; 1. Can&#39;t force NFS to 4.0.<br>

&gt; Some time ago I set the NFS version for all storage domains to V4, because of a bug between NetApp Data ONTAP 8.x<br>

&gt; and RHEL when using NFS 4.1 (NFS mounts started to hang after a while, STATEID problems). With V4, CentOS 7.2 and 7.3 mounted NFS as 4.0,<br>

&gt; so there were no NFS-related problems. After CentOS 7.4 was released, I noticed that the mount points started to hang again:<br>

&gt; NFS was mounted with vers=4.1, and it&#39;s not possible to change it to 4.0, since both the &quot;V4&quot; and &quot;V4.1&quot; options mount as 4.1. It looks like the V4 option uses<br>

&gt; the system default version for 4.x, and as far as I know that default changed in CentOS 7.4 from 4.0 to 4.1. Maybe a 4.0 option should be added<br>

&gt; to force version 4.0? Adding vers=/nfsvers= in &quot;Additional mount options&quot; is rejected by oVirt.<br>

&gt; I know I can turn NFS 4.1 off on the NetApp side, but there may be situations where the storage is outside my control, and version 4.0 can&#39;t be<br>

&gt; set on the oVirt side.<br>
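To confirm which NFS version a host actually negotiated, the mount table can be inspected directly. The following is a minimal sketch, assuming the kernel reports a <code>vers=</code> option in <code>/proc/mounts</code>-style output; the function name and the sample mount lines are hypothetical and only illustrate the parsing.<br>

```python
def nfs_versions(mounts_text):
    """Map each NFS mount point to its 'vers=' option (or 'unknown')."""
    versions = {}
    for line in mounts_text.splitlines():
        try:
            # /proc/mounts fields: device, mount point, fs type, options, ...
            device, mount_point, fstype, options = line.split()[:4]
        except ValueError:
            continue  # blank or malformed line
        if fstype not in ("nfs", "nfs4"):
            continue
        # Turn "rw,vers=4.1,..." into {"rw": "", "vers": "4.1", ...}
        opts = dict(opt.partition("=")[::2] for opt in options.split(","))
        versions[mount_point] = opts.get("vers", "unknown")
    return versions


# Hypothetical sample in /proc/mounts format, not taken from the affected hosts.
SAMPLE_MOUNTS = """\
/dev/sda1 / xfs rw,relatime 0 0
filer:/vol/iso /mnt/iso nfs4 rw,relatime,vers=4.1,proto=tcp 0 0
filer:/vol/export /mnt/export nfs4 rw,relatime,vers=4.0,proto=tcp 0 0
"""

for mount_point, version in nfs_versions(SAMPLE_MOUNTS).items():
    print(mount_point, version)
```

On a real host, the contents of <code>/proc/mounts</code> would be passed in place of the sample string.<br>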

&gt;<br>

&gt; 2. This bug isn&#39;t directly related to oVirt, but it affects it.<br>

&gt; I&#39;m not really sure this is the right place to report it.<br>

&gt; As I said above, there was a bug with NFS 4.1, NetApp Data ONTAP 8 and RHEL 7.x, but it was fixed in ONTAP 9.x.<br>

&gt; Now we have ONTAP 9.x on the NetApp, and it brought new bugs with RHEL 7.4 :D<br>

&gt; After updating to CentOS 7.4, NFS domains in oVirt started to hang/lock again. This happens randomly, on random hosts; after a few<br>

&gt; days of uptime the entire datacenter goes offline: hosts down, storage domains down, some VMs in UP and some in unknown state. The VMs<br>

&gt; are actually still working, and so is HostedEngine, but I can&#39;t control the environment.<br>

&gt; There are many hanging ioprocess (&gt;1300) and vdsm (&gt;1300) processes on some hosts, and some of the dd commands that check<br>

&gt; the storage are hanging too:<br>

&gt;         ├─vdsmd─┬─2*[dd]<br>

&gt;         │       ├─1304*[ioprocess───{<wbr>ioprocess}]<br>

&gt;         │       ├─12*[ioprocess───4*[{<wbr>ioprocess}]]<br>

&gt;         │       └─1365*[{vdsmd}]<br>

&gt; vdsm     19470  0.0  0.0   4360   348 ?        D&lt;   Jan21   0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.<wbr>xx.xx.xx:_test__nfs__sas_iso/<wbr>6cd147b4-8039-4f8a-8aa7-<wbr>5fd444454d81/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct<br>

&gt; vdsm     40707  0.0  0.0   4360   348 ?        D&lt;   00:44   0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.<wbr>xx.xx.xx:_test__nfs__sas_<wbr>export/58d9e2c2-8fef-4abc-<wbr>be13-a273d6af320f/dom_md/<wbr>metadata of=/dev/null bs=4096 count=1 iflag=direct<br>
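The <code>D</code> in the STAT column above marks uninterruptible sleep, which is typical of processes blocked on a hung NFS mount. As a rough sketch, such processes can be picked out of <code>ps aux</code>-style output as follows; the function name is hypothetical, the column layout is assumed to be the standard 11-column <code>ps aux</code> format, and the paths in the sample are placeholders.<br>

```python
def uninterruptible(ps_text):
    """Return (pid, command) for every process whose STAT starts with 'D'."""
    hung = []
    for line in ps_text.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) == 11 and fields[7].startswith("D"):
            hung.append((int(fields[1]), fields[10]))
    return hung


# Sample modeled on the output above; the metadata paths are placeholders.
SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
vdsm     19470  0.0  0.0   4360   348 ?        D<   Jan21   0:00 /usr/bin/dd if=/path/to/iso/metadata of=/dev/null bs=4096 count=1 iflag=direct
vdsm     40707  0.0  0.0   4360   348 ?        D<   00:44   0:00 /usr/bin/dd if=/path/to/export/metadata of=/dev/null bs=4096 count=1 iflag=direct
root         1  0.0  0.0 193700  6800 ?        Ss   Jan20   0:05 /usr/lib/systemd/systemd
"""

for pid, command in uninterruptible(SAMPLE_PS):
    print(pid, command)
```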

&gt;<br>

&gt; vdsm hangs at 100% CPU load.<br>

&gt; If I try to ls these files, ls hangs as well.<br>

&gt;<br>

&gt; I&#39;ve captured some traffic dumps, and it looks like a problem with STATEID. I found 2 issues on the Red Hat web site, but they aren&#39;t<br>

&gt; publicly available, so I can&#39;t read the solutions:<br>

&gt; <a href="https://access.redhat.com/solutions/3214331" rel="noreferrer" target="_blank">https://access.redhat.com/<wbr>solutions/3214331</a>   (in my case I have STATEID test)<br>

&gt; <a href="https://access.redhat.com/solutions/3164451" rel="noreferrer" target="_blank">https://access.redhat.com/<wbr>solutions/3164451</a>   (in my case there is no manager thread)<br>

&gt; But it looks like I have a different stateid issue.<br>

&gt; According to the dumps, my hosts send: TEST_STATEID<br>

&gt; The NetApp replies: Status: NFS4ERR_BAD_STATEID (10025)<br>

&gt; After this the host sends: Network File System, Ops(5): SEQUENCE, PUTFH, OPEN, ACCESS, GETATTR<br>

&gt; Reply: V4 Reply (Call In 17) OPEN StateID: 0xa205<br>

&gt; Request: V4 Call (Reply In 22) READ StateID: 0xca5f Offset: 0 Len: 4096<br>

&gt; Reply: V4 Reply (Call In 19) READ Status: NFS4ERR_BAD_STATEID<br>

&gt;<br>

&gt;<br>

&gt; The entire conversation looks like this:<br>

&gt; No.     Time           Source             Destination       Protocol  Length Info<br>

&gt;       1 0.000000       10._host_          10._netapp_        NFS      238    V4 Call (Reply In 2) TEST_STATEID<br>

&gt;       2 0.000251       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 1) TEST_STATEID (here is Status: NFS4ERR_BAD_STATEID (10025))<br>

&gt;       3 0.000352       10._host_          10._netapp_        NFS      338    V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/<br>

&gt;       4 0.000857       10._netapp_        10._host_          NFS      394    V4 Reply (Call In 3) OPEN StateID: 0xa205<br>

&gt;       5 0.000934       10._host_          10._netapp_        NFS      302    V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096<br>

&gt;       6 0.000964       10._host_          10._netapp_        NFS      302    V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096<br>

&gt;       7 0.001133       10._netapp_        10._host_          TCP      70     2049 → 683 [ACK] Seq=425 Ack=901 Win=10240 Len=0 TSval=225608100 TSecr=302215289<br>

&gt;       8 0.001258       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID<br>

&gt;       9 0.001320       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID<br>
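For longer captures, the failing replies can be pulled out of a packet summary like the one above with a small script. This is only a sketch: it assumes the Info column text has the same shape as in the listing, and the function name is hypothetical.<br>

```python
import re

# Matches replies such as "V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID".
# '.' does not match newlines, so each match stays within one summary line.
BAD_REPLY = re.compile(r"V4 Reply \(Call In (\d+)\) (\w+).*NFS4ERR_BAD_STATEID")


def bad_stateid_replies(summary_text):
    """Return (call_number, operation) for each reply with NFS4ERR_BAD_STATEID."""
    return [(int(m.group(1)), m.group(2)) for m in BAD_REPLY.finditer(summary_text)]


# Info column text copied from the conversation above.
SAMPLE_SUMMARY = """\
V4 Call (Reply In 2) TEST_STATEID
V4 Reply (Call In 1) TEST_STATEID (here is Status: NFS4ERR_BAD_STATEID (10025))
V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/
V4 Reply (Call In 3) OPEN StateID: 0xa205
V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096
V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096
V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID
V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID
"""

for call_number, operation in bad_stateid_replies(SAMPLE_SUMMARY):
    print(call_number, operation)
```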

&gt;<br>

&gt; Sometimes clearing locks on the NetApp (vserver locks break) and killing dd/ioprocess helps for a while.<br>

&gt; Right now my test setup is in this state. It looks like the lock problem is always with the metadata/disk check, not the domain itself:<br>

&gt; I can read and write other files on this mount point from the same host.<br>

&gt;<br>

&gt; The hosts run kernel 3.10.0-693.11.6.el7.x86_64 and oVirt 4.2.0.<br>

&gt; I can&#39;t tell whether it&#39;s a NetApp or a CentOS bug.<br>

&gt; If somebody wants to take a closer look at the dumps, I can mail them directly.<br>

<br>

</div></div><div class="gmail-HOEnZb"><div class="gmail-h5">______________________________<wbr>_________________<br>

Users mailing list<br>

<a href="mailto:Users@ovirt.org">Users@ovirt.org</a><br>

<a href="http://lists.ovirt.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://lists.ovirt.org/<wbr>mailman/listinfo/users</a><br>

</div></div></blockquote></div><br></div></div>