<html><head><title>Re: [ovirt-users] Found some bugs with NFS.</title>
<META http-equiv=Content-Type content="text/html; charset=utf-8">
</head>
<body>
<br><br>
<span style=" font-family:'Courier New'; font-size: 9pt;">I'll post second part there.<br>
Unfotunately I can't use fedora as ovirt node(unsupported), and share hangs only after some time,<br>
I'm trying to find out what type of IO, leads to this hang, I'll try on other OSes if I'll find what<br>
to try.<br>
But first part is directly related to ovirt, I think.<br>
<br>
<br>
<span style=" font-family:'arial'; font-size: 8pt; color: #c0c0c0;"><i>-- <br>
<br>
<br>
<br>
Tuesday, January 23, 2018, 21:59:12:<br>
<br>
</i></span></span><table>
<tr>
<td width=2 bgcolor= #0000ff><br>
</td>
<td><br><br>
<br>
<span style=" font-family:'courier new'; font-size: 9pt;">On Tue, Jan 23, 2018 at 6:47 PM, Sergey Kulikov <</span><a style=" font-family:'courier new'; font-size: 9pt;" href="mailto:serg_k@msm.ru">serg_k@msm.ru</a><span style=" font-family:'courier new'; font-size: 9pt;">> wrote:<br>
<br>
Or maybe somebody can point me to the right place for submitting this?<br>
Thanks. :)<br>
<br>
CentOS has a bug tracker [1], but I think it's worthwhile understanding whether this is reproducible with another OS. Fedora, for example.<br>
Y.<br>
<br>
[1] </span><a style=" font-family:'courier new'; font-size: 9pt;" href="https://bugs.centos.org/main_page.php">https://bugs.centos.org/main_page.php</a> <br>
<span style=" font-family:'courier new'; font-size: 9pt;">---<br>
<br>
<br>
<br>
Monday, January 22, 2018, 14:10:53:<br>
<br>
> This is a test environment running CentOS 7.4 and oVirt 4.2.0 with kernel 3.10.0-693.11.6.el7.x86_64 (3.10.0-693.11.1 and 3.10.0-693 have the same bugs)<br>
><br>
><br>
> 1. Can't force NFS to 4.0.<br>
> Some time ago I set the NFS version for all storage domains to V4, because there was a bug with NetApp Data ONTAP 8.x<br>
> and RHEL using NFS 4.1 (NFS mounts started to hang after a while, STATEID problems). On CentOS 7.2 and 7.3 the V4 option mounted NFS as 4.0,<br>
> so there were no NFS-related problems. Some time later CentOS 7.4 was released, and I noticed that the mount points started to hang again:<br>
> NFS was now mounted with vers=4.1, and it's not possible to change it to 4.0; both the "V4" and "V4.1" options mount as 4.1. It looks like the V4 option uses the<br>
> system default version for 4.x, which as far as I know changed in CentOS 7.4 from 4.0 to 4.1. Maybe a 4.0 option should be added<br>
> to force version 4.0, because adding vers=/nfsvers= in "Additional mount options" is denied by oVirt.<br>
> I know I can turn 4.1 off on the NetApp side, but there may be situations where the storage is out of your control, and version 4.0 can't be<br>
> forced on the oVirt side.<br>
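> For reference, here is roughly how I check which minor version was actually negotiated, and what a manual mount forcing 4.0 looks like from the shell (the address and export path below are placeholders):<br>
> <br>
> # show the negotiated options of the current NFS mounts (look for vers=4.0 vs vers=4.1)<br>
> nfsstat -m<br>
> # a manual test mount forcing NFS 4.0 -- this works from the shell,<br>
> # but oVirt rejects vers=/nfsvers= in "Additional mount options"<br>
> mount -t nfs -o vers=4.0 10.xx.xx.xx:/test_nfs_sas_iso /mnt/test<br>
> <br>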
><br>
> 2. This bug isn't directly related to oVirt, but it affects it.<br>
> I'm not really sure this is the right place to report it.<br>
> As I said before, there was a bug with NFS 4.1, NetApp Data ONTAP 8 and RHEL 7.x, but it was fixed in ONTAP 9.x.<br>
> Now we have ONTAP 9.x on the NetApp, and it brought new bugs with RHEL 7.4 :D<br>
> After updating to CentOS 7.4, NFS domains in oVirt started to hang/lock again. This happens randomly, on random hosts, after a few<br>
> days of uptime: the entire datacenter goes offline, hosts are down, storage domains are down, some VMs are in the UP state and some in an unknown state, but<br>
> the VMs are actually working, and HostedEngine is also working; I just can't control the environment.<br>
> There are many hanging ioprocess instances (>1300) and vdsmd threads (>1300) on some hosts, and the dd commands that check the<br>
> storage are hanging too:<br>
> ├─vdsmd─┬─2*[dd]<br>
> │ ├─1304*[ioprocess───{ioprocess}]<br>
> │ ├─12*[ioprocess───4*[{ioprocess}]]<br>
> │ └─1365*[{vdsmd}]<br>
> vdsm 19470 0.0 0.0 4360 348 ? D< Jan21 0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_iso/6cd147b4-8039-4f8a-8aa7-5fd444454d81/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct<br>
> vdsm 40707 0.0 0.0 4360 348 ? D< 00:44 0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_export/58d9e2c2-8fef-4abc-be13-a273d6af320f/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct<br>
><br>
> vdsm is hanging at 100% CPU load.<br>
> If I try to ls these files, ls hangs as well.<br>
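> All the hung dd/ioprocess commands are stuck in uninterruptible sleep (the D state in the ps output above); a rough way to list them on an affected host:<br>
> <br>
> # list processes stuck in uninterruptible sleep, with the kernel function they are waiting in<br>
> ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'<br>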
><br>
> I've made some dumps of the traffic, and it looks like a problem with STATEID. I found two issues on the Red Hat website, but they aren't<br>
> publicly available, so I can't read the solutions:<br>
> </span><a style=" font-family:'courier new'; font-size: 9pt;" href="https://access.redhat.com/solutions/3214331">https://access.redhat.com/solutions/3214331</a><span style=" font-family:'courier new'; font-size: 9pt;"> (in my case there is a STATEID test)<br>
> </span><a style=" font-family:'courier new'; font-size: 9pt;" href="https://access.redhat.com/solutions/3164451">https://access.redhat.com/solutions/3164451</a><span style=" font-family:'courier new'; font-size: 9pt;"> (in my case there is no manager thread)<br>
> But it looks like I have yet another issue with stateid.<br>
> According to dumps my hosts are sending: TEST_STATEID<br>
> netapp reply is: Status: NFS4ERR_BAD_STATEID (10025)<br>
> After this host sends: Network File System, Ops(5): SEQUENCE, PUTFH, OPEN, ACCESS, GETATTR<br>
> Reply: V4 Reply (Call In 17) OPEN StateID: 0xa205<br>
> Request: V4 Call (Reply In 22) READ StateID: 0xca5f Offset: 0 Len: 4096<br>
> Reply: V4 Reply (Call In 19) READ Status: NFS4ERR_BAD_STATEID<br>
><br>
><br>
> The entire conversation looks like this:<br>
> No. Time Source Destination Protocol Length Info<br>
> 1 0.000000 10._host_ 10._netapp_ NFS 238 V4 Call (Reply In 2) TEST_STATEID<br>
> 2 0.000251 10._netapp_ 10._host_ NFS 170 V4 Reply (Call In 1) TEST_STATEID (here is Status: NFS4ERR_BAD_STATEID (10025))<br>
> 3 0.000352 10._host_ 10._netapp_ NFS 338 V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/<br>
> 4 0.000857 10._netapp_ 10._host_ NFS 394 V4 Reply (Call In 3) OPEN StateID: 0xa205<br>
> 5 0.000934 10._host_ 10._netapp_ NFS 302 V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096<br>
> 6 0.000964 10._host_ 10._netapp_ NFS 302 V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096<br>
> 7 0.001133 10._netapp_ 10._host_ TCP 70 2049 → 683 [ACK] Seq=425 Ack=901 Win=10240 Len=0 TSval=225608100 TSecr=302215289<br>
> 8 0.001258 10._netapp_ 10._host_ NFS 170 V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID<br>
> 9 0.001320 10._netapp_ 10._host_ NFS 170 V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID<br>
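> <br>
> For anyone who wants to reproduce the capture, I recorded the traffic roughly like this (the interface name and filer address are placeholders) and then looked at the TEST_STATEID/READ exchanges in Wireshark:<br>
> <br>
> # capture all NFS traffic between this host and the filer into a pcap file<br>
> tcpdump -i em1 -s 0 -w /tmp/nfs-stateid.pcap host 10.xx.xx.xx and port 2049<br>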
><br>
> Sometimes clearing the locks on the NetApp (vserver locks break, see below) and killing the dd/ioprocess processes helps for a while.<br>
> Right now my test setup is in this state. It looks like the lock problem always involves the metadata/disk check, not the domain itself:<br>
> I can read and write other files in this mountpoint from the same host.<br>
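> From the ONTAP 9 CLI, clearing the locks looks roughly like this, written from memory (test_vs, the volume and LIF names are placeholders, and the exact break parameters vary between ONTAP releases, so check "man vserver locks break" on the filer):<br>
> <br>
> ::> vserver locks show -vserver test_vs<br>
> ::> vserver locks break -vserver test_vs -volume test_nfs_sas_iso -lif nfs_lif1 -client-address 10.xx.xx.xx<br>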
><br>
> The hosts run the 3.10.0-693.11.6.el7.x86_64 kernel and oVirt 4.2.0.<br>
> I can't figure out whether it's a NetApp or a CentOS bug.<br>
> If somebody wants to take a closer look at the dumps, I can mail them directly.<br>
<br>
_______________________________________________<br>
Users mailing list<br>
</span><a style=" font-family:'courier new'; font-size: 9pt;" href="mailto:Users@ovirt.org">Users@ovirt.org</a><br>
<a style=" font-family:'courier new'; font-size: 9pt;" href="http://lists.ovirt.org/mailman/listinfo/users">http://lists.ovirt.org/mailman/listinfo/users</a></td>
</tr>
</table>
</body></html>