Re: [ovirt-users] Found some bugs with NFS.

I'll post the second part there.
Unfortunately, I can't use Fedora as an oVirt node (it's unsupported), and the share hangs only after some time.
I'm trying to find out what type of I/O leads to this hang; I'll try other OSes once I know what
to test.
But the first part is directly related to oVirt, I think.

--

Tuesday, January 23, 2018, 21:59:12:

On Tue, Jan 23, 2018 at 6:47 PM, Sergey Kulikov <serg_k@msm.ru> wrote:

Or maybe somebody can point me to the right place for submitting this?
Thanks. :)

CentOS has a bug tracker[1], but I think it's worthwhile understanding whether it is reproducible with another OS. Fedora, for example.
Y.

[1] https://bugs.centos.org/main_page.php
---

Monday, January 22, 2018, 14:10:53:

> This is a test environment running CentOS 7.4, oVirt 4.2.0, kernel 3.10.0-693.11.6.el7.x86_64 (3.10.0-693.11.1 and 3.10.0-693 have the same bugs)
>
>
> 1. Can't force NFS to 4.0.
> Some time ago I set the NFS version for all storage domains to V4, because of a bug between NetApp Data ONTAP 8.x
> and RHEL when using NFS 4.1 (the NFS mount started to hang after a while, STATEID problems). With V4, CentOS 7.2 and 7.3 mounted NFS as 4.0,
> so there were no NFS-related problems. After CentOS 7.4 was released, I noticed that mount points started to hang again:
> NFS was mounted with vers=4.1, and it's not possible to change it to 4.0; both the "V4" and "V4.1" options mount as 4.1. It looks like the V4 option is
> the system default version for 4.x, and as far as I know that default changed in CentOS 7.4 from 4.0 to 4.1. Maybe a 4.0 option should be added
> to force version 4.0? Adding vers=/nfsvers= in "Additional mount options" is rejected by oVirt.
> I know I can turn 4.1 off on the NetApp side, but there may be situations where the storage is outside my control, and version 4.0 can't be
> set on the oVirt side.
>
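
To confirm which NFS version each mount actually ended up with, the options column of /proc/mounts can be inspected on the host. Below is a minimal Python sketch, assuming the usual /proc/mounts layout (device, mount point, type, options, ...) and that the kernel reports the version as a vers= or nfsvers= option; the function name nfs_versions is only illustrative and is not part of oVirt or vdsm.

```python
def nfs_versions(mounts_text):
    """Map each NFS mount point to the version string found in its options.

    mounts_text is the raw content of /proc/mounts. Assumption: the version
    appears as "vers=X" or "nfsvers=X"; a split "minorversion=" option, if
    a kernel reports one, is not handled in this sketch.
    """
    versions = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        vers = None
        for opt in fields[3].split(","):
            if opt.startswith(("vers=", "nfsvers=")):
                vers = opt.split("=", 1)[1]
        versions[fields[1]] = vers
    return versions
```

On a host this would be called as nfs_versions(open("/proc/mounts").read()); any storage domain showing 4.1 where 4.0 was intended is affected by the behaviour described in point 1.
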
> 2. This bug isn't directly related to oVirt, but it affects it.
> I'm not really sure this is the right place to report it.
> As I said before, there was a bug with NFS 4.1, NetApp Data ONTAP 8 and RHEL 7.x, but it was fixed in ONTAP 9.x.
> Now we have ONTAP 9.x on the NetApp, and it brought new bugs with RHEL 7.4 :D
> After updating to CentOS 7.4, NFS domains in oVirt started to hang/lock again. This happens randomly, on random hosts, after a few
> days of uptime: the entire datacenter goes offline, hosts down, storage domains down, some VMs in UP and some in unknown state. The
> VMs are actually working, and HostedEngine is also working, but I can't control the environment.
> There are many hanging ioprocess (>1300) and vdsm processes (>1300) on some hosts; there are also some hanging dd commands that check
> the storage:
> ├─vdsmd─┬─2*[dd]
> │       ├─1304*[ioprocess───{ioprocess}]
> │       ├─12*[ioprocess───4*[{ioprocess}]]
> │       └─1365*[{vdsmd}]
> vdsm 19470 0.0 0.0 4360 348 ? D< Jan21 0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_iso/6cd147b4-8039-4f8a-8aa7-5fd444454d81/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
> vdsm 40707 0.0 0.0 4360 348 ? D< 00:44 0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_export/58d9e2c2-8fef-4abc-be13-a273d6af320f/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
>
> vdsm is hanging at 100% CPU load.
> If I try to ls these files, ls hangs as well.
>
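
The stuck checks are easy to spot because they sit in uninterruptible sleep (the "D<" in the STAT column above). A small Python sketch for pulling them out of saved `ps aux` output, assuming the standard column order (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND); the helper name uninterruptible is hypothetical:

```python
def uninterruptible(ps_aux_text):
    """Return (pid, command) for every process whose STAT starts with 'D'.

    ps_aux_text is the text output of `ps aux`. Assumption: the standard
    11-column layout, with everything after the TIME column treated as
    the command line.
    """
    hung = []
    for line in ps_aux_text.splitlines():
        fields = line.split(None, 10)
        # Skip the header line and anything too short to be a process row
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        if fields[7].startswith("D"):
            hung.append((int(fields[1]), fields[10]))
    return hung
```

Run against the output above, this would list the two dd metadata checks, which shows at a glance which storage domains are affected.
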
> I've captured some traffic dumps, and it looks like a problem with STATEID. I found 2 articles on the Red Hat website, but they aren't
> publicly available, so I can't read the solution:
> https://access.redhat.com/solutions/3214331 (in my case I have STATEID test)
> https://access.redhat.com/solutions/3164451 (in my case there is no manager thread)
> But it looks like I have a different issue with stateid.
> According to the dumps, my hosts are sending: TEST_STATEID
> The NetApp reply is: Status: NFS4ERR_BAD_STATEID (10025)
> After this the host sends: Network File System, Ops(5): SEQUENCE, PUTFH, OPEN, ACCESS, GETATTR
> Reply: V4 Reply (Call In 17) OPEN StateID: 0xa205
> Request: V4 Call (Reply In 22) READ StateID: 0xca5f Offset: 0 Len: 4096
> Reply: V4 Reply (Call In 19) READ Status: NFS4ERR_BAD_STATEID
>
>
> The entire conversation looks like:
> No. Time Source Destination Protocol Length Info
> 1 0.000000 10._host_ 10._netapp_ NFS 238 V4 Call (Reply In 2) TEST_STATEID
> 2 0.000251 10._netapp_ 10._host_ NFS 170 V4 Reply (Call In 1) TEST_STATEID (here is Status: NFS4ERR_BAD_STATEID (10025))
> 3 0.000352 10._host_ 10._netapp_ NFS 338 V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/
> 4 0.000857 10._netapp_ 10._host_ NFS 394 V4 Reply (Call In 3) OPEN StateID: 0xa205
> 5 0.000934 10._host_ 10._netapp_ NFS 302 V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096
> 6 0.000964 10._host_ 10._netapp_ NFS 302 V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096
> 7 0.001133 10._netapp_ 10._host_ TCP 70 2049 → 683 [ACK] Seq=425 Ack=901 Win=10240 Len=0 TSval=225608100 TSecr=302215289
> 8 0.001258 10._netapp_ 10._host_ NFS 170 V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID
> 9 0.001320 10._netapp_ 10._host_ NFS 170 V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID
>
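
For longer captures it may help to pair each failing reply with its call automatically. The following Python sketch works on a text export of the packet summary in the format shown above; it assumes the "V4 Call (Reply In N)" / "V4 Reply (Call In N)" wording and that the error name appears somewhere in the reply's Info text. The name failed_calls is illustrative only.

```python
import re

# Patterns for the summary lines as exported above (an assumption about format)
CALL_RE = re.compile(r"^\s*(\d+)\s.*V4 Call \(Reply In (\d+)\)\s+(\S+)(.*)$")
REPLY_RE = re.compile(r"^\s*(\d+)\s.*V4 Reply \(Call In (\d+)\)\s+(\S+)(.*)$")
ERR_RE = re.compile(r"NFS4ERR_\w+")

def failed_calls(summary_text):
    """Return (call frame number, operation, error) for each failed NFS call."""
    calls = {}      # call frame number -> operation name
    failures = []
    for line in summary_text.splitlines():
        line = line.lstrip("> ")  # tolerate mail quoting
        m = CALL_RE.match(line)
        if m:
            calls[int(m.group(1))] = m.group(3)
            continue
        m = REPLY_RE.match(line)
        if m:
            err = ERR_RE.search(m.group(4))
            if err:
                call_no = int(m.group(2))
                op = calls.get(call_no, m.group(3))
                failures.append((call_no, op, err.group(0)))
    return failures
```

On the nine-frame conversation above, this reports the TEST_STATEID call in frame 1 and both READ calls in frames 5 and 6 as answered with NFS4ERR_BAD_STATEID.
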
> Sometimes clearing locks on the NetApp (vserver locks break) and killing dd/ioprocess helps for a while.
> Right now my test setup is in this state. It looks like the lock problem is always with the metadata/disk check, not the domain itself:
> I can read and write other files in this mount point from the same host.
>
> Hosts have the 3.10.0-693.11.6.el7.x86_64 kernel and oVirt 4.2.0.
> I can't work out whether it's a NetApp or a CentOS bug.
> If somebody wants to look more closely at the dumps, I can mail them directly.

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users