I'll post the second part there.
Unfortunately I can't use Fedora as an oVirt node (unsupported), and the share hangs only after some time.
I'm trying to find out what type of IO leads to this hang; I'll try other OSes if I can figure out what
to try.
But the first part is directly related to oVirt, I think.


--

Tuesday, January 23, 2018, 21:59:12:

On Tue, Jan 23, 2018 at 6:47 PM, Sergey Kulikov <serg_k@msm.ru> wrote:

Or maybe somebody can point me to the right place for submitting this?
Thanks. :)

CentOS has a bugtracker[1], but I think it's worth understanding whether it is reproducible with another OS. Fedora, for example.
Y.

[1] https://bugs.centos.org/main_page.php
---

Monday, January 22, 2018, 14:10:53:

> This is a test environment running CentOS 7.4 and oVirt 4.2.0, kernel 3.10.0-693.11.6.el7.x86_64 (3.10.0-693.11.1 and 3.10.0-693 show the same bugs).
>
>
> 1. Can't force NFS to 4.0.
> Some time ago I set the NFS version for all storage domains to V4, because there was a bug between NetApp Data ONTAP 8.x
> and RHEL on NFS 4.1 (NFS mounts started to hang after a while, STATEID problems). On CentOS 7.2 and 7.3 the V4 option mounted NFS as 4.0,
> so there were no NFS-related problems. After CentOS 7.4 was released I noticed that mount points started to hang again:
> NFS was now mounted with vers=4.1, and it's not possible to change this to 4.0; both the "V4" and "V4.1" options mount as 4.1. It looks like the V4 option uses the
> system default version for 4.x, and as far as I know that default was changed in CentOS 7.4 from 4.0 to 4.1. Maybe a "4.0" option should be added
> to force version 4.0, because adding vers=/nfsvers= in "Additional mount options" is denied by oVirt.
> I know I can disable 4.1 on the NetApp side, but there may be situations where the storage is out of our control, and the 4.0 version can't be
> set on the oVirt side.
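>
> To illustrate (a minimal sketch outside of oVirt; the server address, export path and mountpoint are placeholders), the negotiated minor version can be checked by hand:
>
>     # mount with the generic v4 option, then with an explicit minor version
>     mount -t nfs -o vers=4 10.xx.xx.xx:/test_nfs /mnt/nfstest
>     nfsstat -m    # on CentOS 7.4 this reports vers=4.1
>     umount /mnt/nfstest
>     mount -t nfs -o vers=4.0 10.xx.xx.xx:/test_nfs /mnt/nfstest
>     nfsstat -m    # now reports vers=4.0, the option oVirt won't let us pass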
>
> 2. This bug isn't directly related to oVirt, but it affects it.
> I'm not really sure that this is the right place to report it.
> As I said before, there was a bug between NFS 4.1, NetApp Data ONTAP 8 and RHEL 7.x, but it was fixed in ONTAP 9.x.
> Now we have ONTAP 9.x on the NetApp, and it brought new bugs with RHEL 7.4 :D
> After updating to CentOS 7.4 the NFS domains in oVirt started to hang/lock again. This happens randomly, on random hosts, after a few
> days of uptime: the entire datacenter goes offline, hosts down, storage domains down, some VMs in UP and some in unknown state, but
> the VMs are actually working, and HostedEngine is working too, yet I can't control the environment.
> There are many hanging ioprocess (>1300) and vdsm processes (>1300) on some hosts, and some of the dd commands that check the
> storage are hanging as well:
>         ├─vdsmd─┬─2*[dd]
>         │       ├─1304*[ioprocess───{ioprocess}]
>         │       ├─12*[ioprocess───4*[{ioprocess}]]
>         │       └─1365*[{vdsmd}]
> vdsm     19470  0.0  0.0   4360   348 ?        D<   Jan21   0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_iso/6cd147b4-8039-4f8a-8aa7-5fd444454d81/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
> vdsm     40707  0.0  0.0   4360   348 ?        D<   00:44   0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_export/58d9e2c2-8fef-4abc-be13-a273d6af320f/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
>
> vdsm hangs at 100% CPU load.
> If I try to ls these files, ls hangs too.
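>
> For what it's worth (a sketch; PID 19470 is taken from the listing above, and reading /proc/<pid>/stack needs root), the kernel-side stacks of the D-state processes can be dumped to confirm where they are blocked:
>
>     # list uninterruptible (D-state) processes and the kernel function they sleep in
>     ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
>     # kernel stack of one hung dd
>     cat /proc/19470/stack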
>
> I've made some dumps of the traffic, and it looks like a problem with the STATEID. I've found 2 issues on the Red Hat web site, but they aren't
> publicly available, so I can't read the solution:
> https://access.redhat.com/solutions/3214331   (in my case I have a STATEID test)
> https://access.redhat.com/solutions/3164451   (in my case there is no manager thread)
> But it looks like I have yet another issue with the stateid.
> According to the dumps my hosts are sending: TEST_STATEID
> The NetApp reply is: Status: NFS4ERR_BAD_STATEID (10025)
> After this the host sends: Network File System, Ops(5): SEQUENCE, PUTFH, OPEN, ACCESS, GETATTR
> Reply: V4 Reply (Call In 17) OPEN StateID: 0xa205
> Request: V4 Call (Reply In 22) READ StateID: 0xca5f Offset: 0 Len: 4096
> Reply: V4 Reply (Call In 19) READ Status: NFS4ERR_BAD_STATEID
> Note that the OPEN reply carries StateID 0xa205, but the subsequent READ still uses StateID 0xca5f, which the server rejects.
>
>
> The entire conversation looks like:
> No.     Time           Source             Destination       Protocol  Length Info
>       1 0.000000       10._host_          10._netapp_        NFS      238    V4 Call (Reply In 2) TEST_STATEID
>       2 0.000251       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 1) TEST_STATEID (here is Status: NFS4ERR_BAD_STATEID (10025))
>       3 0.000352       10._host_          10._netapp_        NFS      338    V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/
>       4 0.000857       10._netapp_        10._host_          NFS      394    V4 Reply (Call In 3) OPEN StateID: 0xa205
>       5 0.000934       10._host_          10._netapp_        NFS      302    V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096
>       6 0.000964       10._host_          10._netapp_        NFS      302    V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096
>       7 0.001133       10._netapp_        10._host_          TCP      70     2049 → 683 [ACK] Seq=425 Ack=901 Win=10240 Len=0 TSval=225608100 TSecr=302215289
>       8 0.001258       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID
>       9 0.001320       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID
>
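> In case anyone wants to reproduce the capture (a sketch; the interface name is a placeholder), something like this isolates the NFS conversation and then filters the bad-stateid replies:
>
>     # capture NFS traffic between host and filer
>     tcpdump -i eth0 -s 0 -w nfs_hang.pcap host 10._netapp_ and port 2049
>     # show only NFS4ERR_BAD_STATEID (10025) packets; the status field is
>     # nfs.nfsstat4 in older Wireshark/tshark builds, nfs.status in newer ones
>     tshark -r nfs_hang.pcap -Y 'nfs.nfsstat4 == 10025'
>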
> Sometimes clearing the locks on the NetApp (vserver locks break) and killing dd/ioprocess helps for a while.
> Right now my test setup is in this state. It looks like the lock problem always involves the metadata/disk check, but not the domain itself:
> I can read and write other files in this mountpoint from the same host.
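>
> For reference, the lock-clearing workaround looks roughly like this on the filer (a sketch; the vserver and volume names are placeholders, and the exact arguments may differ between ONTAP releases):
>
>     ::> vserver locks show -vserver svm_test -volume test_nfs_sas
>     ::> vserver locks break -vserver svm_test -volume test_nfs_sas -client-address 10._host_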
>
> Hosts are running the 3.10.0-693.11.6.el7.x86_64 kernel and oVirt 4.2.0.
> I can't figure out whether it's a NetApp or a CentOS bug.
> If somebody wants to look closer at the dumps, I can mail them directly.
