Found some bugs with NFS.

22 Jan 2018

This is a test environment running CentOS 7.4, oVirt 4.2.0, kernel 3.10.0-693.11.6.el7.x86_64 (3.10.0-693.11.1 and 3.10.0-693 have the same bugs).

1. Can't force NFS to 4.0.
Some time ago I set the NFS version for all storage domains to V4, because of a bug between NetApp Data ONTAP 8.x
and RHEL when using NFS 4.1 (NFS mounts started to hang after a while, stateid problems). On CentOS 7.2 and 7.3 the V4 option
mounted NFS as 4.0, so there were no NFS-related problems. After CentOS 7.4 was released I noticed that the mount points
started to hang again: NFS was mounted with vers=4.1, and it's not possible to change it to 4.0, because both the "V4" and
"V4.1" options mount as 4.1. It looks like the V4 option means the system default version for 4.x, and as far as I know that
default changed from 4.0 to 4.1 in CentOS 7.4. Maybe a 4.0 option should be added to force version 4.0? Adding vers=/nfsvers=
in "Additional mount options" is rejected by oVirt.
I know I can turn 4.1 off on the NetApp side, but there may be situations where the storage is outside my control, and
version 4.0 can't be set on the oVirt side.
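
To see which version a host really negotiated, the vers= option in /proc/mounts can be checked. Below is a minimal Python sketch of that check. The function name and the sample mount line are made up for illustration (the address and export are placeholders, not my real ones); on a real host the text would come from /proc/mounts.

```python
def nfs_mount_versions(mounts_text):
    """Map each NFS mount point to its 'vers='/'nfsvers=' value (None if absent)."""
    versions = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fstype, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        vers = None
        for opt in fields[3].split(","):
            if opt.startswith(("vers=", "nfsvers=")):
                vers = opt.split("=", 1)[1]
        versions[fields[1]] = vers
    return versions


# Hypothetical sample in /proc/mounts format, for illustration only.
SAMPLE_MOUNTS = (
    "rootfs / rootfs rw 0 0\n"
    "192.0.2.10:/example_export /rhev/data-center/mnt/192.0.2.10:_example__export "
    "nfs4 rw,relatime,vers=4.1,proto=tcp,timeo=600 0 0\n"
)

for mount_point, vers in nfs_mount_versions(SAMPLE_MOUNTS).items():
    print(mount_point, "->", vers)
```

With the V4 option selected in oVirt, a value of 4.1 here would confirm the behaviour described above.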

2. This bug isn't directly related to oVirt, but it affects it.
I'm not really sure this is the right place to report it.
As I said above, there was a bug with NFS 4.1, NetApp Data ONTAP 8 and RHEL 7.x, but it was fixed in ONTAP 9.x.
Now we have ONTAP 9.x on the NetApp, and it brought new bugs with RHEL 7.4 :D
After updating to CentOS 7.4, NFS domains in oVirt started to hang/lock again. This happens randomly, on random hosts, after
a few days of uptime: the entire datacenter goes offline, hosts are down, storage domains are down, some VMs are UP and some
are in an unknown state. The VMs are actually working, and HostedEngine is working too, but I can't control the environment.
There are many hanging ioprocess (>1300) and vdsm (>1300) processes on some hosts, and some dd commands that check the
storage are hanging as well:
        ├─vdsmd─┬─2*[dd]
        │       ├─1304*[ioprocess───{ioprocess}]
        │       ├─12*[ioprocess───4*[{ioprocess}]]
        │       └─1365*[{vdsmd}]
vdsm     19470  0.0  0.0   4360   348 ?        D<   Jan21   0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_iso/6cd147b4-8039-4f8a-8aa7-5fd444454d81/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
vdsm     40707  0.0  0.0   4360   348 ?        D<   00:44   0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_export/58d9e2c2-8fef-4abc-be13-a273d6af320f/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct

vdsm is hanging at 100% CPU load.
If I try to ls these files, ls hangs.
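
The hung dd commands above show up in state D (uninterruptible sleep), so they can be listed on each host by filtering the STAT column of ps aux. Here is a rough Python sketch of that filter. The function name is mine, and the systemd line in the sample is made up for contrast; the dd line is copied from the output above.

```python
def uninterruptible(ps_aux_text):
    """Return (pid, command) for each 'ps aux' row whose STAT starts with 'D'."""
    stuck = []
    for line in ps_aux_text.splitlines():
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something that is not a process row
        if fields[7].startswith("D"):
            stuck.append((int(fields[1]), fields[10]))
    return stuck


SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  19000  1500 ?        Ss   Jan20   0:01 /usr/lib/systemd/systemd
vdsm     19470  0.0  0.0   4360   348 ?        D<   Jan21   0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_iso/6cd147b4-8039-4f8a-8aa7-5fd444454d81/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
"""

for pid, command in uninterruptible(SAMPLE_PS):
    print(pid, command)
```

On a live host the input would be the output of ps aux; this only reports what is stuck, it does not explain why.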

I've captured some traffic dumps, and it looks like a problem with stateids. I found 2 solutions on the Red Hat website,
but they aren't publicly available, so I can't read them:
https://access.redhat.com/solutions/3214331   (in my case I have STATEID test)
https://access.redhat.com/solutions/3164451   (in my case there is no manager thread)
But it looks like I have a different issue with stateid.
According to the dumps, my hosts are sending: TEST_STATEID
The NetApp reply is: Status: NFS4ERR_BAD_STATEID (10025)
After this the host sends: Network File System, Ops(5): SEQUENCE, PUTFH, OPEN, ACCESS, GETATTR
Reply: V4 Reply (Call In 17) OPEN StateID: 0xa205
Request: V4 Call (Reply In 22) READ StateID: 0xca5f Offset: 0 Len: 4096
Reply: V4 Reply (Call In 19) READ Status: NFS4ERR_BAD_STATEID

The entire conversation looks like:
No.     Time           Source             Destination       Protocol  Length Info
      1 0.000000       10._host_          10._netapp_        NFS      238    V4 Call (Reply In 2) TEST_STATEID
      2 0.000251       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 1) TEST_STATEID (here is Status: NFS4ERR_BAD_STATEID (10025))
      3 0.000352       10._host_          10._netapp_        NFS      338    V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/
      4 0.000857       10._netapp_        10._host_          NFS      394    V4 Reply (Call In 3) OPEN StateID: 0xa205
      5 0.000934       10._host_          10._netapp_        NFS      302    V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096
      6 0.000964       10._host_          10._netapp_        NFS      302    V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096
      7 0.001133       10._netapp_        10._host_          TCP      70     2049 → 683 [ACK] Seq=425 Ack=901 Win=10240 Len=0 TSval=225608100 TSecr=302215289
      8 0.001258       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID
      9 0.001320       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID
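
To go through longer dumps, the summary lines can be paired up by frame number so that every call answered with NFS4ERR_BAD_STATEID is listed together with the StateID value shown for it. This is only a sketch in Python: it assumes the text layout of the summary above, the regular expressions and the function name are my own, and it is not a real NFS decoder.

```python
import re

# Patterns for summary lines in the layout shown above (an assumption).
CALL = re.compile(
    r"^\s*(\d+)\s.*V4 Call \(Reply In (\d+)\) (\w+)(?: StateID: (0x[0-9a-f]+))?"
)
REPLY = re.compile(r"^\s*(\d+)\s.*V4 Reply \(Call In (\d+)\) (\w+)")


def bad_stateid_calls(trace_text):
    """Return (call_frame, operation, stateid) for calls that got BAD_STATEID."""
    calls = {}
    failures = []
    for line in trace_text.splitlines():
        m = CALL.match(line)
        if m:
            calls[int(m.group(1))] = (m.group(3), m.group(4))
            continue
        m = REPLY.match(line)
        if m and "NFS4ERR_BAD_STATEID" in line:
            call_frame = int(m.group(2))
            # Fall back to the reply's own operation if the call was not captured.
            op, stateid = calls.get(call_frame, (m.group(3), None))
            failures.append((call_frame, op, stateid))
    return failures


# NFS lines copied from the conversation above.
TRACE = """\
      1 0.000000       10._host_          10._netapp_        NFS      238    V4 Call (Reply In 2) TEST_STATEID
      2 0.000251       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 1) TEST_STATEID (here is Status: NFS4ERR_BAD_STATEID (10025))
      3 0.000352       10._host_          10._netapp_        NFS      338    V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/
      4 0.000857       10._netapp_        10._host_          NFS      394    V4 Reply (Call In 3) OPEN StateID: 0xa205
      5 0.000934       10._host_          10._netapp_        NFS      302    V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096
      6 0.000964       10._host_          10._netapp_        NFS      302    V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096
      8 0.001258       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID
      9 0.001320       10._netapp_        10._host_          NFS      170    V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID
"""

for frame, op, stateid in bad_stateid_calls(TRACE):
    print(frame, op, stateid)
```

For this excerpt it reports the TEST_STATEID call in frame 1 and the two READ calls in frames 5 and 6, both READs carrying StateID 0xca5f.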

Sometimes clearing the locks on the NetApp (vserver locks break) and killing dd/ioprocess helps for a while.
Right now my test setup is in this state. It looks like the lock problem is always with the metadata/disk check, not with
the domain itself: I can read and write other files in this mount point from the same host.

The hosts have kernel 3.10.0-693.11.6.el7.x86_64 and oVirt 4.2.0.
I can't work out whether it's a NetApp or a CentOS bug.
If somebody wants to take a closer look at the dumps, I can mail them directly.

--

Sergey Kulikov