[ovirt-users] Found some bugs with NFS.
Sergey Kulikov
serg_k at msm.ru
Mon Jan 22 11:10:53 UTC 2018
This is a test environment running CentOS 7.4 and oVirt 4.2.0, kernel 3.10.0-693.11.6.el7.x86_64 (3.10.0-693.11.1 and 3.10.0-693 show the same bugs).
1. Can't force NFS to 4.0.
Some time ago I set the NFS version for all storage domains to "V4", because there was a bug between NetApp Data ONTAP 8.x and RHEL when using NFS 4.1 (NFS mounts started to hang after a while, STATEID problems). On CentOS 7.2 and 7.3 the "V4" option mounted NFS as 4.0, so there were no NFS-related problems. After CentOS 7.4 was released I noticed that mount points started to hang again: NFS was now being mounted with vers=4.1, and there is no way to change it back to 4.0, because both the "V4" and "V4.1" options mount as 4.1. It looks like the "V4" option uses the system default minor version for 4.x, which was changed in CentOS 7.4 from 4.0 to 4.1. Maybe a "V4.0" option should be added to force version 4.0? Adding vers=/nfsvers= in "Additional mount options" is denied by oVirt.
I know I can turn 4.1 off on the NetApp side, but there may be situations where the storage is out of our control, and version 4.0 simply can't be set on the oVirt side.
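For reference, this is roughly what the difference looks like when mounting by hand on a host (export path and mount point are placeholders; this is only to illustrate what I'd like the storage domain options to allow):

# On CentOS 7.4 a plain "vers=4" now negotiates 4.1 by default:
mount -t nfs -o vers=4 10.xx.xx.xx:/test_nfs /mnt/test
# Pinning the minor version is what forces 4.0:
mount -t nfs -o vers=4.0 10.xx.xx.xx:/test_nfs /mnt/test
# Check which version was actually negotiated:
nfsstat -m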
2. This bug isn't directly related to oVirt, but it affects it.
I'm not really sure this is the right place to report it.
As I said above, there was a bug between NFS 4.1, NetApp Data ONTAP 8 and RHEL 7.x, but it was fixed in ONTAP 9.x. Now we have ONTAP 9.x on the NetApp, and it brought new bugs with RHEL 7.4 :D
After updating to CentOS 7.4 the NFS domains in oVirt started to hang/lock again. This happens randomly, on random hosts, after a few days of uptime: the entire datacenter goes offline, hosts are down, storage domains are down, some VMs are in Up and some in Unknown state, but the VMs are actually still running, the HostedEngine is working too, yet I can't control the environment.
On some hosts there are many hanging ioprocess processes (>1300) and vdsmd threads (>1300), and some of the dd commands that check the storage are hanging as well:
├─vdsmd─┬─2*[dd]
│ ├─1304*[ioprocess───{ioprocess}]
│ ├─12*[ioprocess───4*[{ioprocess}]]
│ └─1365*[{vdsmd}]
vdsm 19470 0.0 0.0 4360 348 ? D< Jan21 0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_iso/6cd147b4-8039-4f8a-8aa7-5fd444454d81/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
vdsm 40707 0.0 0.0 4360 348 ? D< 00:44 0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_export/58d9e2c2-8fef-4abc-be13-a273d6af320f/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
vdsm is hanging at 100% CPU load.
If I try to ls these files, ls hangs as well.
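For anyone who wants to check for the same symptom, this is roughly how I count the stuck helpers and find the processes in uninterruptible sleep (nothing oVirt-specific, just generic pgrep/ps usage):

# Count ioprocess helpers spawned under vdsm:
pgrep -c ioprocess
# List processes stuck in D state (the dd checkers and ioprocess end up here):
ps -eo pid,stat,wchan:30,args | awk '$2 ~ /^D/'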
I've captured some traffic dumps, and it looks like a STATEID problem. I found two issues on the Red Hat site, but they aren't publicly available, so I can't read the solutions:
https://access.redhat.com/solutions/3214331 (in my case I have a TEST_STATEID)
https://access.redhat.com/solutions/3164451 (in my case there is no manager thread)
But it looks like I'm hitting yet another stateid issue.
According to the dumps my hosts send: TEST_STATEID
The NetApp reply is: Status: NFS4ERR_BAD_STATEID (10025)
After this the host sends: Network File System, Ops(5): SEQUENCE, PUTFH, OPEN, ACCESS, GETATTR
Reply: V4 Reply (Call In 17) OPEN StateID: 0xa205
Request: V4 Call (Reply In 22) READ StateID: 0xca5f Offset: 0 Len: 4096
Reply: V4 Reply (Call In 19) READ Status: NFS4ERR_BAD_STATEID
The entire conversation looks like this:
No. Time Source Destination Protocol Length Info
1 0.000000 10._host_ 10._netapp_ NFS 238 V4 Call (Reply In 2) TEST_STATEID
2 0.000251 10._netapp_ 10._host_ NFS 170 V4 Reply (Call In 1) TEST_STATEID (here is Status: NFS4ERR_BAD_STATEID (10025))
3 0.000352 10._host_ 10._netapp_ NFS 338 V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/
4 0.000857 10._netapp_ 10._host_ NFS 394 V4 Reply (Call In 3) OPEN StateID: 0xa205
5 0.000934 10._host_ 10._netapp_ NFS 302 V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096
6 0.000964 10._host_ 10._netapp_ NFS 302 V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096
7 0.001133 10._netapp_ 10._host_ TCP 70 2049 → 683 [ACK] Seq=425 Ack=901 Win=10240 Len=0 TSval=225608100 TSecr=302215289
8 0.001258 10._netapp_ 10._host_ NFS 170 V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID
9 0.001320 10._netapp_ 10._host_ NFS 170 V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID
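In case it helps anyone reproduce this, the capture above was taken with something along these lines (the interface name is a placeholder) and then opened in Wireshark with the "nfs" display filter:

tcpdump -i em1 -s 0 -w /tmp/nfs-stateid.pcap host 10.xx.xx.xx and port 2049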
Sometimes clearing the locks on the NetApp (vserver locks break) and killing the dd/ioprocess processes helps for a while.
Right now my test setup is in this state. It looks like the lock problem always hits the metadata/disk check, not the domain itself: I can read and write other files on this mount point from the same host.
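To illustrate, this is essentially the check vdsm runs, and it is the only read that hangs on this mount (the first path is the real one from the process list above; the second file name is just an example):

# Hangs in D state, matching the stuck dd processes above:
dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_iso/6cd147b4-8039-4f8a-8aa7-5fd444454d81/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
# Any other file on the same mount point reads fine, for example:
cat /rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_iso/some_other_file > /dev/null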
The hosts run kernel 3.10.0-693.11.6.el7.x86_64 and oVirt 4.2.0.
I can't figure out whether this is a NetApp or a CentOS bug.
If somebody wants to take a closer look at the dumps, I can mail them directly.