[ovirt-users] Found some bugs with NFS.
Yaniv Kaul
ykaul at redhat.com
Tue Jan 23 18:59:12 UTC 2018
On Tue, Jan 23, 2018 at 6:47 PM, Sergey Kulikov <serg_k at msm.ru> wrote:
>
> Or maybe somebody can point me to the right place for submitting this?
> Thanks. :)
>
CentOS has a bug tracker[1], but I think it's worth finding out whether the
problem is reproducible on another OS, Fedora for example.
Y.
[1] https://bugs.centos.org/main_page.php
> ---
>
>
>
> Monday, January 22, 2018, 14:10:53:
>
> > This is a test environment running CentOS 7.4, oVirt 4.2.0, kernel
> > 3.10.0-693.11.6.el7.x86_64 (3.10.0-693.11.1 and 3.10.0-693 have the same bugs).
> >
> >
> > 1. Can't force NFS to 4.0.
> > Some time ago I set the NFS version for all storage domains to V4, because
> > of a bug between NetApp Data ONTAP 8.x and RHEL when using NFS 4.1 (NFS
> > mounts started to hang after a while, with STATEID problems). With V4,
> > CentOS 7.2 and 7.3 mounted NFS as 4.0, so there were no NFS-related
> > problems. After CentOS 7.4 was released, I noticed that mount points
> > started to hang again: NFS was mounted with vers=4.1, and it's not possible
> > to change it to 4.0, because both the "V4" and "V4.1" options mount as 4.1.
> > It looks like the V4 option uses the system default version for 4.x, and as
> > far as I know that default changed in CentOS 7.4 from 4.0 to 4.1. Maybe a
> > 4.0 option should be added to force version 4.0? Adding vers=/nfsvers= in
> > "Additional mount options" is rejected by oVirt.
> > I know I can turn NFS 4.1 off on the NetApp side, but there may be
> > situations where the storage is outside my control, and version 4.0 can't
> > be set on the oVirt side.
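
To confirm which NFS version a host actually negotiated, the vers= option can be read back from /proc/mounts. The following is only a minimal sketch: the helper name nfs_versions and the sample mount line (address, export and mount point) are hypothetical and not taken from the report above.

```python
import re


def nfs_versions(mounts_text):
    """Map each NFS mount point to its negotiated 'vers=' value (or None)."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        match = re.search(r"(?:^|,)(?:nfs)?vers=([0-9.]+)", fields[3])
        result[fields[1]] = match.group(1) if match else None
    return result


# Hypothetical sample in /proc/mounts format, for illustration only.
SAMPLE = (
    "10.0.0.1:/test_nfs_sas_iso /rhev/data-center/mnt/iso nfs4 "
    "rw,relatime,vers=4.1,rsize=65536,proto=tcp 0 0\n"
    "/dev/sda1 /boot xfs rw,relatime 0 0\n"
)

if __name__ == "__main__":
    print(nfs_versions(SAMPLE))
```

On a real host the same function could be fed the contents of /proc/mounts, for example nfs_versions(open("/proc/mounts").read()).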
> >
> > 2. This bug isn't directly related to oVirt, but it affects it.
> > I'm not really sure this is the right place to report it.
> > As I said above, there was a bug with NFS 4.1, NetApp Data ONTAP 8 and
> > RHEL 7.x, but it was fixed in ONTAP 9.x.
> > Now we have ONTAP 9.x on the NetApp, and it brought new bugs with RHEL 7.4 :D
> > After updating to CentOS 7.4, NFS domains in oVirt started to hang/lock
> > again. This happens randomly, on random hosts, after a few days of uptime:
> > the entire datacenter goes offline, hosts are down, storage domains are
> > down, some VMs are in the UP state and some in an unknown state. The VMs
> > are actually working, and HostedEngine is working too, but I can't control
> > the environment.
> > There are many hanging ioprocess (>1300) and vdsm (>1300) processes on
> > some hosts, and some of the dd commands that check storage are hanging
> > as well:
> > ├─vdsmd─┬─2*[dd]
> > │ ├─1304*[ioprocess───{ioprocess}]
> > │ ├─12*[ioprocess───4*[{ioprocess}]]
> > │ └─1365*[{vdsmd}]
> > vdsm 19470 0.0 0.0 4360 348 ? D< Jan21 0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_iso/6cd147b4-8039-4f8a-8aa7-5fd444454d81/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
> > vdsm 40707 0.0 0.0 4360 348 ? D< 00:44 0:00 /usr/bin/dd if=/rhev/data-center/mnt/10.xx.xx.xx:_test__nfs__sas_export/58d9e2c2-8fef-4abc-be13-a273d6af320f/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct
> >
> > vdsm is hanging at 100% CPU load.
> > If I try to ls these files, ls hangs too.
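
The "D<" in the STAT column of the ps lines above marks processes in uninterruptible sleep, which is the usual sign of I/O stuck on a hung mount. Such processes can be picked out of saved "ps aux" output with a few lines of Python. This is a rough sketch only: the helper name uninterruptible and the shortened sample paths are hypothetical.

```python
def uninterruptible(ps_text):
    """Return (pid, command) for every process whose STAT starts with 'D'."""
    stuck = []
    for line in ps_text.splitlines():
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skips the header row and malformed lines
        if fields[7].startswith("D"):
            stuck.append((int(fields[1]), fields[10]))
    return stuck


# Hypothetical, shortened sample; real paths would be the dom_md/metadata files.
DD_CMD = "/usr/bin/dd if=/path/to/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct"
SAMPLE = (
    "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND\n"
    "root 1 0.0 0.1 1234 567 ? Ss Jan20 0:03 /usr/lib/systemd/systemd\n"
    "vdsm 19470 0.0 0.0 4360 348 ? D< Jan21 0:00 " + DD_CMD + "\n"
)

if __name__ == "__main__":
    for pid, command in uninterruptible(SAMPLE):
        print(pid, command)
```

Parsing saved output rather than touching the mount keeps the check itself from blocking on the hung NFS export.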
> >
> > I captured some traffic dumps, and it looks like a problem with STATEID. I
> > found two issues on the Red Hat web site, but they aren't publicly
> > available, so I can't read the solutions:
> > https://access.redhat.com/solutions/3214331 (in my case I have
> > STATEID test)
> > https://access.redhat.com/solutions/3164451 (in my case there is no
> > manager thread)
> > But it looks like I have a different issue with stateid.
> > According to the dumps, my hosts send: TEST_STATEID
> > The NetApp reply is: Status: NFS4ERR_BAD_STATEID (10025)
> > After this the host sends: Network File System, Ops(5): SEQUENCE, PUTFH,
> > OPEN, ACCESS, GETATTR
> > Reply: V4 Reply (Call In 17) OPEN StateID: 0xa205
> > Request: V4 Call (Reply In 22) READ StateID: 0xca5f Offset: 0 Len: 4096
> > Reply: V4 Reply (Call In 19) READ Status: NFS4ERR_BAD_STATEID
> >
> >
> > The entire conversation looks like:
> > No. Time     Source      Destination Protocol Length Info
> > 1   0.000000 10._host_   10._netapp_ NFS      238    V4 Call (Reply In 2) TEST_STATEID
> > 2   0.000251 10._netapp_ 10._host_   NFS      170    V4 Reply (Call In 1) TEST_STATEID (here is Status: NFS4ERR_BAD_STATEID (10025))
> > 3   0.000352 10._host_   10._netapp_ NFS      338    V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/
> > 4   0.000857 10._netapp_ 10._host_   NFS      394    V4 Reply (Call In 3) OPEN StateID: 0xa205
> > 5   0.000934 10._host_   10._netapp_ NFS      302    V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096
> > 6   0.000964 10._host_   10._netapp_ NFS      302    V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096
> > 7   0.001133 10._netapp_ 10._host_   TCP      70     2049 → 683 [ACK] Seq=425 Ack=901 Win=10240 Len=0 TSval=225608100 TSecr=302215289
> > 8   0.001258 10._netapp_ 10._host_   NFS      170    V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID
> > 9   0.001320 10._netapp_ 10._host_   NFS      170    V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID
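
For longer captures, the failing exchanges can be pulled out by pairing each "Call (Reply In N)" with its "Reply (Call In M)" and keeping the replies that carry NFS4ERR_BAD_STATEID. The sketch below assumes the Info column has been exported as text. The helper name bad_stateid_calls is hypothetical, and the status on frame 2 is written the way the poster's annotation describes it rather than as it appears in the Info column.

```python
import re

CALL = re.compile(r"V4 Call \(Reply In (\d+)\) (\w+)")
REPLY = re.compile(r"V4 Reply \(Call In (\d+)\) (\w+)")


def bad_stateid_calls(rows):
    """rows: iterable of (frame_no, info). Return [(call_frame, reply_frame, op)]."""
    calls = {}
    failed = []
    for frame, info in rows:
        m = CALL.search(info)
        if m:
            calls[frame] = m.group(2)  # remember the operation of each call
            continue
        m = REPLY.search(info)
        if m and "NFS4ERR_BAD_STATEID" in info:
            call_frame = int(m.group(1))
            failed.append((call_frame, frame, calls.get(call_frame, m.group(2))))
    return failed


# Info column of the NFS frames from the trace above (frame 7 is plain TCP).
ROWS = [
    (1, "V4 Call (Reply In 2) TEST_STATEID"),
    (2, "V4 Reply (Call In 1) TEST_STATEID Status: NFS4ERR_BAD_STATEID (10025)"),
    (3, "V4 Call (Reply In 4) OPEN DH: 0xa2c3ad28/"),
    (4, "V4 Reply (Call In 3) OPEN StateID: 0xa205"),
    (5, "V4 Call (Reply In 8) READ StateID: 0xca5f Offset: 0 Len: 4096"),
    (6, "V4 Call (Reply In 9) READ StateID: 0xca5f Offset: 0 Len: 4096"),
    (8, "V4 Reply (Call In 5) READ Status: NFS4ERR_BAD_STATEID"),
    (9, "V4 Reply (Call In 6) READ Status: NFS4ERR_BAD_STATEID"),
]

if __name__ == "__main__":
    for call_frame, reply_frame, op in bad_stateid_calls(ROWS):
        print(f"{op}: call in frame {call_frame}, rejected in frame {reply_frame}")
```

On this trace it reports the TEST_STATEID call and both READ calls as rejected, while the OPEN in frames 3 and 4 succeeds.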
> >
> > Sometimes clearing locks on the NetApp (vserver locks break) and killing
> > dd/ioprocess helps for a while.
> > Right now my test setup is in this state. It looks like the lock problem is
> > always with the metadata/disk check, not with the domain itself:
> > I can read and write other files in this mount point from the same host.
> >
> > Hosts have the 3.10.0-693.11.6.el7.x86_64 kernel and oVirt 4.2.0.
> > I can't work out whether it's a NetApp or a CentOS bug.
> > If somebody wants to look closer at the dumps, I can mail them directly.
>