I may have found a workaround on the NFS server side: an option for the mountd service:


     -S      Tell mountd to suspend/resume execution of the nfsd threads when-
             ever the exports list is being reloaded.  This avoids intermit-
             tent access errors for clients that do NFS RPCs while the exports
             are being reloaded, but introduces a delay in RPC response while
             the reload is in progress.  If mountd crashes while an exports
             load is in progress, mountd must be restarted to get the nfsd
             threads running again, if this option is used.
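
In case someone wants to try the same, this is roughly how I enabled the flag (assuming mountd is started via the stock rc system; adjust if you start it differently):

     # /etc/rc.conf: pass -S to mountd so the nfsd threads are
     # suspended while the exports list is being reloaded;
     # keep whatever flags you already use and just add -S
     mountd_flags="-r -S"

     # restart mountd once so the new flag takes effect
     service mountd restart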

So far I have been able to reload the exports list twice without any randomly suspended VM. Let's see if this is a real solution or if I just got lucky twice.
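
For reference, this is how I trigger the reload in my tests (the pid file path should be the FreeBSD default; adjust if yours differs):

     # reload /etc/exports by HUP'ing mountd ...
     kill -HUP $(cat /var/run/mountd.pid)

     # ... or via the rc script, which should do the same
     service mountd reload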

But I am still interested in parameters that make vdsm more tolerant of short interruptions; instantly suspending a VM after such a short "outage" is not very nice.




On Wed, Jul 24, 2013 at 11:04 PM, squadra <squadra@gmail.com> wrote:
Hi Folks,

I have a setup running with the following specs:

4 VM hosts - CentOS 6.4 - latest oVirt 3.2 from dreyou

vdsm-xmlrpc-4.10.3-0.36.23.el6.noarch
vdsm-cli-4.10.3-0.36.23.el6.noarch
vdsm-python-4.10.3-0.36.23.el6.x86_64
vdsm-4.10.3-0.36.23.el6.x86_64
qemu-kvm-rhev-tools-0.12.1.2-2.355.el6.5.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6.5.x86_64
qemu-img-rhev-0.12.1.2-2.355.el6.5.x86_64
gpxe-roms-qemu-0.9.7-6.9.el6.noarch

The management node is also running the latest 3.2 from dreyou:

ovirt-engine-cli-3.2.0.10-1.el6.noarch
ovirt-engine-jbossas711-1-0.x86_64
ovirt-engine-tools-3.2.1-1.41.el6.noarch
ovirt-engine-backend-3.2.1-1.41.el6.noarch
ovirt-engine-sdk-3.2.0.9-1.el6.noarch
ovirt-engine-userportal-3.2.1-1.41.el6.noarch
ovirt-engine-setup-3.2.1-1.41.el6.noarch
ovirt-engine-webadmin-portal-3.2.1-1.41.el6.noarch
ovirt-engine-dbscripts-3.2.1-1.41.el6.noarch
ovirt-engine-3.2.1-1.41.el6.noarch
ovirt-engine-genericapi-3.2.1-1.41.el6.noarch
ovirt-engine-restapi-3.2.1-1.41.el6.noarch


The VMs are running from a FreeBSD 9.1 NFS server, which works absolutely flawlessly until I need to reload the /etc/exports file on the NFS server. For this, the NFS server itself doesn't need to be restarted; just the mountd daemon is HUP'ed.

But after sending a HUP to mountd, oVirt immediately thinks there was a problem with the storage backend and suspends a random VM. Luckily, the VM can be resumed instantly without further issues.

The VM hosts don't show any NFS-related errors, so I suspect that vdsm or the engine checks the NFS server continuously.
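
If I understand it right, that check boils down to something like the following (paths are placeholders; this is just my assumption of what the domain monitor does, not taken from the vdsm source):

     # rough sketch of what i assume vdsm's storage domain monitor does
     # every few seconds: a small direct-I/O read of the domain metadata,
     # which would fail immediately while the NFS server doesn't answer
     dd if=/rhev/data-center/mnt/SERVER:_export/SD-UUID/dom_md/metadata \
        of=/dev/null bs=4096 count=1 iflag=direct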

The only thing I can find in the vdsm.log of an affected host is:

-- snip --

Thread-539::DEBUG::2013-07-24 22:29:46,935::resourceManager::830::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-539::DEBUG::2013-07-24 22:29:46,935::resourceManager::864::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-539::DEBUG::2013-07-24 22:29:46,935::task::957::TaskManager.Task::(_decref) Task=`9332cd24-d899-4226-b0a2-93544ee737b4`::ref 0 aborting False
libvirtEventLoop::INFO::2013-07-24 22:29:55,142::libvirtvm::2509::vm.Vm::(_onAbnormalStop) vmId=`244f6c8d-bc2b-4669-8f6d-bd957222b946`::abnormal vm stop device virtio-disk0 error eother
libvirtEventLoop::DEBUG::2013-07-24 22:29:55,143::libvirtvm::3079::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`244f6c8d-bc2b-4669-8f6d-bd957222b946`::event Suspended detail 2 opaque None
libvirtEventLoop::INFO::2013-07-24 22:29:55,143::libvirtvm::2509::vm.Vm::(_onAbnormalStop) vmId=`244f6c8d-bc2b-4669-8f6d-bd957222b946`::abnormal vm stop device virtio-disk0 error eother


-- snip --

I am at a bit of a dead end currently, since reloading an NFS server's export table isn't an unusual task and everything else works as expected; oVirt just seems way too picky.

Is there any possibility to make this check a little more tolerant?

I tried setting "sd_health_check_delay = 30" in vdsm.conf, but this didn't change anything.
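
For completeness, this is what I put into /etc/vdsm/vdsm.conf (I assume the option belongs in the [irs] section, and vdsmd was restarted afterwards):

     [irs]
     # delay in seconds between storage domain health checks
     sd_health_check_delay = 30

     # picked up after: service vdsmd restart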

Does anyone have an idea how I can get rid of this annoying problem?

Cheers,

Juergen


--
Sent from the Delta quadrant using Borg technology!


