[Users] Guests are paused without error message while doing maintenance on NFS storage
Karli Sjöberg
Karli.Sjoberg at slu.se
Wed Jan 23 14:32:54 UTC 2013
Hi,
this is a bit of a complex issue, so I'll try to be as clear as possible. We are running oVirt 3.1 in our production environment, based on minimal Fedora 17 installs. We have 4x HP 380s (Intel) running in one cluster, and 2x Sun 7310s (AMD) in another cluster. They have shared storage over NFS to a FreeBSD-based system that uses ZFS as its filesystem. The storage boots off a mirrored ZFS pool made up of two USB sticks that only houses /, while /var, /usr, etc. live on a separate ZFS pool made up of the rest of the HDDs in the system. It looks like this:
FS                         MOUNTPOINT
pool1                      none                (the mirrored USB sticks)
pool1/root                 /                   (mounted ro)
pool2                      none                (the regular HDDs)
pool2/root                 none
pool2/root/usr             /usr
pool2/root/usr/home        /usr/home
pool2/root/usr/local       /usr/local
pool2/root/var             /var
tmpfs                      /tmp
pool2/export               /export
pool2/export/ds1           /export/ds1
pool2/export/ds1/data      /export/ds1/data
pool2/export/ds1/export    /export/ds1/export
pool2/export/ds1/iso       /export/ds1/iso
pool2/export/ds2           /export/ds2
pool2/export/ds2/data      /export/ds2/data
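For completeness, a dataset layout like that is created with roughly the following (reconstructed from memory; pool creation and the pool1 side omitted). Child datasets inherit the parent's mountpoint with their own name appended, so only the top of each tree needs an explicit mountpoint:

# zfs create -o mountpoint=none pool2/root
# zfs create -o mountpoint=/usr pool2/root/usr
# zfs create pool2/root/usr/home          # inherits /usr/home
# zfs create pool2/root/usr/local         # inherits /usr/local
# zfs create -o mountpoint=/var pool2/root/var
# zfs create -o mountpoint=/export pool2/export
# zfs create pool2/export/ds1             # inherits /export/ds1
# zfs create pool2/export/ds1/data        # and so on for ds1/export, ds1/iso, ds2...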
/etc/exports:
/export/ds1/data -alldirs -maproot=root 10.0.0.(all of the HV's)
/export/ds1/export -alldirs -maproot=root 10.0.0.(all of the HV's)
/export/ds1/iso -alldirs -maproot=root 10.0.0.(all of the HV's)
/export/ds2/data -alldirs -maproot=root 10.0.0.(all of the HV's)
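After editing /etc/exports, mountd has to be told to re-read it; on FreeBSD that is something like:

# service mountd reload      # sends mountd a SIGHUP so it re-reads /etc/exports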
To make those USB sticks last as long as possible, / is normally mounted read-only. When you need to change anything, you remount / read-write, do the maintenance, and then remount it read-only again.

But when you issue such a mount command, VMs in oVirt pause. At first we didn't understand that this was the cause, and tried to correlate the seemingly spontaneous pausing to just about anything. Then I was logged in to both oVirt's webadmin and the storage at the same time, issued "mount -uw /", and *boom*, random VMs started to pause :) Not all of them, though, and not consistently within one cluster or the other; it is completely random which VMs get paused each time. The remount itself only takes a couple of seconds:
# time mount -ur /

real    0m2.198s
user    0m0.000s
sys     0m0.002s
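So the whole maintenance cycle that triggers the pauses is nothing more exotic than:

# mount -uw /                # remount / read-write
(patch, edit configs, whatever needs doing)
# mount -ur /                # remount / read-only again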
And here's what vdsm on one of the HVs thought about that:
http://pastebin.com/MXjgpDfU
It begins with all VMs being "Up", then me issuing the remount on the storage from read-write to read-only (which took two seconds to complete), vdsm freaking out when it briefly loses its connections, and lastly me at 14:34 making them all run again from webadmin.
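For the record, the same unpausing should in principle also be doable per host straight through libvirt, e.g.:

# virsh list --all           # shows the paused domains
# virsh resume <vm-name>     # resumes one guest

but going behind vdsm's back like that is probably not a habit to get into.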
Two things:
1) Does anyone know of any improvements that could be made on the storage side, apart from the obvious "stop remounting"? Patching eventually has to happen, configurations have to change, and so on. Is there a smarter way of configuring something? Booting from another ordinary HDD is sadly out of the question, because there isn't room for any more; it's full. I would really rather have it boot from the HDDs that are already in there, but there are "other things" preventing that.
2) Nothing was logged about it in engine: no "Events" were created, and nothing in engine.log indicated that anything had gone wrong at all. If it wasn't serious enough to issue a warning, why disrupt the service by pausing the machines? Or at least start them back up automatically when the connection to the storage came back almost immediately on its own. Saying nothing made it really hard to troubleshoot, since we initially didn't know at all what could be causing the pauses, or when.
Best Regards
/Karli Sjöberg