
On Tue, Jan 12, 2016 at 9:32 AM, Markus Stockhausen <stockhausen@collogia.de> wrote:
Hi there,
we got a nasty situation yesterday in our OVirt 3.5.6 environment. We ran an LSM that failed during the cleanup operation, to be precise, when the process deleted an image on the source NFS storage.
Engine log gives:
2016-01-11 20:49:45,120 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (org.ovirt.thread.pool-8-thread-14) [77277f0] START, DeleteImageGroupVDSCommand( storagePoolId = 94ed7a19-fade-4bd6-83f2-2cbb2f730b95, ignoreFailoverLimit = false, storageDomainId = 272ec473-6041-42ee-bd1a-732789dd18d4, imageGroupId = aed132ef-703a-44d0-b875-db8c0d2c1a92, postZeros = false, forceDelete = false), log id: b52d59c
...
2016-01-11 20:50:45,206 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (org.ovirt.thread.pool-8-thread-14) [77277f0] Failed in DeleteImageGroupVDS method
VDSM (SPM) log gives:
Thread-97::DEBUG::2016-01-11 20:49:45,737::fileSD::384::Storage.StorageDomain::(deleteImage) Removing file: /rhev/data-center/mnt/1.2.3.4:_var_nas2_OVirtIB/272ec473-6041-42ee-bd1a-732789dd18d4/images/_remojzBd1r/0d623afb-291e-4f4c-acba-caecb125c4ed
...
Thread-97::ERROR::2016-01-11 20:50:45,737::task::866::Storage.TaskManager.Task::(_setError) Task=`cd477878-47b4-44b1-85a3-b5da19543a5e`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1549, in deleteImage
    pool.deleteImage(dom, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1884, in deleteImage
    domain.deleteImage(domain.sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/fileSD.py", line 385, in deleteImage
    self.oop.os.remove(volPath)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 245, in remove
    self._iop.unlink(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in unlink
    return self._sendCommand("unlink", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
You stumbled into https://bugzilla.redhat.com/1270220
Reading the docs, I got the idea that vdsm's default 60-second timeout for IO operations might be changed in /etc/vdsm/vdsm.conf:
[irs] process_pool_timeout = 180
Can anyone confirm that this will solve the problem?
Yes, this is the correct option. But note that deleting an image on NFS means 3 unlink operations per volume. If you have an image with one snapshot, that means 2 volumes, and 6 unlink calls. If unlink takes 70 seconds (timing out with the current 60-second timeout), deleting the image with one snapshot will take 420 seconds.

On the engine side, the engine waits until deleteImage finishes, or until vdsTimeout expires (by default 180 seconds), so you may need to increase the engine timeout as well. While the engine waits for deleteImage to finish, no other SPM operation can run.

So increasing the timeout is not the correct solution. You should check why your storage needs more than 60 seconds to perform an unlink operation and change your setup so unlink works in a timely manner. As a start, it would be useful to see the results of nfsstat on the host experiencing the slow deletes.

In master we now perform the deleteImage operation in a background task, so a slow unlink should not affect the engine side, and you can increase process_pool_timeout as needed. See https://github.com/oVirt/vdsm/commit/3239e74d1a9087352fca454926224f47272da6c...

We don't plan to backport this change to 3.6 since it is risky and does not fix the root cause, which is the slow NFS server, but if you want to test it, I can make a patch for 3.6.

Nir
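[Editor's note] The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration of the worst case described in the reply (serial unlinks, 3 per volume); the 70-second unlink time is an assumption taken from the example, not a measured value, and the helper name is hypothetical:

```python
UNLINKS_PER_VOLUME = 3   # per the reply: deleting a volume on NFS means 3 unlinks
UNLINK_SECONDS = 70      # assumed slow-server unlink time (times out at the 60s default)
ENGINE_VDS_TIMEOUT = 180 # engine-side vdsTimeout default, in seconds


def delete_image_seconds(volumes, unlink_seconds=UNLINK_SECONDS,
                         unlinks_per_volume=UNLINKS_PER_VOLUME):
    """Estimate worst-case wall time for a serial deleteImage call."""
    return volumes * unlinks_per_volume * unlink_seconds


# Image with one snapshot = 2 volumes -> 6 unlinks -> 420 seconds,
# well past the engine's default 180-second vdsTimeout.
total = delete_image_seconds(volumes=2)
print(total, total > ENGINE_VDS_TIMEOUT)
```

This makes the point of the reply concrete: even a modest per-unlink slowdown multiplies across volumes, so raising process_pool_timeout alone still leaves the engine-side timeout (and the blocked SPM) as problems.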