[ovirt-users] NFS IO timeout configuration

Nir Soffer nsoffer at redhat.com
Tue Jan 12 09:16:28 UTC 2016


On Tue, Jan 12, 2016 at 9:32 AM, Markus Stockhausen
<stockhausen at collogia.de> wrote:
> Hi there,
>
> we got a nasty situation yesterday in our oVirt 3.5.6 environment.
> We ran an LSM that failed during the cleanup operation, to be precise
> when the process deleted an image on the source NFS storage.
>
> Engine log gives:
>
> 2016-01-11 20:49:45,120 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (org.ovirt.thread.pool-8-thread-14) [77277f0] START, DeleteImageGroupVDSCommand( storagePoolId = 94ed7a19-fade-4bd6-83f2-2cbb2f730b95, ignoreFailoverLimit = false, storageDomainId = 272ec473-6041-42ee-bd1a-732789dd18d4, imageGroupId = aed132ef-703a-44d0-b875-db8c0d2c1a92, postZeros = false, forceDelete = false), log id: b52d59c
> ...
> 2016-01-11 20:50:45,206 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (org.ovirt.thread.pool-8-thread-14) [77277f0] Failed in DeleteImageGroupVDS method
>
> VDSM (SPM) log gives:
>
> Thread-97::DEBUG::2016-01-11 20:49:45,737::fileSD::384::Storage.StorageDomain::(deleteImage) Removing file: /rhev/data-center/mnt/1.2.3.4:_var_nas2_OVirtIB/272ec473-6041-42ee-bd1a-732789dd18d4/images/_remojzBd1r/0d623afb-291e-4f4c-acba-caecb125c4ed
> ...
> Thread-97::ERROR::2016-01-11 20:50:45,737::task::866::Storage.TaskManager.Task::(_setError) Task=`cd477878-47b4-44b1-85a3-b5da19543a5e`::Unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 873, in _run
>     return fn(*args, **kargs)
>   File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/storage/hsm.py", line 1549, in deleteImage
>     pool.deleteImage(dom, imgUUID, volsByImg)
>   File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
>     return method(self, *args, **kwargs)
>   File "/usr/share/vdsm/storage/sp.py", line 1884, in deleteImage
>     domain.deleteImage(domain.sdUUID, imgUUID, volsByImg)
>   File "/usr/share/vdsm/storage/fileSD.py", line 385, in deleteImage
>     self.oop.os.remove(volPath)
>   File "/usr/share/vdsm/storage/outOfProcess.py", line 245, in remove
>     self._iop.unlink(path)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in unlink
>     return self._sendCommand("unlink", {"path": path}, self.timeout)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
>     raise Timeout(os.strerror(errno.ETIMEDOUT))
> Timeout: Connection timed out

You stumbled into https://bugzilla.redhat.com/1270220

>
> Reading the docs I got the idea that vdsm's default 60-second timeout
> for IO operations might be changed within /etc/vdsm/vdsm.conf
>
> [irs]
> process_pool_timeout = 180
>
> Can anyone confirm that this will solve the problem?

Yes, this is the correct option.
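
For reference, that change goes into /etc/vdsm/vdsm.conf on the SPM host and
only takes effect after restarting vdsm. A minimal sketch, assuming a
systemd-based host:

  # /etc/vdsm/vdsm.conf
  [irs]
  # ioprocess timeout for storage IO operations, in seconds (default is 60)
  process_pool_timeout = 180

  # restart vdsm so the new value is picked up
  systemctl restart vdsmd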

But note that deleting an image on NFS means 3 unlink operations per volume.
If you have an image with one snapshot, that means 2 volumes and 6 unlink calls.

If each unlink takes 70 seconds (timing out with the current 60-second
timeout), deleting the image with one snapshot will take 420 seconds.

On the engine side, the engine waits until deleteImage finishes, or until
vdsTimeout expires (by default 180 seconds), so you may need to increase the
engine timeout as well.
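
If you do raise it, the engine side is configured with engine-config; something
like this (a sketch, pick a value that fits your environment):

  # show the current timeout, in seconds
  engine-config -g vdsTimeout
  # raise it, then restart the engine to apply it
  engine-config -s vdsTimeout=300
  systemctl restart ovirt-engine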

While the engine waits for deleteImage to finish, no other SPM operation can run.

So increasing the timeout is not the correct solution. You should check why your
storage needs more than 60 seconds to perform an unlink operation and change
your setup so unlink works in a timely manner.

As a start, it would be useful to see the results of nfsstat on the host
experiencing the slow deletes.
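
For example, with plain nfs-utils tooling (nfsiostat in addition, if it is
available on the host):

  # mounted NFS filesystems and their mount options
  nfsstat -m
  # client RPC/NFS call counts - watch the retrans counter
  nfsstat -c
  # per-operation latency for the mount backing this storage domain
  nfsiostat 5 /rhev/data-center/mnt/1.2.3.4:_var_nas2_OVirtIB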

In master we now perform the deleteImage operation in a background task, so a
slow unlink should not affect the engine side, and you can increase
process_pool_timeout as needed.
See https://github.com/oVirt/vdsm/commit/3239e74d1a9087352fca454926224f47272da6c5

We don't plan to backport this change to 3.6 since it is risky and does not
fix the root cause, which is the slow NFS server, but if you want to test it,
I can make a patch for 3.6.

Nir