On Tue, Jan 12, 2016 at 9:32 AM, Markus Stockhausen
<stockhausen(a)collogia.de> wrote:
Hi there,
we got a nasty situation yesterday in our oVirt 3.5.6 environment.
We ran a live storage migration (LSM) that failed during the cleanup
operation, specifically when the process deleted an image on the source
NFS storage.
Engine log gives:
2016-01-11 20:49:45,120 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand]
(org.ovirt.thread.pool-8-thread-14) [77277f0] START, DeleteImageGroupVDSCommand(
storagePoolId = 94ed7a19-fade-4bd6-83f2-2cbb2f730b95, ignoreFailoverLimit = false,
storageDomainId = 272ec473-6041-42ee-bd1a-732789dd18d4, imageGroupId =
aed132ef-703a-44d0-b875-db8c0d2c1a92, postZeros = false, forceDelete = false), log id:
b52d59c
...
2016-01-11 20:50:45,206 ERROR
[org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand]
(org.ovirt.thread.pool-8-thread-14) [77277f0] Failed in DeleteImageGroupVDS method
VDSM (SPM) log gives:
Thread-97::DEBUG::2016-01-11
20:49:45,737::fileSD::384::Storage.StorageDomain::(deleteImage) Removing file:
/rhev/data-center/mnt/1.2.3.4:_var_nas2_OVirtIB/272ec473-6041-42ee-bd1a-732789dd18d4/images/_remojzBd1r/0d623afb-291e-4f4c-acba-caecb125c4ed
...
Thread-97::ERROR::2016-01-11
20:50:45,737::task::866::Storage.TaskManager.Task::(_setError)
Task=`cd477878-47b4-44b1-85a3-b5da19543a5e`::Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/task.py", line 873, in _run
return fn(*args, **kargs)
File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
res = f(*args, **kwargs)
File "/usr/share/vdsm/storage/hsm.py", line 1549, in deleteImage
pool.deleteImage(dom, imgUUID, volsByImg)
File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
return method(self, *args, **kwargs)
File "/usr/share/vdsm/storage/sp.py", line 1884, in deleteImage
domain.deleteImage(domain.sdUUID, imgUUID, volsByImg)
File "/usr/share/vdsm/storage/fileSD.py", line 385, in deleteImage
self.oop.os.remove(volPath)
File "/usr/share/vdsm/storage/outOfProcess.py", line 245, in remove
self._iop.unlink(path)
File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in
unlink
return self._sendCommand("unlink", {"path": path}, self.timeout)
File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in
_sendCommand
raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
You stumbled into
https://bugzilla.redhat.com/1270220
Reading the docs, I got the idea that the vdsm default 60-second timeout
for I/O operations might be changed within /etc/vdsm/vdsm.conf:
[irs]
process_pool_timeout = 180
Can anyone confirm that this will solve the problem?
Yes, this is the correct option.
But note that deleting an image on NFS means 3 unlink operations per volume.
If you have an image with one snapshot, that means 2 volumes, and 6
unlink calls.
If each unlink takes 70 seconds (timing out with the current 60-second
timeout), deleting the image with one snapshot will take 420 seconds.
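The arithmetic above can be sketched as follows (the 70 seconds per
unlink is just the example figure from this thread, and the assumption
that the 3 files are the volume's data, metadata, and lease files is
mine, not stated here):

```python
# Rough estimate of how long deleteImage blocks the SPM on NFS.
# Assumption: 3 unlink calls per volume (e.g. the volume file plus
# its metadata and lease files -- hypothetical breakdown).
UNLINKS_PER_VOLUME = 3

def delete_image_seconds(volumes, seconds_per_unlink):
    """Total deleteImage time if every unlink is uniformly slow."""
    return volumes * UNLINKS_PER_VOLUME * seconds_per_unlink

# An image with one snapshot has 2 volumes; at 70 s per unlink
# this is 420 seconds -- far past the engine's default 180 s vdsTimeout.
print(delete_image_seconds(2, 70))  # 420
```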
On the engine side, the engine waits until deleteImage finishes, or until
vdsTimeout expires (180 seconds by default), so you may need to increase
the engine timeout as well.
While the engine waits for deleteImage to finish, no other SPM operation
can run.
So increasing the timeout is not the correct solution. You should check why
your storage needs more than 60 seconds to perform an unlink operation and
change your setup so unlink works in a timely manner.
As a start, it would be useful to see the output of nfsstat on the host
experiencing the slow deletes.
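To get a raw number for the unlink latency itself, a small sketch like
this can be pointed at a directory on the NFS data domain mount (the
function name and the mount path in the comment are placeholders, not
part of vdsm):

```python
import os
import tempfile
import time

def time_unlink(directory, size=4096):
    """Create a small file in `directory`, then time os.unlink on it."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"\0" * size)
    finally:
        os.close(fd)
    start = time.time()
    os.unlink(path)
    return time.time() - start

# Point `directory` at the NFS mount, e.g. something under
# /rhev/data-center/mnt/ -- anything approaching the 60-second
# ioprocess timeout here points at the storage, not at vdsm.
print("unlink took %.3f seconds" % time_unlink(tempfile.gettempdir()))
```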
In master, we now perform the deleteImage operation in a background task,
so slow unlinks should not affect the engine side, and you can increase
process_pool_timeout as needed.
See
https://github.com/oVirt/vdsm/commit/3239e74d1a9087352fca454926224f47272d...
We don't plan to backport this change to 3.6, since it is risky and does
not fix the root cause (the slow NFS server), but if you want to test it,
I can make a patch for 3.6.
Nir