
On Tue, Jan 12, 2016 at 9:32 AM, Markus Stockhausen <stockhausen@collogia.de> wrote:
Hi there,
we got a nasty situation yesterday in our OVirt 3.5.6 environment. We ran an LSM that failed during the cleanup operation, to be precise, when the process deleted an image on the source NFS storage.
Engine log gives:
2016-01-11 20:49:45,120 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (org.ovirt.thread.pool-8-thread-14) [77277f0] START, DeleteImageGroupVDSCommand( storagePoolId = 94ed7a19-fade-4bd6-83f2-2cbb2f730b95, ignoreFailoverLimit = false, storageDomainId = 272ec473-6041-42ee-bd1a-732789dd18d4, imageGroupId = aed132ef-703a-44d0-b875-db8c0d2c1a92, postZeros = false, forceDelete = false), log id: b52d59c
...
2016-01-11 20:50:45,206 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (org.ovirt.thread.pool-8-thread-14) [77277f0] Failed in DeleteImageGroupVDS method
VDSM (SPM) log gives:
Thread-97::DEBUG::2016-01-11 20:49:45,737::fileSD::384::Storage.StorageDomain::(deleteImage) Removing file: /rhev/data-center/mnt/1.2.3.4:_var_nas2_OVirtIB/272ec473-6041-42ee-bd1a-732789dd18d4/images/_remojzBd1r/0d623afb-291e-4f4c-acba-caecb125c4ed
...
Thread-97::ERROR::2016-01-11 20:50:45,737::task::866::Storage.TaskManager.Task::(_setError) Task=`cd477878-47b4-44b1-85a3-b5da19543a5e`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1549, in deleteImage
    pool.deleteImage(dom, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/securable.py", line 77, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1884, in deleteImage
    domain.deleteImage(domain.sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/fileSD.py", line 385, in deleteImage
    self.oop.os.remove(volPath)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 245, in remove
    self._iop.unlink(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in unlink
    return self._sendCommand("unlink", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 385, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
You stumbled into https://bugzilla.redhat.com/1270220
Reading the docs, I got the idea that vdsm's default 60-second timeout for IO operations might be changed in /etc/vdsm/vdsm.conf:
[irs] process_pool_timeout = 180
Can anyone confirm that this will solve the problem?
Yes, this is the correct option. But note that deleting an image on NFS means 3 unlink operations per volume. If you have an image with one snapshot, that means 2 volumes, and 6 unlink calls. If unlink takes 70 seconds (timing out with the current 60-second timeout), deleting the image with one snapshot will take 420 seconds.

On the engine side, the engine waits until deleteImage finishes, or until vdsTimeout expires (by default 180 seconds), so you may need to increase the engine timeout as well. While the engine waits for deleteImage to finish, no other SPM operation can run.

So increasing the timeout is not the correct solution. You should check why your storage needs more than 60 seconds to perform an unlink operation and change your setup so unlink works in a timely manner. As a start, it would be useful to see the results of nfsstat on the host experiencing the slow deletes.

In master we now perform the deleteImage operation in a background task, so a slow unlink should not affect the engine side, and you can increase process_pool_timeout as needed. See https://github.com/oVirt/vdsm/commit/3239e74d1a9087352fca454926224f47272da6c...

We don't plan to backport this change to 3.6 since it is risky and does not fix the root cause, which is the slow NFS server, but if you want to test it, I can make a patch for 3.6.

Nir
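[Editor's note] The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration of the worst case described in the reply (serial unlinks, 3 per volume); the 70-second unlink time is an assumption taken from the example, not a measured value, and the helper name is hypothetical:

```python
UNLINKS_PER_VOLUME = 3   # per the reply: deleting a volume on NFS means 3 unlinks
UNLINK_SECONDS = 70      # assumed slow-server unlink time (times out at the 60s default)
ENGINE_VDS_TIMEOUT = 180 # engine-side vdsTimeout default, in seconds


def delete_image_seconds(volumes, unlink_seconds=UNLINK_SECONDS,
                         unlinks_per_volume=UNLINKS_PER_VOLUME):
    """Estimate worst-case wall time for a serial deleteImage call."""
    return volumes * unlinks_per_volume * unlink_seconds


# Image with one snapshot = 2 volumes -> 6 unlinks -> 420 seconds,
# well past the engine's default 180-second vdsTimeout.
total = delete_image_seconds(volumes=2)
print(total, total > ENGINE_VDS_TIMEOUT)
```

This makes the point of the reply concrete: even a modest per-unlink slowdown multiplies across volumes, so raising process_pool_timeout alone still leaves the engine-side timeout (and the blocked SPM) as problems.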