Re: [virt-devel] Timeout failures in the OST

On Mon, Nov 2, 2020 at 2:27 PM Benny Zlotnik <bzlotnik@redhat.com> wrote:

Issues like this belong on the oVirt devel mailing list.
Looks like live merge failed [1]:

2020-11-01 10:31:49,903+0100 ERROR (periodic/0) [virt.vm] (vmId='fcabfd2e-2937-4419-9b25-78fdd2b9c7c2') Unable to get watermarks for drive vdb: invalid argument: invalid path /rhev/data-center/mnt/blockSD/97b6175e-b6a9-419b-bd54-7c1e38c1bf71/images/fbb11a06-b8ef-4078-9530-978e7ca8ea0b/a911ad89-e461-4db4-88bf-a5d6590608b5 not assigned to domain (vm:1213)
This may be a bug in the vdsm live merge flow, trying to monitor a volume after the volume was already removed, or it may be a libvirt/qemu bug.
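If it is the former, one option is to make the periodic monitor tolerate this specific libvirt error instead of logging it as a failure. A rough illustration using the libvirt Python bindings (a sketch only, not vdsm's actual code; query_allocation is a made-up name):

import libvirt

def query_allocation(dom, path):
    # Sketch: treat "path not assigned to domain" as a stale drive,
    # assuming a concurrent live merge removed the volume between
    # sampling the drive list and querying it.
    try:
        # virDomainGetBlockInfo returns (capacity, allocation, physical).
        capacity, allocation, physical = dom.blockInfo(path)
        return allocation
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_INVALID_ARG:
            return None  # volume vanished under us; skip this cycle
        raise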
...
2020-11-01 10:31:53,138+0100 ERROR (jsonrpc/1) [virt.vm] (vmId='fcabfd2e-2937-4419-9b25-78fdd2b9c7c2') merge: libvirt does not support volume chain monitoring. Unable to perform live merge. drive: vdb, alias: ua-fbb11a06-b8ef-4078-9530-978e7ca8ea0b, chains: {} (vm:5411)
libvirt logs report:

...
2020-11-01 09:30:50.021+0000: 40137: error : virProcessRunInFork:1161 : internal error: child reported (status=125):
2020-11-01 09:30:50.025+0000: 40137: error : virProcessRunInFork:1161 : internal error: child reported (status=125): internal error: child reported (status=125):
2020-11-01 09:30:50.025+0000: 40137: warning : qemuDomainSnapshotDiskUpdateSource:15582 : Unable to move disk metadata on vm vm0
2020-11-01 09:31:45.539+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal error: unable to execute QEMU command 'blockdev-del': Node libvirt-6-format is in use
2020-11-01 09:31:45.539+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal error: unable to execute QEMU command 'blockdev-del': Block device libvirt-6-storage is in use
2020-11-01 09:31:45.540+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal error: unable to execute QEMU command 'blockdev-del': Node 'libvirt-7-format' is busy: node is used as backing hd of 'libvirt-6-format'
2020-11-01 09:31:45.541+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal error: unable to execute QEMU command 'blockdev-del': Block device libvirt-7-storage is in use
2020-11-01 09:31:45.900+0000: 40133: error : qemuDomainGetBlockInfo:12272 : invalid argument: invalid path /rhev/data-center/mnt/blockSD/97b6175e-b6a9-419b-bd54-7c1e38c1bf71/images/fbb11a06-b8ef-4078-9530-978e7ca8ea0b/a911ad89-e461-4db4-88bf-a5d6590608b5 not assigned to domain
These smell like a libvirt/qemu bug. Is this reproducible with RHEL 8.3?
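Something like this could be used to check on a stock RHEL 8.3 host (a sketch, not an OST script; it assumes a running VM named 'vm0' with an active-layer snapshot on disk 'vdb', and the old top volume path is a placeholder you would fill in):

import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vm0')

# Commit the active layer into its backing file, mirroring what a
# live merge does.
dom.blockCommit('vdb', None, None, 0,
                libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)

# Wait for the commit job to catch up, then pivot to the base.
while True:
    info = dom.blockJobInfo('vdb', 0)
    if info and info['cur'] == info['end']:
        break
    time.sleep(1)
dom.blockJobAbort('vdb', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

# If the bug reproduces, querying the removed top volume path now
# fails with "invalid path ... not assigned to domain".
old_top = '/path/to/old/top/volume'  # placeholder
print(dom.blockInfo(old_top))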
Looks like the issue previously discussed in "[rhev-devel] Live storage migration instability in OST" two months ago has resurfaced.
Another issue seems to be the removal of the source disk:

2020-11-01 10:31:48,056+0100 ERROR (tasks/3) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/192.168.202.2:_exports_nfs_share1/3ca0e492-45f2-4383-b149-439043408bce/images/_remove_me_fbb11a06-b8ef-4078-9530-978e7ca8ea0b can't be removed (fileSD:258)
2020-11-01 10:31:48,056+0100 ERROR (tasks/3) [storage.TaskManager.Task] (Task='70db80c2-076a-4ba1-a65d-821e6b5fe52c') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 251, in purgeImage
    self.oop.os.rmdir(toDelDir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 238, in rmdir
    self._iop.rmdir(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 550, in rmdir
    return self._sendCommand("rmdir", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 39] Directory not empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1947, in purgeImage
    domain.purgeImage(sdUUID, imgUUID, volsByImg, discard)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 855, in purgeImage
    self._manifest.purgeImage(sdUUID, imgUUID, volsImgs, discard)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 259, in purgeImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
vdsm.storage.exception.ImageDeleteError: Could not remove all image's volumes: ('fbb11a06-b8ef-4078-9530-978e7ca8ea0b [Errno 39] Directory not empty',)
But it's unclear what the leftover is. It may be a leftover from a previous failed LSM. I think we need a better error message here; it should list the files in the non-empty directory.
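For example, something along these lines in fileSD.purgeImage (an untested sketch against the code in the traceback above; it assumes the oop.os wrapper exposes listdir the same way it exposes rmdir):

try:
    self.oop.os.rmdir(toDelDir)
except OSError as e:
    # Sketch: report what is still in the directory, so the log shows
    # the actual leftover files instead of just errno 39.
    leftovers = self.oop.os.listdir(toDelDir)
    raise se.ImageDeleteError(
        "%s %s (leftover files: %s)" % (imgUUID, e, leftovers))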
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12782/...
On Mon, Nov 2, 2020 at 1:44 PM Steven Rosenberg <srosenbe@redhat.com> wrote:
Dear Benny,
Thank you for your response.
Here is the engine log from one of the PS 45 timeout failures [1].
It seems this timeout is related to the engine command failing; the OST scripts are not designed to detect the failure, so they keep waiting until they time out:
2020-11-01 10:31:55,175+01 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [live_storage_migration] Command id: '60b3f7fc-93db-48b2-82a1-8a93c47e18e1 failed child command status for step 'MERGE_STATUS'
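For example, a fail-fast wait loop on the OST side could abort as soon as the disk goes ILLEGAL instead of waiting for the global timeout (a hypothetical helper using ovirtsdk4, not existing OST code):

import time
import ovirtsdk4.types as types

def wait_for_disk_ok(disk_service, timeout=300):
    # Sketch: poll the disk and fail fast on a terminal bad state.
    deadline = time.time() + timeout
    while time.time() < deadline:
        disk = disk_service.get()
        if disk.status == types.DiskStatus.OK:
            return
        if disk.status == types.DiskStatus.ILLEGAL:
            raise RuntimeError("live storage migration failed, disk is ILLEGAL")
        time.sleep(3)
    raise RuntimeError("disk did not become OK within %s seconds" % timeout)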
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12796/...
With Best Regards.
Steven.
On Mon, Nov 2, 2020 at 12:51 PM Benny Zlotnik <bzlotnik@redhat.com> wrote:
Can you link to the relevant engine/vdsm logs? The timeout in the tests indicates that the desired state wasn't reached, so the test logs alone don't tell us what exactly happened.
On Sun, Nov 1, 2020 at 5:10 PM Steven Rosenberg <srosenbe@redhat.com> wrote:
Dear virt-devel,
We are currently experiencing many timeout failures in various patch sets for Gerrit change 111395 [1].
The timeouts occur intermittently and seem to be unrelated to the changes, which are confined to the 004 module [2] and should have affected only VM1 / Disk1.
We could use some advice on addressing these issues, as well as a review of the patch so that we can move it forward. The patch sets and the relevant timeouts are as follows:
PS 40: test_live_storage_migration – test 004 [3]
PS 41: test_verify_engine_backup – test 002 [4]
PS 43: test_virtual_machines – test 100 [5]
PS 45: test_live_storage_migration – test 004 [6]
[1] https://gerrit.ovirt.org/#/c/111395/
[2] basic-suite-master/test-scenarios/004_basic_sanity.py
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12628/...
[4] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12659/...
[5] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12744/...
[6] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12782/...
With Best Regards.
Steven