Re: [virt-devel] Timeout failures in the OST

On Mon, Nov 2, 2020 at 2:27 PM Benny Zlotnik <bzlotnik@redhat.com> wrote:

Issues like this belong on the oVirt devel mailing list.
Looks like live merge failed [1]:

2020-11-01 10:31:49,903+0100 ERROR (periodic/0) [virt.vm] (vmId='fcabfd2e-2937-4419-9b25-78fdd2b9c7c2') Unable to get watermarks for drive vdb: invalid argument: invalid path /rhev/data-center/mnt/blockSD/97b6175e-b6a9-419b-bd54-7c1e38c1bf71/images/fbb11a06-b8ef-4078-9530-978e7ca8ea0b/a911ad89-e461-4db4-88bf-a5d6590608b5 not assigned to domain (vm:1213)
This may be a bug in the vdsm live merge flow, trying to monitor a volume after the volume was already removed, or it may be a libvirt/qemu bug.
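If it is the former, one option is to make the periodic monitor tolerate this specific libvirt error instead of logging it as a failure. A rough illustration using the libvirt Python bindings (a sketch only, not vdsm's actual code; query_allocation is a made-up name):

import libvirt

def query_allocation(dom, path):
    # Sketch: treat "path not assigned to domain" as a stale drive,
    # assuming a concurrent live merge removed the volume between
    # sampling the drive list and querying it.
    try:
        # virDomainGetBlockInfo returns (capacity, allocation, physical).
        capacity, allocation, physical = dom.blockInfo(path)
        return allocation
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_INVALID_ARG:
            return None  # volume vanished under us; skip this cycle
        raise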
...
2020-11-01 10:31:53,138+0100 ERROR (jsonrpc/1) [virt.vm] (vmId='fcabfd2e-2937-4419-9b25-78fdd2b9c7c2') merge: libvirt does not support volume chain monitoring. Unable to perform live merge. drive: vdb, alias: ua-fbb11a06-b8ef-4078-9530-978e7ca8ea0b, chains: {} (vm:5411)
libvirt logs report:

...
2020-11-01 09:30:50.021+0000: 40137: error : virProcessRunInFork:1161 : internal error: child reported (status=125):
2020-11-01 09:30:50.025+0000: 40137: error : virProcessRunInFork:1161 : internal error: child reported (status=125): internal error: child reported (status=125):
2020-11-01 09:30:50.025+0000: 40137: warning : qemuDomainSnapshotDiskUpdateSource:15582 : Unable to move disk metadata on vm vm0
2020-11-01 09:31:45.539+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal error: unable to execute QEMU command 'blockdev-del': Node libvirt-6-format is in use
2020-11-01 09:31:45.539+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal error: unable to execute QEMU command 'blockdev-del': Block device libvirt-6-storage is in use
2020-11-01 09:31:45.540+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal error: unable to execute QEMU command 'blockdev-del': Node 'libvirt-7-format' is busy: node is used as backing hd of 'libvirt-6-format'
2020-11-01 09:31:45.541+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal error: unable to execute QEMU command 'blockdev-del': Block device libvirt-7-storage is in use
2020-11-01 09:31:45.900+0000: 40133: error : qemuDomainGetBlockInfo:12272 : invalid argument: invalid path /rhev/data-center/mnt/blockSD/97b6175e-b6a9-419b-bd54-7c1e38c1bf71/images/fbb11a06-b8ef-4078-9530-978e7ca8ea0b/a911ad89-e461-4db4-88bf-a5d6590608b5 not assigned to domain
These smell like a libvirt/qemu bug. Is this reproducible with RHEL 8.3?
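Something like this could be used to check on a stock RHEL 8.3 host (a sketch, not an OST script; it assumes a running VM named 'vm0' with an active-layer snapshot on disk 'vdb', and the old top volume path is a placeholder you would fill in):

import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vm0')

# Commit the active layer into its backing file, mirroring what a
# live merge does.
dom.blockCommit('vdb', None, None, 0,
                libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)

# Wait for the commit job to catch up, then pivot to the base.
while True:
    info = dom.blockJobInfo('vdb', 0)
    if info and info['cur'] == info['end']:
        break
    time.sleep(1)
dom.blockJobAbort('vdb', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

# If the bug reproduces, querying the removed top volume path now
# fails with "invalid path ... not assigned to domain".
old_top = '/path/to/old/top/volume'  # placeholder
print(dom.blockInfo(old_top))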
Looks like the issue previously discussed in "[rhev-devel] Live storage migration instability in OST" two months ago has resurfaced.
Another issue seems to be the removal of the source disk:

2020-11-01 10:31:48,056+0100 ERROR (tasks/3) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/192.168.202.2:_exports_nfs_share1/3ca0e492-45f2-4383-b149-439043408bce/images/_remove_me_fbb11a06-b8ef-4078-9530-978e7ca8ea0b can't be removed (fileSD:258)
2020-11-01 10:31:48,056+0100 ERROR (tasks/3) [storage.TaskManager.Task] (Task='70db80c2-076a-4ba1-a65d-821e6b5fe52c') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 251, in purgeImage
    self.oop.os.rmdir(toDelDir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 238, in rmdir
    self._iop.rmdir(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 550, in rmdir
    return self._sendCommand("rmdir", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 39] Directory not empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1947, in purgeImage
    domain.purgeImage(sdUUID, imgUUID, volsByImg, discard)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 855, in purgeImage
    self._manifest.purgeImage(sdUUID, imgUUID, volsImgs, discard)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 259, in purgeImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
vdsm.storage.exception.ImageDeleteError: Could not remove all image's volumes: ('fbb11a06-b8ef-4078-9530-978e7ca8ea0b [Errno 39] Directory not empty',)
But it's unclear what the leftover is. It may be a leftover from a previous failed LSM. I think we need a better error message here; it should list the files in the non-empty directory.
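For example, something along these lines in fileSD.purgeImage (an untested sketch against the code in the traceback above; it assumes the oop.os wrapper exposes listdir the same way it exposes rmdir):

try:
    self.oop.os.rmdir(toDelDir)
except OSError as e:
    # Sketch: report what is still in the directory, so the log shows
    # the actual leftover files instead of just errno 39.
    leftovers = self.oop.os.listdir(toDelDir)
    raise se.ImageDeleteError(
        "%s %s (leftover files: %s)" % (imgUUID, e, leftovers))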
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12782/...
On Mon, Nov 2, 2020 at 1:44 PM Steven Rosenberg <srosenbe@redhat.com> wrote:
Dear Benny,
Thank you for your response.
Here is the engine log from one of the PS 45 timeout failures [1].
It seems this timeout is related to the engine command failing; the OST scripts are not designed to detect the failure, so they keep waiting until they time out:
2020-11-01 10:31:55,175+01 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [live_storage_migration] Command id: '60b3f7fc-93db-48b2-82a1-8a93c47e18e1 failed child command status for step 'MERGE_STATUS'
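For example, a fail-fast wait loop on the OST side could abort as soon as the disk goes ILLEGAL instead of waiting for the global timeout (a hypothetical helper using ovirtsdk4, not existing OST code):

import time
import ovirtsdk4.types as types

def wait_for_disk_ok(disk_service, timeout=300):
    # Sketch: poll the disk and fail fast on a terminal bad state.
    deadline = time.time() + timeout
    while time.time() < deadline:
        disk = disk_service.get()
        if disk.status == types.DiskStatus.OK:
            return
        if disk.status == types.DiskStatus.ILLEGAL:
            raise RuntimeError("live storage migration failed, disk is ILLEGAL")
        time.sleep(3)
    raise RuntimeError("disk did not become OK within %s seconds" % timeout)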
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12796/...
With Best Regards.
Steven.
On Mon, Nov 2, 2020 at 12:51 PM Benny Zlotnik <bzlotnik@redhat.com> wrote:
Can you link to the relevant engine/vdsm logs? The timeout in the tests indicates that the desired state wasn't reached, so the test logs alone don't tell us what exactly happened.
On Sun, Nov 1, 2020 at 5:10 PM Steven Rosenberg <srosenbe@redhat.com> wrote:
Dear virt-devel,
We are currently experiencing many timeout failures in various patch sets for Gerrit change 111395 [1].
The timeouts occur intermittently and seem to be unrelated to the changes, which are confined to the 004 module [2] and should have affected only VM1 / Disk1.
We could use some advice on addressing these issues, as well as a review of the patch so that we can move it forward. The patch sets and the relevant timeouts are as follows:
PS 40: test_live_storage_migration – test 004 [3]
PS 41: test_verify_engine_backup – test 002 [4]
PS 43: test_virtual_machines – test 100 [5]
PS 45: test_live_storage_migration – test 004 [6]
[1] https://gerrit.ovirt.org/#/c/111395/
[2] basic-suite-master/test-scenarios/004_basic_sanity.py
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12628/...
[4] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12659/...
[5] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12744/...
[6] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/12782/...
With Best Regards.
Steven