On Mon, Nov 2, 2020 at 2:27 PM Benny Zlotnik <bzlotnik(a)redhat.com> wrote:
Issues like this belong on the oVirt devel mailing list.
It looks like live merge failed [1]:
2020-11-01 10:31:49,903+0100 ERROR (periodic/0) [virt.vm]
(vmId='fcabfd2e-2937-4419-9b25-78fdd2b9c7c2') Unable to get watermarks for drive
vdb: invalid argument: invalid path
/rhev/data-center/mnt/blockSD/97b6175e-b6a9-419b-bd54-7c1e38c1bf71/images/fbb11a06-b8ef-4078-9530-978e7ca8ea0b/a911ad89-e461-4db4-88bf-a5d6590608b5
not assigned to domain (vm:1213)
This may be a bug in the vdsm live merge flow, trying to monitor a volume after the
volume was already removed, or it may be a libvirt/qemu bug.
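If this is on the vdsm side, a minimal sketch of the idea (hypothetical code calling
libvirt-python directly, not the actual vdsm implementation) would be to tolerate
paths that were already dropped from the domain while querying watermarks:

import libvirt

def query_watermarks(dom, drive_paths, log):
    """Return blockInfo() results per path, skipping volumes already removed."""
    results = {}
    for path in drive_paths:
        try:
            # blockInfo() raises VIR_ERR_INVALID_ARG ("invalid path ... not
            # assigned to domain") once a completed merge removed the volume.
            results[path] = dom.blockInfo(path)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_INVALID_ARG:
                log.warning("volume %s no longer in domain, skipping", path)
                continue
            raise
    return results

The proper fix is probably to refresh the volume chain after the merge completes
rather than just skipping, but that depends on where the stale path comes from.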
...
2020-11-01 10:31:53,138+0100 ERROR (jsonrpc/1) [virt.vm]
(vmId='fcabfd2e-2937-4419-9b25-78fdd2b9c7c2') merge: libvirt does not support
volume chain monitoring. Unable to perform live merge. drive: vdb, alias:
ua-fbb11a06-b8ef-4078-9530-978e7ca8ea0b, chains: {} (vm:5411)
libvirt logs report:
...
2020-11-01 09:30:50.021+0000: 40137: error : virProcessRunInFork:1161 : internal error:
child reported (status=125):
2020-11-01 09:30:50.025+0000: 40137: error : virProcessRunInFork:1161 : internal error:
child reported (status=125): internal error: child reported (status=125):
2020-11-01 09:30:50.025+0000: 40137: warning : qemuDomainSnapshotDiskUpdateSource:15582 :
Unable to move disk metadata on vm vm0
2020-11-01 09:31:45.539+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal
error: unable to execute QEMU command 'blockdev-del': Node libvirt-6-format is in
use
2020-11-01 09:31:45.539+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal
error: unable to execute QEMU command 'blockdev-del': Block device
libvirt-6-storage is in use
2020-11-01 09:31:45.540+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal
error: unable to execute QEMU command 'blockdev-del': Node
'libvirt-7-format' is busy: node is used as backing hd of
'libvirt-6-format'
2020-11-01 09:31:45.541+0000: 40134: error : qemuMonitorJSONCheckError:412 : internal
error: unable to execute QEMU command 'blockdev-del': Block device
libvirt-7-storage is in use
2020-11-01 09:31:45.900+0000: 40133: error : qemuDomainGetBlockInfo:12272 : invalid
argument: invalid path
/rhev/data-center/mnt/blockSD/97b6175e-b6a9-419b-bd54-7c1e38c1bf71/images/fbb11a06-b8ef-4078-9530-978e7ca8ea0b/a911ad89-e461-4db4-88bf-a5d6590608b5
not assigned to domain
These smell like a libvirt/qemu bug.
Is this reproducible with RHEL 8.3?
It looks like the issue discussed two months ago in "[rhev-devel] Live
storage migration instability in OST" has resurfaced.
Another issue seems to be the removal of the source disk:
2020-11-01 10:31:48,056+0100 ERROR (tasks/3) [storage.StorageDomainManifest] removed
image dir:
/rhev/data-center/mnt/192.168.202.2:_exports_nfs_share1/3ca0e492-45f2-4383-b149-439043408bce/images/_remove_me_fbb11a06-b8ef-4078-9530-978e7ca8ea0b
can't be removed (fileSD:258)
2020-11-01 10:31:48,056+0100 ERROR (tasks/3) [storage.TaskManager.Task]
(Task='70db80c2-076a-4ba1-a65d-821e6b5fe52c') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 251, in purgeImage
    self.oop.os.rmdir(toDelDir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 238, in rmdir
    self._iop.rmdir(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 550, in rmdir
    return self._sendCommand("rmdir", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 39] Directory not empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1947, in purgeImage
    domain.purgeImage(sdUUID, imgUUID, volsByImg, discard)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 855, in purgeImage
    self._manifest.purgeImage(sdUUID, imgUUID, volsImgs, discard)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 259, in purgeImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
vdsm.storage.exception.ImageDeleteError: Could not remove all image's volumes: ('fbb11a06-b8ef-4078-9530-978e7ca8ea0b [Errno 39] Directory not empty',)
But it's unclear what the leftover is.
It may be a leftover from a previous failed LSM. I think we need a better error
message here; it should list the files in the non-empty directory.
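For illustration, a rough sketch of what that could look like (hypothetical helper
using plain os calls; the real code would go through vdsm's ioprocess wrapper):

import errno
import logging
import os

log = logging.getLogger("storage.purge")

def rmdir_or_report(to_del_dir):
    """Remove an image directory, logging the leftover entries on ENOTEMPTY."""
    try:
        os.rmdir(to_del_dir)
    except OSError as e:
        if e.errno == errno.ENOTEMPTY:
            leftovers = os.listdir(to_del_dir)
            log.error("cannot remove image dir %s, leftover entries: %s",
                      to_del_dir, leftovers)
        raise

With that in the log, the next failure would show immediately whether the leftover
is a stale volume, a metadata file, or something else.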
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/127...
On Mon, Nov 2, 2020 at 1:44 PM Steven Rosenberg <srosenbe(a)redhat.com> wrote:
>
> Dear Benny,
>
> Thank you for your response.
>
> Here is the timeout engine log from one of the ps 45 failures [1].
>
> It seems this timeout is related to the engine command failing; the OST scripts are
> not designed to detect the failure, so they simply time out (a fail-fast sketch
> follows the log below):
>
> 2020-11-01 10:31:55,175+01 ERROR
> [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67)
> [live_storage_migration] Command id: '60b3f7fc-93db-48b2-82a1-8a93c47e18e1 failed
> child command status for step 'MERGE_STATUS'
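On the OST side, a hedged sketch of a poll loop that would fail fast (hypothetical
helper names, not the existing OST code): it raises as soon as the engine reports a
failure state instead of waiting for the suite timeout.

import time

class CommandFailed(Exception):
    pass

def wait_for_status(get_status, desired, failed_states, timeout=600, interval=3):
    """Poll get_status() until `desired`, failing fast on any failure state."""
    deadline = time.monotonic() + timeout
    status = None
    while time.monotonic() < deadline:
        status = get_status()  # e.g. a callable reading disk status via the SDK
        if status == desired:
            return
        if status in failed_states:
            raise CommandFailed(
                "engine reported %r while waiting for %r" % (status, desired))
        time.sleep(interval)
    raise TimeoutError(
        "timed out waiting for %r (last status: %r)" % (desired, status))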
>
>
>
>
> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/127...
>
> With Best Regards.
>
> Steven.
>
>
>
> On Mon, Nov 2, 2020 at 12:51 PM Benny Zlotnik <bzlotnik(a)redhat.com> wrote:
>>
>> Can you link to the relevant engine/vdsm logs?
>> The timeout in the tests indicates that the desired state wasn't reached,
>> so the test logs don't provide information about what exactly happened.
>>
>> On Sun, Nov 1, 2020 at 5:10 PM Steven Rosenberg <srosenbe(a)redhat.com> wrote:
>> >
>> > Dear virt-devel,
>> >
>> > We are currently experiencing many timeout failures in various patch sets for
>> > Gerrit change 111395 [1].
>> >
>> > The timeouts occur intermittently and seem to be unrelated to the changes,
>> > which are only in the 004 module [2] and should have affected only VM1 / Disk1.
>> >
>> > We could use some advice on addressing these issues, as well as a review of the
>> > patch so we can move it forward. The patch sets and the relevant timeouts are as
>> > follows:
>> >
>> > PS 40:
>> >
>> > test_live_storage_migration – test 004 [3]
>> >
>> > PS 41:
>> >
>> > test_verify_engine_backup – test 002 [4]
>> >
>> > PS 43:
>> >
>> > test_virtual_machines – test 100 [5]
>> >
>> > PS 45:
>> >
>> > test_live_storage_migration – test 004 [6]
>> >
>> >
>> >
>> >
>> > [1] https://gerrit.ovirt.org/#/c/111395/
>> > [2] basic-suite-master/test-scenarios/004_basic_sanity.py
>> > [3] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/126...
>> > [4] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/126...
>> > [5] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/127...
>> > [6] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/127...
>> >
>> > With Best Regards.
>> >
>> > Steven
>> >
>> >
>>