virt-sparsify failed (was: [oVirt Jenkins] ovirt-system-tests_basic-suite-master_nightly - Build # 479 - Failure!)

Hi all,

On Mon, Oct 12, 2020 at 5:17 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
Above failed with: https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/... vdsm.log has: https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/... 2020-10-11 22:05:14,695-0400 INFO (jsonrpc/1) [api.host] FINISH getJobs return={'jobs': {'05eaea44-7e4c-4442-9926-2bcb696520f1': {'id': '05eaea44-7e4c-4442-9926-2bcb696520f1', 'status': 'failed', 'description': 'sparsify_volume', 'job_type': 'storage', 'error': {'code': 100, 'message': 'General Exception: (\'Command [\\\'/usr/bin/virt-sparsify\\\', \\\'--machine-readable\\\', \\\'--in-place\\\', \\\'/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/8b292c13-fd8a-4a7c-903c-5724ec742c10/images/a367c179-2ac9-4930-abeb-848229f81c97/515fcf06-8743-45d1-9af8-61a0c48e8c67\\\'] failed with rc=1 out=b\\\'3/12\\\\n{ "message": "libguestfs error: guestfs_launch failed.\\\\\\\\nThis usually means the libguestfs appliance failed to start or crashed.\\\\\\\\nDo:\\\\\\\\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\\\\\\\\nand run the command again. For further information, read:\\\\\\\\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\\\\\\\\nYou can also run \\\\\\\'libguestfs-test-tool\\\\\\\' and post the *complete* output\\\\\\\\ninto a bug report or message to the libguestfs mailing list.", "timestamp": "2020-10-11T22:05:08.397538670-04:00", "type": "error" }\\\\n\\\' err=b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\\\\nThis usually means the libguestfs appliance failed to start or crashed.\\\\nDo:\\\\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\\\\nand run the command again. For further information, read:\\\\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\\\\nYou can also run \\\'libguestfs-test-tool\\\' and post the *complete* output\\\\ninto a bug report or message to the libguestfs mailing list.\\\\n\\\\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \\\\ncomplete output:\\\\n\\\\n virt-sparsify -v -x [...]\\\\n"\',)'}}}, 'status': {'code': 0, 'message': 'Done'}} from=::ffff:192.168.201.4,43318, flow_id=365642f4-2fe2-45df-937a-f4ca435eea38 (api:54) 2020-10-11 22:05:14,695-0400 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Return 'Host.getJobs' in bridge with {'05eaea44-7e4c-4442-9926-2bcb696520f1': {'id': '05eaea44-7e4c-4442-9926-2bcb696520f1', 'status': 'failed', 'description': 'sparsify_volume', 'job_type': 'storage', 'error': {'code': 100, 'message': 'General Exception: (\'Command [\\\'/usr/bin/virt-sparsify\\\', \\\'--machine-readable\\\', \\\'--in-place\\\', \\\'/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/8b292c13-fd8a-4a7c-903c-5724ec742c10/images/a367c179-2ac9-4930-abeb-848229f81c97/515fcf06-8743-45d1-9af8-61a0c48e8c67\\\'] failed with rc=1 out=b\\\'3/12\\\\n{ "message": "libguestfs error: guestfs_launch failed.\\\\\\\\nThis usually means the libguestfs appliance failed to start or crashed.\\\\\\\\nDo:\\\\\\\\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\\\\\\\\nand run the command again. 
For further information, read:\\\\\\\\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\\\\\\\\nYou can also run \\\\\\\'libguestfs-test-tool\\\\\\\' and post the *complete* output\\\\\\\\ninto a bug report or message to the libguestfs mailing list.", "timestamp": "2020-10-11T22:05:08.397538670-04:00", "type": "error" }\\\\n\\\' err=b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\\\\nThis usually means the libguestfs appliance failed to start or crashed.\\\\nDo:\\\\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\\\\nand run the command again. For further information, read:\\\\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\\\\nYou can also run \\\'libguestfs-test-tool\\\' and post the *complete* output\\\\ninto a bug report or message to the libguestfs mailing list.\\\\n\\\\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \\\\ncomplete output:\\\\n\\\\n virt-sparsify -v -x [...]\\\\n"\',)'}}} (__init__:356) /var/log/messages has: Oct 11 22:04:51 lago-basic-suite-master-host-0 kvm[80601]: 1 guest now active Oct 11 22:05:06 lago-basic-suite-master-host-0 journal[80557]: Domain id=1 name='guestfs-hl0ntvn92rtkk2u0' uuid=05ea5a53-562f-49f8-a8ca-76b45c5325b4 is tainted: custom-argv Oct 11 22:05:06 lago-basic-suite-master-host-0 journal[80557]: Domain id=1 name='guestfs-hl0ntvn92rtkk2u0' uuid=05ea5a53-562f-49f8-a8ca-76b45c5325b4 is tainted: host-cpu Oct 11 22:05:06 lago-basic-suite-master-host-0 kvm[80801]: 2 guests now active Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: internal error: End of file from qemu monitor Oct 11 22:05:08 lago-basic-suite-master-host-0 kvm[80807]: 1 guest now active Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: cannot resolve symlink /tmp/libguestfseTG8xF/console.sock: No such file or directoryKN<F3>L^? Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: cannot resolve symlink /tmp/libguestfseTG8xF/guestfsd.sock: No such file or directoryKN<F3>L^? 
Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: cannot lookup default selinux label for /tmp/libguestfs1WkcF7/overlay1.qcow2 Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: cannot lookup default selinux label for /var/tmp/.guestfs-36/appliance.d/kernel Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: cannot lookup default selinux label for /var/tmp/.guestfs-36/appliance.d/initrd Oct 11 22:05:08 lago-basic-suite-master-host-0 vdsm[74096]: ERROR Job '05eaea44-7e4c-4442-9926-2bcb696520f1' failed#012Traceback (most recent call last):#012 File "/usr/lib/python3.6/site-packages/vdsm/jobs.py", line 159, in run#012 self._run()#012 File "/usr/lib/python3.6/site-packages/vdsm/storage/sdm/api/sparsify_volume.py", line 57, in _run#012 virtsparsify.sparsify_inplace(self._vol_info.path)#012 File "/usr/lib/python3.6/site-packages/vdsm/virtsparsify.py", line 40, in sparsify_inplace#012 commands.run(cmd)#012 File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101, in run#012 raise cmdutils.Error(args, p.returncode, out, err)#012vdsm.common.cmdutils.Error: Command ['/usr/bin/virt-sparsify', '--machine-readable', '--in-place', '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/8b292c13-fd8a-4a7c-903c-5724ec742c10/images/a367c179-2ac9-4930-abeb-848229f81c97/515fcf06-8743-45d1-9af8-61a0c48e8c67'] failed with rc=1 out=b'3/12\n{ "message": "libguestfs error: guestfs_launch failed.\\nThis usually means the libguestfs appliance failed to start or crashed.\\nDo:\\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\\nand run the command again. For further information, read:\\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\\nYou can also run \'libguestfs-test-tool\' and post the *complete* output\\ninto a bug report or message to the libguestfs mailing list.", "timestamp": "2020-10-11T22:05:08.397538670-04:00", "type": "error" }\n' err=b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\nThis usually means the libguestfs appliance failed to start or crashed.\nDo:\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\nand run the command again. For further information, read:\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nYou can also run 'libguestfs-test-tool' and post the *complete* output\ninto a bug report or message to the libguestfs mailing list.\n\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \ncomplete output:\n\n virt-sparsify -v -x [...]\n" The next run of the job (480) did finish successfully. No idea if it was already fixed by a patch, or is simply a random/env issue. Is it possible to pass LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 without patching vdsm? Not sure. In any case, if this does not cause too much excess debug logging, perhaps better always pass it, to help retroactively analyze such failures. Or, patch virt-sparsify/libguestfs/whatever to always log at least enough information on failure even without passing these. Best regards,
Build Number: 479
Build Status: Failure
Triggered By: Started by timer
-------------------------------------
Changes Since Last Success:
-------------------------------------
Changes for Build #479

[hbraha] network: bond active slave test
-----------------
Failed Tests:
-----------------
1 tests failed.
FAILED: basic-suite-master.test-scenarios.004_basic_sanity.test_sparsify_disk1
Error Message: AssertionError: False != True after 600 seconds
Stack Trace:

api_v4 = <ovirtsdk4.Connection object at 0x7fe717c60e50>

@order_by(_TEST_LIST)
def test_sparsify_disk1(api_v4):
    engine = api_v4.system_service()
    disk_service = test_utils.get_disk_service(engine, DISK1_NAME)

    with test_utils.TestEvent(engine, 1325): # USER_SPARSIFY_IMAGE_START event
        disk_service.sparsify()

    with test_utils.TestEvent(engine, 1326): # USER_SPARSIFY_IMAGE_FINISH_SUCCESS
        pass
../basic-suite-master/test-scenarios/004_basic_sanity.py:295:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python2.7/contextlib.py:24: in __exit__
    self.gen.next()
../ost_utils/ost_utils/engine_utils.py:44: in wait_for_event
    lambda:
../ost_utils/ost_utils/assertions.py:97: in assert_true_within_long
    assert_equals_within_long(func, True, allowed_exceptions)
../ost_utils/ost_utils/assertions.py:82: in assert_equals_within_long
    func, value, LONG_TIMEOUT, allowed_exceptions=allowed_exceptions
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

func = <function <lambda> at 0x7fe7176dfb18>, value = True, timeout = 600
allowed_exceptions = [], initial_wait = 0
error_message = 'False != True after 600 seconds'
def assert_equals_within(
    func, value, timeout, allowed_exceptions=None, initial_wait=10,
    error_message=None
):
    allowed_exceptions = allowed_exceptions or []
    with _EggTimer(timeout) as timer:
        while not timer.elapsed():
            try:
                res = func()
                if res == value:
                    return
            except Exception as exc:
                if _instance_of_any(exc, allowed_exceptions):
                    time.sleep(3)
                    continue

                LOGGER.exception("Unhandled exception in %s", func)
                raise

            if initial_wait == 0:
                time.sleep(3)
            else:
                time.sleep(initial_wait)
                initial_wait = 0
    try:
        if error_message is None:
            error_message = '%s != %s after %s seconds' % (res, value, timeout)

        raise AssertionError(error_message)
E AssertionError: False != True after 600 seconds
../ost_utils/ost_utils/assertions.py:60: AssertionError
-- Didi

On Mon, Oct 12, 2020 at 9:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
On Mon, Oct 12, 2020 at 5:17 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
Above failed with:
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
vdsm.log has:
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
2020-10-11 22:05:14,695-0400 INFO (jsonrpc/1) [api.host] FINISH getJobs return={'jobs': {'05eaea44-7e4c-4442-9926-2bcb696520f1': {'id': '05eaea44-7e4c-4442-9926-2bcb696520f1', 'status': 'failed', 'description': 'sparsify_volume', 'job_type': 'storage', 'error': {'code': 100, 'message': 'General Exception: (\'Command [\\\'/usr/bin/virt-sparsify\\\', \\\'--machine-readable\\\', \\\'--in-place\\\', \\\'/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/8b292c13-fd8a-4a7c-903c-5724ec742c10/images/a367c179-2ac9-4930-abeb-848229f81c97/515fcf06-8743-45d1-9af8-61a0c48e8c67\\\'] failed with rc=1 out=b\\\'3/12\\\\n{ "message": "libguestfs error: guestfs_launch failed.\\\\\\\\nThis usually means the libguestfs appliance failed to start or crashed.\\\\\\\\nDo:\\\\\\\\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\\\\\\\\nand run the command again. For further information, read:\\\\\\\\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\\\\\\\\nYou can also run \\\\\\\'libguestfs-test-tool\\\\\\\' and post the *complete* output\\\\\\\\ninto a bug report or message to the libguestfs mailing list.", "timestamp": "2020-10-11T22:05:08.397538670-04:00", "type": "error" }\\\\n\\\' err=b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\\\\nThis usually means the libguestfs appliance failed to start or crashed.\\\\nDo:\\\\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\\\\nand run the command again. For further information, read:\\\\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\\\\nYou can also run \\\'libguestfs-test-tool\\\' and post the *complete* output\\\\ninto a bug report or message to the libguestfs mailing list.\\\\n\\\\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \\\\ncomplete output:\\\\n\\\\n virt-sparsify -v -x [...]\\\\n"\',)'}}}, 'status': {'code': 0, 'message': 'Done'}} from=::ffff:192.168.201.4,43318, flow_id=365642f4-2fe2-45df-937a-f4ca435eea38 (api:54) 2020-10-11 22:05:14,695-0400 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Return 'Host.getJobs' in bridge with {'05eaea44-7e4c-4442-9926-2bcb696520f1': {'id': '05eaea44-7e4c-4442-9926-2bcb696520f1', 'status': 'failed', 'description': 'sparsify_volume', 'job_type': 'storage', 'error': {'code': 100, 'message': 'General Exception: (\'Command [\\\'/usr/bin/virt-sparsify\\\', \\\'--machine-readable\\\', \\\'--in-place\\\', \\\'/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/8b292c13-fd8a-4a7c-903c-5724ec742c10/images/a367c179-2ac9-4930-abeb-848229f81c97/515fcf06-8743-45d1-9af8-61a0c48e8c67\\\'] failed with rc=1 out=b\\\'3/12\\\\n{ "message": "libguestfs error: guestfs_launch failed.\\\\\\\\nThis usually means the libguestfs appliance failed to start or crashed.\\\\\\\\nDo:\\\\\\\\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\\\\\\\\nand run the command again. For further information, read:\\\\\\\\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\\\\\\\\nYou can also run \\\\\\\'libguestfs-test-tool\\\\\\\' and post the *complete* output\\\\\\\\ninto a bug report or message to the libguestfs mailing list.", "timestamp": "2020-10-11T22:05:08.397538670-04:00", "type": "error" }\\\\n\\\' err=b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\\\\nThis usually means the libguestfs appliance failed to start or crashed.\\\\nDo:\\\\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\\\\nand run the command again. 
For further information, read:\\\\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\\\\nYou can also run \\\'libguestfs-test-tool\\\' and post the *complete* output\\\\ninto a bug report or message to the libguestfs mailing list.\\\\n\\\\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \\\\ncomplete output:\\\\n\\\\n virt-sparsify -v -x [...]\\\\n"\',)'}}} (__init__:356)
/var/log/messages has:
Oct 11 22:04:51 lago-basic-suite-master-host-0 kvm[80601]: 1 guest now active Oct 11 22:05:06 lago-basic-suite-master-host-0 journal[80557]: Domain id=1 name='guestfs-hl0ntvn92rtkk2u0' uuid=05ea5a53-562f-49f8-a8ca-76b45c5325b4 is tainted: custom-argv Oct 11 22:05:06 lago-basic-suite-master-host-0 journal[80557]: Domain id=1 name='guestfs-hl0ntvn92rtkk2u0' uuid=05ea5a53-562f-49f8-a8ca-76b45c5325b4 is tainted: host-cpu Oct 11 22:05:06 lago-basic-suite-master-host-0 kvm[80801]: 2 guests now active Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: internal error: End of file from qemu monitor Oct 11 22:05:08 lago-basic-suite-master-host-0 kvm[80807]: 1 guest now active Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: cannot resolve symlink /tmp/libguestfseTG8xF/console.sock: No such file or directoryKN<F3>L^? Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: cannot resolve symlink /tmp/libguestfseTG8xF/guestfsd.sock: No such file or directoryKN<F3>L^? Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: cannot lookup default selinux label for /tmp/libguestfs1WkcF7/overlay1.qcow2 Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: cannot lookup default selinux label for /var/tmp/.guestfs-36/appliance.d/kernel Oct 11 22:05:08 lago-basic-suite-master-host-0 journal[80557]: cannot lookup default selinux label for /var/tmp/.guestfs-36/appliance.d/initrd Oct 11 22:05:08 lago-basic-suite-master-host-0 vdsm[74096]: ERROR Job '05eaea44-7e4c-4442-9926-2bcb696520f1' failed#012Traceback (most recent call last):#012 File "/usr/lib/python3.6/site-packages/vdsm/jobs.py", line 159, in run#012 self._run()#012 File "/usr/lib/python3.6/site-packages/vdsm/storage/sdm/api/sparsify_volume.py", line 57, in _run#012 virtsparsify.sparsify_inplace(self._vol_info.path)#012 File "/usr/lib/python3.6/site-packages/vdsm/virtsparsify.py", line 40, in sparsify_inplace#012 commands.run(cmd)#012 File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101, in run#012 raise cmdutils.Error(args, p.returncode, out, err)#012vdsm.common.cmdutils.Error: Command ['/usr/bin/virt-sparsify', '--machine-readable', '--in-place', '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/8b292c13-fd8a-4a7c-903c-5724ec742c10/images/a367c179-2ac9-4930-abeb-848229f81c97/515fcf06-8743-45d1-9af8-61a0c48e8c67'] failed with rc=1 out=b'3/12\n{ "message": "libguestfs error: guestfs_launch failed.\\nThis usually means the libguestfs appliance failed to start or crashed.\\nDo:\\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\\nand run the command again. For further information, read:\\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\\nYou can also run \'libguestfs-test-tool\' and post the *complete* output\\ninto a bug report or message to the libguestfs mailing list.", "timestamp": "2020-10-11T22:05:08.397538670-04:00", "type": "error" }\n' err=b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\nThis usually means the libguestfs appliance failed to start or crashed.\nDo:\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\nand run the command again. For further information, read:\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nYou can also run 'libguestfs-test-tool' and post the *complete* output\ninto a bug report or message to the libguestfs mailing list.\n\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \ncomplete output:\n\n virt-sparsify -v -x [...]\n"
The next run of the job (480) did finish successfully. No idea if it was already fixed by a patch, or is simply a random/env issue.
I think this is an env issue; we run on overloaded VMs with a small amount of memory. I have seen such random failures before.
Is it possible to pass LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 without patching vdsm? Not sure. In any case, if this does not cause too much excess debug logging, perhaps better always pass it, to help retroactively analyze such failures.
Unfortunately, using LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 creates a huge amount of logs, so it is not possible to include them for every run. Maybe we could show the logs only if the command failed, but vdsm's infrastructure makes this hard to do, since we always log command stderr, even if the command was successful. Also, I'm not sure it is a good idea to always run virt-sparsify at debug level. I think I posted a patch to run libguestfs-test-tool after virt-something errors to collect data, but I cannot find it. Maybe this can be useful: after every virt-something error we would have enough info about the issue in the vdsm log.
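A rough sketch of that idea (illustration only, using a plain subprocess wrapper rather than vdsm's commands module, and assuming libguestfs-test-tool is installed on the host):

    import logging
    import subprocess

    log = logging.getLogger(__name__)

    def run_virt_tool(cmd):
        # Run a virt-* command; if it fails, run libguestfs-test-tool and
        # log its output, so the vdsm log has enough context for analysis.
        p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if p.returncode == 0:
            return p.stdout
        diag = subprocess.run(["libguestfs-test-tool"],
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        log.error("%s failed rc=%d err=%r\nlibguestfs-test-tool output:\n%s",
                  cmd[0], p.returncode, p.stderr,
                  diag.stdout.decode(errors="replace"))
        raise RuntimeError("%s failed with rc=%d" % (cmd[0], p.returncode))

For the failure above, cmd would be ['/usr/bin/virt-sparsify', '--machine-readable', '--in-place', path].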
Or, patch virt-sparsify/libguestfs/whatever to always log at least enough information on failure even without passing these.
I think this is the right solution - when a virt-something tool fails, it should log the reason for the failure - the error that caused the tool to fail. I'm not sure this is easy to do, as the failing code runs inside a special VM. Maybe the code running in the VM should log its output in a machine-readable way, so that once an error is detected, virt-something can report the error as the reason, without running in debug mode.

Nir
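For what it's worth, with --machine-readable the error already arrives as a JSON object on stdout (the '{ "message": ..., "type": "error" }' line in the log above), so a caller could at least surface that message directly. A minimal sketch, assuming the caller has the tool's stdout as bytes:

    import json

    def machine_readable_errors(out):
        # Extract the "type": "error" objects from virt-sparsify
        # --machine-readable output, e.g.
        # b'3/12\n{ "message": "...", "type": "error", ... }\n'
        errors = []
        for line in out.decode(errors="replace").splitlines():
            line = line.strip()
            if not line.startswith("{"):
                continue  # progress lines such as "3/12"
            try:
                obj = json.loads(line)
            except ValueError:
                continue
            if obj.get("type") == "error":
                errors.append(obj.get("message", ""))
        return errors

In this particular failure the JSON message is just the generic "guestfs_launch failed" text, though, so this improves presentation rather than revealing the root cause.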

On Tue, Oct 13, 2020 at 06:45:42PM +0300, Nir Soffer wrote:
I think this is the right solution - when virt-something tool fails, it should log the reason for the failure - the error that caused the tool to fail. I'm not sure this is easy to do as the failing code run inside a special VM. Maybe the code running in the VM should log the output in a machine readable way, so once an error is detected virt-something can report the error as the reason, without running in debug mode.
All the virt-* tools that I've written have a non-zero exit code and print an error message on stderr when they fail. Errors from inside the appliance are propagated to the library and thence to the tool correctly.

I think the best thing to do is:

- spool up stdout + stderr from the tool
- if the exit code != 0, save the spooled output for analysis
- if the exit code == 0, discard it (or keep it if you like)

Without LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE stdout + stderr should be quite minimal for all tools.

Rich.

-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. http://people.redhat.com/~rjones/virt-df/
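In caller terms that pattern could look roughly like this (a sketch only; the keep_dir path is made up for illustration, not an existing vdsm location):

    import os
    import shutil
    import subprocess
    import tempfile

    def run_and_spool(cmd, keep_dir="/var/log/vdsm/virt-tool-failures"):
        # Spool the tool's stdout + stderr to a temporary file; keep the
        # file only if the tool exits non-zero, otherwise discard it.
        prefix = os.path.basename(cmd[0]) + "-"
        with tempfile.NamedTemporaryFile(prefix=prefix, suffix=".log",
                                         delete=False) as spool:
            rc = subprocess.call(cmd, stdout=spool, stderr=subprocess.STDOUT)
        if rc == 0:
            os.unlink(spool.name)  # success: throw the spooled output away
            return
        os.makedirs(keep_dir, exist_ok=True)
        dst = os.path.join(keep_dir, os.path.basename(spool.name))
        shutil.move(spool.name, dst)  # failure: keep it for analysis
        raise RuntimeError("%s failed (rc=%d), output kept at %s"
                           % (cmd[0], rc, dst))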

On Tue, Oct 13, 2020 at 7:15 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 06:45:42PM +0300, Nir Soffer wrote:
I think this is the right solution - when virt-something tool fails, it should log the reason for the failure - the error that caused the tool to fail. I'm not sure this is easy to do as the failing code run inside a special VM. Maybe the code running in the VM should log the output in a machine readable way, so once an error is detected virt-something can report the error as the reason, without running in debug mode.
All the virt-* tools that I've written have a non-zero exit code and print an error message on stderr when they fail. Errors from inside the appliance are propagated to the library and thence to the tool correctly.
I think the best thing to do is:
- spool up stdout + stderr from the tool
- if the exit code != 0, save the spooled output for analysis
- if the exit code == 0, discard it (or keep it if you like)
This is what we already do, and the result is not helpful. If you look at the log message in the previous message, basically the only info about the error is:

    libguestfs error: guestfs_launch failed

I don't see what we can do with this error message. If this is an error running qemu, the error message from qemu could help us understand why qemu failed to start.
Without LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE stdout + stderr should be quite minimal for all tools.
Rich.
-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. http://people.redhat.com/~rjones/virt-df/

On Tue, Oct 13, 2020 at 07:56:29PM +0300, Nir Soffer wrote:
On Tue, Oct 13, 2020 at 7:15 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 06:45:42PM +0300, Nir Soffer wrote:
I think this is the right solution - when virt-something tool fails, it should log the reason for the failure - the error that caused the tool to fail. I'm not sure this is easy to do as the failing code run inside a special VM. Maybe the code running in the VM should log the output in a machine readable way, so once an error is detected virt-something can report the error as the reason, without running in debug mode.
All the virt-* tools that I've written have a non-zero exit code and print an error message on stderr when they fail. Errors from inside the appliance are propagated to the library and thence to the tool correctly.
I think the best thing to do is:
- spool up stdout + stderr from the tool
- if the exit code != 0, save the spooled output for analysis
- if the exit code == 0, discard it (or keep it if you like)
This is what we already do, and the result is not helpful. If you look at the log message in the previous message, basically the only info about the error is:
libguestfs error: guestfs_launch failed
I don't see what we can do with this error message.
Right, so in this particular instance the error message would tell us that you should run libguestfs-test-tool because your qemu/kernel/etc is broken in some way :-/

There's not a particularly good answer here if you don't want to ever use LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE, but perhaps you could run libguestfs-test-tool if you see any error which matches the substring /guestfs_launch/ ?

Rich.

-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 100 libraries supported. http://fedoraproject.org/wiki/MinGW

On Tue, Oct 13, 2020 at 8:40 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 07:56:29PM +0300, Nir Soffer wrote:
On Tue, Oct 13, 2020 at 7:15 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 06:45:42PM +0300, Nir Soffer wrote:
I think this is the right solution - when virt-something tool fails, it should log the reason for the failure - the error that caused the tool to fail. I'm not sure this is easy to do as the failing code run inside a special VM. Maybe the code running in the VM should log the output in a machine readable way, so once an error is detected virt-something can report the error as the reason, without running in debug mode.
All the virt-* tools that I've written have a non-zero exit code and print an error message on stderr when they fail. Errors from inside the appliance are propagated to the library and thence to the tool correctly.
I think the best thing to do is:
- spool up stdout + stderr from the tool
- if the exit code != 0, save the spooled output for analysis
- if the exit code == 0, discard it (or keep it if you like)
This is what we already do, and the result is not helpful. If you look at the log message in the previous message, basically the only info about the error is:
libguestfs error: guestfs_launch failed
I don't see what we can do with this error message.
Right, so in this particular instance the error message would tell us that you should run libguestfs-test-tool because your qemu/kernel/etc is broken in some way :-/
There's not a particularly good answer here if you don't want to ever use LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE, but perhaps you could run libguestfs-test-tool if you see any error which matches the substring /guestfs_launch/ ?
Another (orthogonal?) option:

Make LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE log elsewhere, not to stdout/err (e.g. some other file descriptor, or to a file passed via env or whatever). This way, it might make sense for vdsm to always pass these vars, continue logging all stdout/err, and log/keep debug/trace logs only on errors.

Best regards,
-- Didi
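To make that concrete, the caller side could then look roughly like the sketch below. This is purely an illustration of the proposed behaviour: LIBGUESTFS_DEBUG_FILE (or whatever the knob would be called) does not exist in libguestfs today - it is exactly the feature being suggested here.

    import os
    import subprocess
    import tempfile

    def run_with_debug_side_channel(cmd):
        fd, debug_path = tempfile.mkstemp(prefix="libguestfs-debug-",
                                          suffix=".log")
        os.close(fd)
        env = dict(os.environ,
                   LIBGUESTFS_DEBUG="1",
                   LIBGUESTFS_TRACE="1",
                   LIBGUESTFS_DEBUG_FILE=debug_path)  # hypothetical variable
        p = subprocess.run(cmd, env=env,
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if p.returncode == 0:
            os.unlink(debug_path)  # success: drop the (huge) debug/trace log
            return p.stdout
        # Failure: stdout/stderr keep going to the regular vdsm log as today,
        # while the full debug/trace log is preserved separately for analysis.
        raise RuntimeError("%s failed rc=%d err=%r, debug log kept at %s"
                           % (cmd[0], p.returncode, p.stderr, debug_path))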

On Wed, Oct 14, 2020 at 8:54 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Oct 13, 2020 at 8:40 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 07:56:29PM +0300, Nir Soffer wrote:
On Tue, Oct 13, 2020 at 7:15 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 06:45:42PM +0300, Nir Soffer wrote:
I think this is the right solution - when virt-something tool fails, it should log the reason for the failure - the error that caused the tool to fail. I'm not sure this is easy to do as the failing code run inside a special VM. Maybe the code running in the VM should log the output in a machine readable way, so once an error is detected virt-something can report the error as the reason, without running in debug mode.
All the virt-* tools that I've written have a non-zero exit code and print an error message on stderr when they fail. Errors from inside the appliance are propagated to the library and thence to the tool correctly.
I think the best thing to do is:
- spool up stdout + stderr from the tool
- if the exit code != 0, save the spooled output for analysis
- if the exit code == 0, discard it (or keep it if you like)
This is what we already do, and the result is not helpful. If you look at the log message in the previous message, basically the only info about the error is:
libguestfs error: guestfs_launch failed
I don't see what we can do with this error message.
Right, so in this particular instance the error message would tell us that you should run libguestfs-test-tool because your qemu/kernel/etc is broken in some way :-/
There's not a particularly good answer here if you don't want to ever use LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE, but perhaps you could run libguestfs-test-tool if you see any error which matches the substring /guestfs_launch/ ?
Another (orthogonal?) option:
Make LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE log elsewhere, not to stdout/err (e.g. some other file descriptor, or to a file passed via env or whatever). This way, it might make sense for vdsm to always pass these vars, continue logging all stdout/err, and log/keep debug/trace logs only on errors.
This now happened again: https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/... https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/... 2020-11-09 01:05:42,031-0500 INFO (jsonrpc/4) [api.host] FINISH getAllVmIoTunePolicies return={'status': {'code': 0, 'message': 'Done'}, 'io_tune_policies_dict': {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [], 'current_values': [{'name': 'vda', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80', 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0, 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0, 'read_iops_sec': 0}}]}}} from=::1,34002 (api:54) 2020-11-09 01:05:42,038-0500 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Return 'Host.getAllVmIoTunePolicies' in bridge with {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [], 'current_values': [{'name': 'vda', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80', 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0, 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0, 'read_iops_sec': 0}}]}} (__init__:360) 2020-11-09 01:05:42,435-0500 DEBUG (tasks/3) [common.commands] FAILED: <err> = b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\nThis usually means the libguestfs appliance failed to start or crashed.\nDo:\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\nand run the command again. For further information, read:\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nYou can also run 'libguestfs-test-tool' and post the *complete* output\ninto a bug report or message to the libguestfs mailing list.\n\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \ncomplete output:\n\n virt-sparsify -v -x [...]\n"; <rc> = 1 (commands:98) I suggest that if we have come to a dead-end and no-one has any clue, then we either patch something (vdsm?) to allow getting more information if this happens again, or open a bug for further discussion/prioritization. Best regards, -- Didi

Hi,

Not sure if this is the same issue or not, but I opened a thread about guestfs_launch failing not too long ago: https://www.redhat.com/archives/libguestfs/2020-August/msg00352.html

Since we are using our own tool, we ended up retrying guestfs_launch (and also guestfs_open, since the documentation suggests not reusing handles). Not sure if this retry solved the issue, since we only introduced it recently. (A rough sketch of this approach appears below, after the quoted thread.)

Sam

On Mon, Nov 9, 2020 at 9:57 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Wed, Oct 14, 2020 at 8:54 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Oct 13, 2020 at 8:40 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 07:56:29PM +0300, Nir Soffer wrote:
On Tue, Oct 13, 2020 at 7:15 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 06:45:42PM +0300, Nir Soffer wrote:
I think this is the right solution - when virt-something tool fails, it should log the reason for the failure - the error that caused the tool to fail. I'm not sure this is easy to do as the failing code run inside a special VM. Maybe the code running in the VM should log the output in a machine readable way, so once an error is detected virt-something can report the error as the reason, without running in debug mode.
All the virt-* tools that I've written have a non-zero exit code and print an error message on stderr when they fail. Errors from inside the appliance are propagated to the library and thence to the tool correctly.
I think the best thing to do is:
- spool up stdout + stderr from the tool
- if the exit code != 0, save the spooled output for analysis
- if the exit code == 0, discard it (or keep it if you like)
This is what we already do, and the result is not helpful. If you look at the log message in the previous message, basically the only info about the error is:
libguestfs error: guestfs_launch failed
I don't see what we can do with this error message.
Right, so in this particular instance the error message would tell us that you should run libguestfs-test-tool because your qemu/kernel/etc is broken in some way :-/
There's not a particularly good answer here if you don't want to ever use LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE, but perhaps you could run libguestfs-test-tool if you see any error which matches the substring /guestfs_launch/ ?
Another (orthogonal?) option:
Make LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE log elsewhere, not to stdout/err (e.g. some other file descriptor, or to a file passed via env or whatever). This way, it might make sense for vdsm to always pass these vars, continue logging all stdout/err, and log/keep debug/trace logs only on errors.
This now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
2020-11-09 01:05:42,031-0500 INFO (jsonrpc/4) [api.host] FINISH getAllVmIoTunePolicies return={'status': {'code': 0, 'message': 'Done'}, 'io_tune_policies_dict': {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [], 'current_values': [{'name': 'vda', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80', 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0, 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0, 'read_iops_sec': 0}}]}}} from=::1,34002 (api:54) 2020-11-09 01:05:42,038-0500 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Return 'Host.getAllVmIoTunePolicies' in bridge with {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [], 'current_values': [{'name': 'vda', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80', 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0, 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0, 'read_iops_sec': 0}}]}} (__init__:360) 2020-11-09 01:05:42,435-0500 DEBUG (tasks/3) [common.commands] FAILED: <err> = b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\nThis usually means the libguestfs appliance failed to start or crashed.\nDo:\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\nand run the command again. For further information, read:\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nYou can also run 'libguestfs-test-tool' and post the *complete* output\ninto a bug report or message to the libguestfs mailing list.\n\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \ncomplete output:\n\n virt-sparsify -v -x [...]\n"; <rc> = 1 (commands:98)
I suggest that if we have come to a dead-end and no-one has any clue, then we either patch something (vdsm?) to allow getting more information if this happens again, or open a bug for further discussion/prioritization.
Best regards, -- Didi
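Coming back to the retry approach above: with the libguestfs Python bindings, retrying the launch with a fresh handle each time might look roughly like this (a sketch with made-up retry counts, not the actual tool):

    import time
    import guestfs

    def launch_with_retries(disk_path, retries=3, delay=10):
        # Create a *new* handle per attempt (handles should not be reused
        # after a failed launch), and retry if guestfs_launch fails.
        for attempt in range(1, retries + 1):
            g = guestfs.GuestFS(python_return_dict=True)
            try:
                g.add_drive_opts(disk_path, readonly=0)
                g.launch()
                return g  # caller is responsible for g.shutdown()/g.close()
            except RuntimeError:
                g.close()
                if attempt == retries:
                    raise
                time.sleep(delay)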

On Mon, Nov 09, 2020 at 09:16:33AM +0200, Yedidyah Bar David wrote:
On Wed, Oct 14, 2020 at 8:54 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Oct 13, 2020 at 8:40 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 07:56:29PM +0300, Nir Soffer wrote:
On Tue, Oct 13, 2020 at 7:15 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 06:45:42PM +0300, Nir Soffer wrote:
I think this is the right solution - when virt-something tool fails, it should log the reason for the failure - the error that caused the tool to fail. I'm not sure this is easy to do as the failing code run inside a special VM. Maybe the code running in the VM should log the output in a machine readable way, so once an error is detected virt-something can report the error as the reason, without running in debug mode.
All the virt-* tools that I've written have a non-zero exit code and print an error message on stderr when they fail. Errors from inside the appliance are propagated to the library and thence to the tool correctly.
I think the best thing to do is:
- spool up stdout + stderr from the tool
- if the exit code != 0, save the spooled output for analysis
- if the exit code == 0, discard it (or keep it if you like)
This is what we already do, and the result is not helpful. If you look at the log message in the previous message, basically the only info about the error is:
libguestfs error: guestfs_launch failed
I don't see what we can do with this error message.
Right, so in this particular instance the error message would tell us that you should run libguestfs-test-tool because your qemu/kernel/etc is broken in some way :-/
There's not a particularly good answer here if you don't want to ever use LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE, but perhaps you could run libguestfs-test-tool if you see any error which matches the substring /guestfs_launch/ ?
Another (orthogonal?) option:
Make LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE log elsewhere, not to stdout/err (e.g. some other file descriptor, or to a file passed via env or whatever). This way, it might make sense for vdsm to always pass these vars, continue logging all stdout/err, and log/keep debug/trace logs only on errors.
This now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
2020-11-09 01:05:42,031-0500 INFO (jsonrpc/4) [api.host] FINISH getAllVmIoTunePolicies return={'status': {'code': 0, 'message': 'Done'}, 'io_tune_policies_dict': {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [], 'current_values': [{'name': 'vda', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80', 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0, 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0, 'read_iops_sec': 0}}]}}} from=::1,34002 (api:54) 2020-11-09 01:05:42,038-0500 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Return 'Host.getAllVmIoTunePolicies' in bridge with {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [], 'current_values': [{'name': 'vda', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80', 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0, 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0, 'read_iops_sec': 0}}]}} (__init__:360) 2020-11-09 01:05:42,435-0500 DEBUG (tasks/3) [common.commands] FAILED: <err> = b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\nThis usually means the libguestfs appliance failed to start or crashed.\nDo:\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\nand run the command again. For further information, read:\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nYou can also run 'libguestfs-test-tool' and post the *complete* output\ninto a bug report or message to the libguestfs mailing list.\n\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \ncomplete output:\n\n virt-sparsify -v -x [...]\n"; <rc> = 1 (commands:98)
I suggest that if we have come to a dead-end and no-one has any clue, then we either patch something (vdsm?) to allow getting more information if this happens again, or open a bug for further discussion/prioritization.
You definitely need to run libguestfs-test-tool from vdsm if we're going to get any further with this.

Rich.

-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. http://libguestfs.org/virt-v2v

On Mon, Nov 9, 2020 at 11:03 AM Richard W.M. Jones <rjones@redhat.com> wrote:
On Mon, Nov 09, 2020 at 09:16:33AM +0200, Yedidyah Bar David wrote:
On Wed, Oct 14, 2020 at 8:54 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Oct 13, 2020 at 8:40 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 07:56:29PM +0300, Nir Soffer wrote:
On Tue, Oct 13, 2020 at 7:15 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Tue, Oct 13, 2020 at 06:45:42PM +0300, Nir Soffer wrote: > I think this is the right solution - when virt-something tool fails, > it should log the reason for the failure - the error that caused the > tool to fail. I'm not sure this is easy to do as the failing code > run inside a special VM. Maybe the code running in the VM should log > the output in a machine readable way, so once an error is detected > virt-something can report the error as the reason, without running > in debug mode.
All the virt-* tools that I've written have a non-zero exit code and print an error message on stderr when they fail. Errors from inside the appliance are propagated to the library and thence to the tool correctly.
I think the best thing to do is:
- spool up stdout + stderr from the tool
- if the exit code != 0, save the spooled output for analysis
- if the exit code == 0, discard it (or keep it if you like)
This is what we already do, and the result is not helpful. If you look at the log message in the previous message, basically the only info about the error is:
libguestfs error: guestfs_launch failed
I don't see what we can do with this error message.
Right, so in this particular instance the error message would tell us that you should run libguestfs-test-tool because your qemu/kernel/etc is broken in some way :-/
There's not a particularly good answer here if you don't want to ever use LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE, but perhaps you could run libguestfs-test-tool if you see any error which matches the substring /guestfs_launch/ ?
Another (orthogonal?) option:
Make LIBGUESTFS_DEBUG/LIBGUESTFS_TRACE log elsewhere, not to stdout/err (e.g. some other file descriptor, or to a file passed via env or whatever). This way, it might make sense for vdsm to always pass these vars, continue logging all stdout/err, and log/keep debug/trace logs only on errors.
This now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
2020-11-09 01:05:42,031-0500 INFO (jsonrpc/4) [api.host] FINISH getAllVmIoTunePolicies return={'status': {'code': 0, 'message': 'Done'}, 'io_tune_policies_dict': {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [], 'current_values': [{'name': 'vda', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80', 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0, 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0, 'read_iops_sec': 0}}]}}} from=::1,34002 (api:54) 2020-11-09 01:05:42,038-0500 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Return 'Host.getAllVmIoTunePolicies' in bridge with {'c189ecb3-8f2e-4726-8766-7d2d9b514687': {'policy': [], 'current_values': [{'name': 'vda', 'path': '/rhev/data-center/mnt/192.168.200.4:_exports_nfs_share1/1d093232-d41e-483f-a915-62f8db3c972f/images/e7ee6417-b319-4d84-81a5-5d77cbce2385/710d2c10-e6b7-4d16-bd37-50a9d4e14a80', 'ioTune': {'total_bytes_sec': 0, 'read_bytes_sec': 0, 'write_bytes_sec': 0, 'total_iops_sec': 0, 'write_iops_sec': 0, 'read_iops_sec': 0}}]}} (__init__:360) 2020-11-09 01:05:42,435-0500 DEBUG (tasks/3) [common.commands] FAILED: <err> = b"virt-sparsify: error: libguestfs error: guestfs_launch failed.\nThis usually means the libguestfs appliance failed to start or crashed.\nDo:\n export LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1\nand run the command again. For further information, read:\n http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nYou can also run 'libguestfs-test-tool' and post the *complete* output\ninto a bug report or message to the libguestfs mailing list.\n\nIf reporting bugs, run virt-sparsify with debugging enabled and include the \ncomplete output:\n\n virt-sparsify -v -x [...]\n"; <rc> = 1 (commands:98)
I suggest that if we have come to a dead-end and no-one has any clue, then we either patch something (vdsm?) to allow getting more information if this happens again, or open a bug for further discussion/prioritization.
You definitely need to run libguestfs-test-tool from vdsm if we're going to get any further with this.
Now filed [1] on vdsm. Feel free to comment there about what else we should do to get more information. Thanks!

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1895843

Best regards,
Rich.
-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. http://libguestfs.org/virt-v2v
-- Didi

On Tue, Oct 13, 2020 at 6:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Oct 12, 2020 at 9:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
The next run of the job (480) did finish successfully. No idea if it was already fixed by a patch, or is simply a random/env issue.
I think this is env issue, we run on overloaded vms with small amount of memory. I have seen such radnom failures before.
Generally speaking, I think we must aim for zero failures due to "env issues" - and not ignore them as such.

It would obviously be nice if we had more hardware in CI, no doubt. But I wonder if perhaps stressing the system like we do (due to resource scarcity) is actually a good thing - it helps us find bugs that real users might also run into in legitimate scenarios, meaning: using what we recommend in terms of hardware etc., but with a load that is higher than what we have in CI per run - as, admittedly, we only have minimal _data_ there.

So: if we decide that some code "worked as designed" and failed due to an "env issue", I still think we should fix this - either in our code, or in CI. For the latter, I do not think it makes sense to just say "the machines are overloaded and don't have enough memory" - we must come up with concrete details, e.g. "We need at least X MiB RAM".

For the current issue, if we are certain that this is due to low memory, it's quite easy to e.g. revert this patch: https://gerrit.ovirt.org/110530

Obviously it will mean either longer queues or over-committing (higher load). Not sure which. But personally, I wouldn't do that without knowing more (e.g. following the other thread).

Best regards,
-- Didi

On 14 Oct 2020, at 08:14, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Oct 13, 2020 at 6:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Oct 12, 2020 at 9:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
The next run of the job (480) did finish successfully. No idea if it was already fixed by a patch, or is simply a random/env issue.
I think this is env issue, we run on overloaded vms with small amount of memory. I have seen such radnom failures before.
Generally speaking, I think we must aim for zero failures due to "env issues" - and not ignore them as such.
Exactly. We cannot ignore that any longer.
It would obviously be nice if we had more hardware in CI, no doubt.
there’s never enough
But I wonder if perhaps stressing the system like we do (due to resources scarcity) is actually a good thing - that it helps us find bugs that real users might also run into in actually legitimate scenarios
yes, it absolutely does.
- meaning, using what we recommend in terms of hardware etc. but with a load that is higher than what we have in CI per-run - as, admittedly, we only have minimal _data_ there.
So: If we decide that some code "worked as designed" and failed due to "env issue", I still think we should fix this - either in our code, or in CI.
yes!
For latter, I do not think it makes sense to just say "the machines are overloaded and not have enough memory" - we must come up with concrete details - e.g. "We need at least X MiB RAM".
I’ve spent quite some time analyzing the flakes in basic suite this past half year…so allow me to say that that’s usually just an excuse for a lousy test (or functionality:)
For current issue, if we are certain that this is due to low mem, it's quite easy to e.g. revert this patch:
https://gerrit.ovirt.org/110530
Obviously it will mean either longer queues or over-committing (higher load). Not sure which.
It's difficult to really pinpoint the reason. If it's happening rarely (as this one is) you'd need a statistically relevant comparison. Which takes time…

About this specific sparsify test - it was me who uncommented it a few months ago, after running around 100 tests over a weekend. It may have failed once (there were/are still some other flakes)… but considering the overall success rate was quite low at that time, that sounded acceptable to me.

If this is now happening more often then it does sound like a regression somewhere. It could be all the OST changes or test rearrangements, but it could also be a code regression. Either way it's supposed to be predictable. And it is - just not in the environment we use for this particular job. It's the old one without ost-images, inside the troublesome mock, so you don't know what it picked up or what the underlying system (outside of mock) really is.

Thanks,
michal
But personally, I wouldn't do that without knowing more (e.g. following the other thread).
Best regards, -- Didi

On Thu, Oct 15, 2020 at 12:44 PM Michal Skrivanek <mskrivan@redhat.com> wrote:
On 14 Oct 2020, at 08:14, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Oct 13, 2020 at 6:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Oct 12, 2020 at 9:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
The next run of the job (480) did finish successfully. No idea if it was already fixed by a patch, or is simply a random/env issue.
I think this is env issue, we run on overloaded vms with small amount of memory. I have seen such radnom failures before.
Generally speaking, I think we must aim for zero failures due to "env issues" - and not ignore them as such.
Exactly. We cannot ignore that any longer.
It would obviously be nice if we had more hardware in CI, no doubt.
there’s never enough
But I wonder if perhaps stressing the system like we do (due to resources scarcity) is actually a good thing - that it helps us find bugs that real users might also run into in actually legitimate scenarios
yes, it absolutely does.
- meaning, using what we recommend in terms of hardware etc. but with a load that is higher than what we have in CI per-run - as, admittedly, we only have minimal _data_ there.
So: If we decide that some code "worked as designed" and failed due to "env issue", I still think we should fix this - either in our code, or in CI.
yes!
Or, as applicable e.g. to the current case: if we can't reproduce, at least log more information, so that the next (random) reproduction reveals more...
For latter, I do not think it makes sense to just say "the machines are overloaded and not have enough memory" - we must come up with concrete details - e.g. "We need at least X MiB RAM".
I’ve spent quite some time analyzing the flakes in basic suite this past half year…so allow me to say that that’s usually just an excuse for a lousy test (or functionality:)
For current issue, if we are certain that this is due to low mem, it's quite easy to e.g. revert this patch:
https://gerrit.ovirt.org/110530
Obviously it will mean either longer queues or over-committing (higher load). Not sure which.
it’s difficult to pinpoint the reason really. If it’s happening rarely (as this one is) you’d need a statistically relevant comparison. Which takes time…
About this specific sparsify test - it was me uncommenting it few months ago, after running around 100 tests over a weekend. It may have failed once (there were/are still some other flakes)…but to me considering the overall success rate being quite low at that time it sounded acceptable.
... E.g. in the current case (I wasn't aware of the above): if it fails for you even once and you can't find the root cause, it's probably better to make sure we log more information, so that the next occurrence is more likely to help us find the root cause. Or open a bug for that, if you don't do this immediately.
If this is now happening more often then it does sound like a regression somewhere. Could be all the OST changes or tests rearrangements, but it also could be a code regression.
I have no idea if it happens more often. I think I only noticed this once.
Either way it’s supposed to be predictable.
Really? The failure of virt-sparsify? So perhaps reply on the other thread explaining how to reproduce, or even a fix :-)
And it is, just not in this environment we use for this particular job - it’s the old one without ost-images, inside the troublesome mock, so you don’t know what it picked up, what’s the really underlying system(outside of mock)
Thanks, michal
But personally, I wouldn't do that without knowing more (e.g. following the other thread).
Best regards, -- Didi
-- Didi

On 15 Oct 2020, at 12:16, Yedidyah Bar David <didi@redhat.com> wrote:
On Thu, Oct 15, 2020 at 12:44 PM Michal Skrivanek <mskrivan@redhat.com> wrote:
On 14 Oct 2020, at 08:14, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Oct 13, 2020 at 6:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Oct 12, 2020 at 9:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
The next run of the job (480) did finish successfully. No idea if it was already fixed by a patch, or is simply a random/env issue.
I think this is env issue, we run on overloaded vms with small amount of memory. I have seen such radnom failures before.
Generally speaking, I think we must aim for zero failures due to "env issues" - and not ignore them as such.
Exactly. We cannot ignore that any longer.
It would obviously be nice if we had more hardware in CI, no doubt.
there’s never enough
But I wonder if perhaps stressing the system like we do (due to resources scarcity) is actually a good thing - that it helps us find bugs that real users might also run into in actually legitimate scenarios
yes, it absolutely does.
- meaning, using what we recommend in terms of hardware etc. but with a load that is higher than what we have in CI per-run - as, admittedly, we only have minimal _data_ there.
So: If we decide that some code "worked as designed" and failed due to "env issue", I still think we should fix this - either in our code, or in CI.
Either way it’s supposed to be predictable.
Really? The failure of virt-sparsify? So perhaps reply on the other thread explaining how to reproduce, or even fix :-)
Sadly, no :) The environment is. And that’s why we’re moving towards that: ost-images gives complete software isolation (all repos are disabled); the only thing it still does over the network is download the relatively small CirrOS image from glance.ovirt.org. And running OST on bare metal gives you isolation from possible concurrent runs in the CI env. As for a follow-up, it’s a virt test - adding Arik.

On Thu, Oct 15, 2020 at 2:00 PM Michal Skrivanek <mskrivan@redhat.com> wrote:
Sadly, no :) The environment is. And that’s why we’re moving towards that: ost-images gives complete software isolation (all repos are disabled); the only thing it still does over the network is download the relatively small CirrOS image from glance.ovirt.org. And running OST on bare metal gives you isolation from possible concurrent runs in the CI env.
Just having the software well-controlled does not mean that everything is predictable. Bugs can also be due to timing, randomness, etc.
As for a follow-up, it’s a virt test - adding Arik.
Nir is already corresponding mainly with Daniel from the virt team on the other thread - better to continue the discussion there, IMO. Best regards,
-- Didi
participants (5)
- Michal Skrivanek
- Nir Soffer
- Richard W.M. Jones
- Sam Eiderman
- Yedidyah Bar David