On Sun, Mar 18, 2018 at 4:21 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>
> On Sun, Mar 18, 2018 at 2:48 PM, Yedidyah Bar David <didi(a)redhat.com>
> wrote:
> > On Sun, Mar 18, 2018 at 1:45 PM, Yedidyah Bar David <didi(a)redhat.com>
> > wrote:
> >> On Sun, Mar 18, 2018 at 11:20 AM, <jenkins(a)jenkins.phx.ovirt.org>
> >> wrote:
> >>> Project:
> >>>
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.1/
> >>> Build:
> >>>
http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.1/223/
> >>> Build Number: 223
> >>> Build Status: Still Failing
> >>> Triggered By: Started by timer
> >>
> >> It was broken by:
> >>
> >> [1]
https://gerrit.ovirt.org/88483
> >>
> >> It should be fixed by:
> >>
> >> [2]
https://bugzilla.redhat.com/show_bug.cgi?id=1554283
> >>
> >> [2] is on modified, no idea about its status other than that.
> >>
> >> I didn't intend to merge [1] before [2] is fixed, not sure if
> >> Sandro didn't notice my comment there, or thought that [2] is fixed.
> >>
> >> If it's annoying we can revert [1] and re-merge when [2] is fixed.
> >
> > Talked with Gal, and pushed this, which should hopefully fix:
> >
> >
https://gerrit.ovirt.org/89136
>
> It indeed seems to fix [1][2]:
>
> 13:13:24 # he_get_shared_config:
> 13:13:26 # he_get_shared_config: Success (in 0:00:01)
> 13:13:26 # sleep:
> 13:15:26 # sleep: Success (in 0:02:00)
> 13:15:26 # add_he_hosts:
> 13:16:18 # add_he_hosts: Success (in 0:00:52)
> 13:16:18 # he_check_ha_agent:
> 13:16:19 # he_check_ha_agent: Success (in 0:00:00)
>
> But later fails:
>
> 13:16:20 # add_secondary_storage_domains:
> 13:19:30 Error while running thread
> 13:19:30 Traceback (most recent call last):
> 13:19:30 File "/usr/lib/python2.7/site-packages/lago/utils.py", line
> 58, in _ret_via_queue
> 13:19:30 queue.put({'return': func()})
> 13:19:30 File
>
>
"/home/jenkins/workspace/ovirt-system-tests_master_check-patch-el7-x86_64/ovirt-system-tests/he-basic-suite-4.1/test-scenarios/002_bootstrap.py",
> line 491, in add_nfs_storage_domain
> 13:19:30 add_generic_nfs_storage_domain(prefix, SD_NFS_NAME,
> SD_NFS_HOST_NAME, SD_NFS_PATH)
> 13:19:30 File
>
>
"/home/jenkins/workspace/ovirt-system-tests_master_check-patch-el7-x86_64/ovirt-system-tests/he-basic-suite-4.1/test-scenarios/002_bootstrap.py",
> line 496, in add_generic_nfs_storage_domain
> 13:19:30 add_generic_nfs_storage_domain_4(prefix, sd_nfs_name,
> nfs_host_name, mount_path, sd_format, sd_type, nfs_version)
> 13:19:30 File
>
>
"/home/jenkins/workspace/ovirt-system-tests_master_check-patch-el7-x86_64/ovirt-system-tests/he-basic-suite-4.1/test-scenarios/002_bootstrap.py",
> line 552, in add_generic_nfs_storage_domain_4
> 13:19:30 _add_storage_domain_4(api, p)
> 13:19:30 File
>
>
"/home/jenkins/workspace/ovirt-system-tests_master_check-patch-el7-x86_64/ovirt-system-tests/he-basic-suite-4.1/test-scenarios/002_bootstrap.py",
> line 466, in _add_storage_domain_4
> 13:19:30 id=sd.id,
> 13:19:30 File
> "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 2219,
> in add
> 13:19:30 return self._internal_add(storage_domain, headers, query,
> wait)
> 13:19:30 File
> "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 223,
> in _internal_add
> 13:19:30 return future.wait() if wait else future
> 13:19:30 File
> "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 53, in
> wait
> 13:19:30 return self._code(response)
> 13:19:30 File
> "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 220,
> in callback
> 13:19:30 self._check_fault(response)
> 13:19:30 File
> "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 123,
> in _check_fault
> 13:19:30 self._raise_error(response, body)
> 13:19:30 File
> "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 109,
> in _raise_error
> 13:19:30 raise error
> 13:19:30 Error: Fault reason is "Operation Failed". Fault detail is
> "[Storage domain cannot be reached. Please ensure it is accessible
> from the host(s).]". HTTP response code is 400.
> 13:19:30 Error while running thread
>
> I'll retrigger now, but perhaps someone from storage wants to check.
> vdsm log has[3]:
>
> 2018-03-18 08:59:52,918-0400 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer]
> RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:539)
> 2018-03-18 08:59:52,924-0400 INFO (jsonrpc/4) [vdsm.api] START
> prepareImage(sdUUID=u'424e809d-b7ad-4ed5-b6a1-5426d373a5d2',
> spUUID=u'8509c64f-cdd7-4713-9e09-a79d90ba26ed',
> imgUUID=u'718090ba-b36e-45cc-bcd6-597c16a766b9',
> leafUUID=u'23ee843b-20e0-4afe-98dc-e165334ac710', allowIllegal=False)
> from=::1,36020, task_id=92aeabf2-084a-4c98-a6e5-03efae38e7b3 (api:46)
> 2018-03-18 08:59:52,928-0400 INFO (jsonrpc/4) [vdsm.api] FINISH
> prepareImage error=Volume does not exist:
> (u'23ee843b-20e0-4afe-98dc-e165334ac710',) from=::1,36020,
> task_id=92aeabf2-084a-4c98-a6e5-03efae38e7b3 (api:50)
> 2018-03-18 08:59:52,928-0400 ERROR (jsonrpc/4)
> [storage.TaskManager.Task]
> (Task='92aeabf2-084a-4c98-a6e5-03efae38e7b3') Unexpected error
> (task:872)
> Traceback (most recent call last):
> File "/usr/share/vdsm/storage/task.py", line 879, in _run
> return fn(*args, **kargs)
> File "<string>", line 2, in prepareImage
> File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in
> method
> ret = func(*args, **kwargs)
> File "/usr/share/vdsm/storage/hsm.py", line 3137, in prepareImage
> raise se.VolumeDoesNotExist(leafUUID)
> VolumeDoesNotExist: Volume does not exist:
> (u'23ee843b-20e0-4afe-98dc-e165334ac710',)
Immediately after this failed prepare, we see:
2018-03-18 08:59:52,941-0400 INFO (jsonrpc/7) [vdsm.api] START
createVolume(sdUUID=u'424e809d-b7ad-4ed5-b6a1-5426d373a5d2',
spUUID=u'8509c64f-cdd7-4713-9e09-a79d90ba26ed',
imgUUID=u'718090ba-b36e-45cc-bcd6-597c16a766b9', size=u'1048576',
volFormat=5, preallocate=1, diskType=2,
volUUID=u'23ee843b-20e0-4afe-98dc-e165334ac710',
desc=u'hosted-engine.lockspace',
srcImgUUID=u'00000000-0000-0000-0000-000000000000',
srcVolUUID=u'00000000-0000-0000-0000-000000000000', initialSize=None)
from=::1,36020, task_id=c4c73314-ee10-419a-8265-b8b87763e99f (api:46)
So that volume really did not exit when it was prepared, which means it is a
client
error, not vdsm error.
The caller (hosted engine agent?)
The caller is ovirt-system-tests. The code for
add_secondary_storage_domains is duplicated among most of the suites,
didn't check variations. See e.g.:
It's not calling vdsm directly, but uses the engine api (through the
python sdk). Not sure if it's, as you say, "a bug in the client" (the
engine), or we use the sdk improperly, or something else. Can someone
please advice? Patches to ovirt-system-tests are welcome :-),
especially patches to stress-test this, as it seems it's a sensitive
timing issue - usually does not fail.
The specific suite that failed above is he-basic-suite-4.1 . I think
that the code there is identical with basic-suite-4.1. Both are
different from basic-suite-master. No idea if the diff is due to diff
between the versions, or it's simply a bug fixed in the master suite
and not fixed in 4.1 suites.
should use Volume.getInfo to check if a
volume
exits. Non existing volume is considered expected error in this call and
will be
logged as with INFO log level.
See
https://github.com/oVirt/vdsm/blob/f286cf7825578d8997ca6abc0801a823dee247...
>
> [1]
>
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x8...
> [2]
>
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x8...
> [3]
>
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x8...
>
> >
> >>
> >>>
> >>> -------------------------------------
> >>> Changes Since Last Success:
> >>> -------------------------------------
> >>> Changes for Build #218
> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >>>
> >>>
> >>> Changes for Build #219
> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >>>
> >>>
> >>> Changes for Build #220
> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >>>
> >>>
> >>> Changes for Build #221
> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >>>
> >>>
> >>> Changes for Build #222
> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >>>
> >>>
> >>> Changes for Build #223
> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >>>
> >>>
> >>>
> >>>
> >>> -----------------
> >>> Failed Tests:
> >>> -----------------
> >>> 1 tests failed.
> >>> FAILED: 002_bootstrap.he_check_ha_agent
> >>>
> >>> Error Message:
> >>> False is not true
> >>> -------------------- >> begin captured logging <<
--------------------
> >>> lago.ssh: DEBUG: start task:cb75d7f3-e333-47f1-b3e9-76ff9c0b4b21:Get
> >>> ssh client for lago-he-basic-suite-4-1-host0:
> >>> lago.ssh: DEBUG: end task:cb75d7f3-e333-47f1-b3e9-76ff9c0b4b21:Get ssh
> >>> client for lago-he-basic-suite-4-1-host0:
> >>> lago.ssh: DEBUG: Running 7e67a956 on lago-he-basic-suite-4-1-host0:
> >>> grep Permission denied /var/log/ovirt-hosted-engine-ha/agent.log
> >>> lago.ssh: DEBUG: Command 7e67a956 on lago-he-basic-suite-4-1-host0
> >>> returned with 2
> >>> lago.ssh: DEBUG: Command 7e67a956 on lago-he-basic-suite-4-1-host0
> >>> output:
> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >>>
> >>> lago.ssh: DEBUG: Command 7e67a956 on lago-he-basic-suite-4-1-host0
> >>> errors:
> >>> grep: denied: No such file or directory
> >>>
> >>> --------------------- >> end captured logging <<
---------------------
> >>>
> >>> Stack Trace:
> >>> File "/usr/lib64/python2.7/unittest/case.py", line 369, in
run
> >>> testMethod()
> >>> File "/usr/lib/python2.7/site-packages/nose/case.py", line
197, in
> >>> runTest
> >>> self.test(*self.arg)
> >>> File
"/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
> >>> 129, in wrapped_test
> >>> test()
> >>> File
"/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
> >>> 59, in wrapper
> >>> return func(get_test_prefix(), *args, **kwargs)
> >>> File
> >>>
"/home/jenkins/workspace/ovirt-system-tests_he-basic-suite-4.1/ovirt-system-tests/he-basic-suite-4.1/test-scenarios/002_bootstrap.py",
> >>> line 1036, in he_check_ha_agent
> >>> 'Permission denied' not in result.out
> >>> File "/usr/lib64/python2.7/unittest/case.py", line 462, in
> >>> assertTrue
> >>> raise self.failureException(msg)
> >>> "False is not true\n-------------------- >> begin captured
logging <<
> >>> --------------------\nlago.ssh: DEBUG: start
> >>> task:cb75d7f3-e333-47f1-b3e9-76ff9c0b4b21:Get ssh client for
> >>> lago-he-basic-suite-4-1-host0:\nlago.ssh: DEBUG: end
> >>> task:cb75d7f3-e333-47f1-b3e9-76ff9c0b4b21:Get ssh client for
> >>> lago-he-basic-suite-4-1-host0:\nlago.ssh: DEBUG: Running 7e67a956 on
> >>> lago-he-basic-suite-4-1-host0: grep Permission denied
> >>> /var/log/ovirt-hosted-engine-ha/agent.log\nlago.ssh: DEBUG: Command
7e67a956
> >>> on lago-he-basic-suite-4-1-host0 returned with 2\nlago.ssh: DEBUG:
Command
> >>> 7e67a956 on lago-he-basic-suite-4-1-host0 output:\n
> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
Permission
> >>> denied:
> >>>
'/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> >>> [Errno 13] Permission denied:
> >>>
'/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> >>> [Errno 13] Permission denied:
'/var/run/ovirt-hosted-engine-ha/vm.conf
> '\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> Permission denied:
>
'/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> [Errno 13] Permission denied:
>
'/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> [Errno 13] Permission denied:
>
'/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> [Errno 13] Permission denied:
>
'/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> [Errno 13] Permission denied:
>
'/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> [Errno 13] Permission denied:
> '/var/run/ovirt-hosted-engine-ha/vm.conf'\n\nlago.ssh: DEBUG: Command
> 7e67a956 on lago-he-basic-suite-4-1-host0 errors:\n grep: denied: No such
> file or directory\n\n--------------------- >> end captured logging <<
> ---------------------"
> >>
> >>
> >>
> >> --
> >> Didi
> >
> >
> >
> > --
> > Didi
>
>
>
> --
> Didi
> _______________________________________________
> Infra mailing list
> Infra(a)ovirt.org
>
http://lists.ovirt.org/mailman/listinfo/infra