add_secondary_storage_domains failed (was: [oVirt Jenkins] ovirt-system-tests_he-basic-suite-4.1 - Build # 223 - Still Failing!)

Nir Soffer nsoffer at redhat.com
Mon Mar 19 07:58:30 UTC 2018


On Mon, Mar 19, 2018 at 9:22 AM Yedidyah Bar David <didi at redhat.com> wrote:

>  On Mon, Mar 19, 2018 at 9:01 AM, Nir Soffer <nsoffer at redhat.com> wrote:
> >
> >
> > On Sun, Mar 18, 2018 at 4:21 PM Yedidyah Bar David <didi at redhat.com>
> wrote:
> >>
> >> On Sun, Mar 18, 2018 at 2:48 PM, Yedidyah Bar David <didi at redhat.com>
> >> wrote:
> >> > On Sun, Mar 18, 2018 at 1:45 PM, Yedidyah Bar David <didi at redhat.com>
> >> > wrote:
> >> >> On Sun, Mar 18, 2018 at 11:20 AM,  <jenkins at jenkins.phx.ovirt.org>
> >> >> wrote:
> >> >>> Project:
> >> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.1/
> >> >>> Build:
> >> >>>
> http://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.1/223/
> >> >>> Build Number: 223
> >> >>> Build Status:  Still Failing
> >> >>> Triggered By: Started by timer
> >> >>
> >> >> It was broken by:
> >> >>
> >> >> [1] https://gerrit.ovirt.org/88483
> >> >>
> >> >> It should be fixed by:
> >> >>
> >> >> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1554283
> >> >>
> >> >> [2] is on modified, no idea about its status other than that.
> >> >>
> >> >> I didn't intend to merge [1] before [2] is fixed, not sure if
> >> >> Sandro didn't notice my comment there, or thought that [2] is fixed.
> >> >>
> >> >> If it's annoying we can revert [1] and re-merge when [2] is fixed.
> >> >
> >> > Talked with Gal, and pushed this, which should hopefully fix:
> >> >
> >> > https://gerrit.ovirt.org/89136
> >>
> >> It indeed seems to fix [1][2]:
> >>
> >> 13:13:24   # he_get_shared_config:
> >> 13:13:26   # he_get_shared_config: Success (in 0:00:01)
> >> 13:13:26   # sleep:
> >> 13:15:26   # sleep: Success (in 0:02:00)
> >> 13:15:26   # add_he_hosts:
> >> 13:16:18   # add_he_hosts: Success (in 0:00:52)
> >> 13:16:18   # he_check_ha_agent:
> >> 13:16:19   # he_check_ha_agent: Success (in 0:00:00)
> >>
> >> But later fails:
> >>
> >> 13:16:20   # add_secondary_storage_domains:
> >> 13:19:30 Error while running thread
> >> 13:19:30 Traceback (most recent call last):
> >> 13:19:30   File "/usr/lib/python2.7/site-packages/lago/utils.py", line
> >> 58, in _ret_via_queue
> >> 13:19:30     queue.put({'return': func()})
> >> 13:19:30   File
> >>
> >>
> "/home/jenkins/workspace/ovirt-system-tests_master_check-patch-el7-x86_64/ovirt-system-tests/he-basic-suite-4.1/test-scenarios/002_bootstrap.py",
> >> line 491, in add_nfs_storage_domain
> >> 13:19:30     add_generic_nfs_storage_domain(prefix, SD_NFS_NAME,
> >> SD_NFS_HOST_NAME, SD_NFS_PATH)
> >> 13:19:30   File
> >>
> >>
> "/home/jenkins/workspace/ovirt-system-tests_master_check-patch-el7-x86_64/ovirt-system-tests/he-basic-suite-4.1/test-scenarios/002_bootstrap.py",
> >> line 496, in add_generic_nfs_storage_domain
> >> 13:19:30     add_generic_nfs_storage_domain_4(prefix, sd_nfs_name,
> >> nfs_host_name, mount_path, sd_format, sd_type, nfs_version)
> >> 13:19:30   File
> >>
> >>
> "/home/jenkins/workspace/ovirt-system-tests_master_check-patch-el7-x86_64/ovirt-system-tests/he-basic-suite-4.1/test-scenarios/002_bootstrap.py",
> >> line 552, in add_generic_nfs_storage_domain_4
> >> 13:19:30     _add_storage_domain_4(api, p)
> >> 13:19:30   File
> >>
> >>
> "/home/jenkins/workspace/ovirt-system-tests_master_check-patch-el7-x86_64/ovirt-system-tests/he-basic-suite-4.1/test-scenarios/002_bootstrap.py",
> >> line 466, in _add_storage_domain_4
> >> 13:19:30     id=sd.id,
> >> 13:19:30   File
> >> "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 2219,
> >> in add
> >> 13:19:30     return self._internal_add(storage_domain, headers, query,
> >> wait)
> >> 13:19:30   File
> >> "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 223,
> >> in _internal_add
> >> 13:19:30     return future.wait() if wait else future
> >> 13:19:30   File
> >> "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 53, in
> >> wait
> >> 13:19:30     return self._code(response)
> >> 13:19:30   File
> >> "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 220,
> >> in callback
> >> 13:19:30     self._check_fault(response)
> >> 13:19:30   File
> >> "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 123,
> >> in _check_fault
> >> 13:19:30     self._raise_error(response, body)
> >> 13:19:30   File
> >> "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 109,
> >> in _raise_error
> >> 13:19:30     raise error
> >> 13:19:30 Error: Fault reason is "Operation Failed". Fault detail is
> >> "[Storage domain cannot be reached. Please ensure it is accessible
> >> from the host(s).]". HTTP response code is 400.
> >> 13:19:30 Error while running thread
> >>
> >> I'll retrigger now, but perhaps someone from storage wants to check.
> >> vdsm log has[3]:
> >>
> >> 2018-03-18 08:59:52,918-0400 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer]
> >> RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:539)
> >> 2018-03-18 08:59:52,924-0400 INFO  (jsonrpc/4) [vdsm.api] START
> >> prepareImage(sdUUID=u'424e809d-b7ad-4ed5-b6a1-5426d373a5d2',
> >> spUUID=u'8509c64f-cdd7-4713-9e09-a79d90ba26ed',
> >> imgUUID=u'718090ba-b36e-45cc-bcd6-597c16a766b9',
> >> leafUUID=u'23ee843b-20e0-4afe-98dc-e165334ac710', allowIllegal=False)
> >> from=::1,36020, task_id=92aeabf2-084a-4c98-a6e5-03efae38e7b3 (api:46)
> >> 2018-03-18 08:59:52,928-0400 INFO  (jsonrpc/4) [vdsm.api] FINISH
> >> prepareImage error=Volume does not exist:
> >> (u'23ee843b-20e0-4afe-98dc-e165334ac710',) from=::1,36020,
> >> task_id=92aeabf2-084a-4c98-a6e5-03efae38e7b3 (api:50)
> >> 2018-03-18 08:59:52,928-0400 ERROR (jsonrpc/4)
> >> [storage.TaskManager.Task]
> >> (Task='92aeabf2-084a-4c98-a6e5-03efae38e7b3') Unexpected error
> >> (task:872)
> >> Traceback (most recent call last):
> >>   File "/usr/share/vdsm/storage/task.py", line 879, in _run
> >>     return fn(*args, **kargs)
> >>   File "<string>", line 2, in prepareImage
> >>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48,
> in
> >> method
> >>     ret = func(*args, **kwargs)
> >>   File "/usr/share/vdsm/storage/hsm.py", line 3137, in prepareImage
> >>     raise se.VolumeDoesNotExist(leafUUID)
> >> VolumeDoesNotExist: Volume does not exist:
> >> (u'23ee843b-20e0-4afe-98dc-e165334ac710',)
> >
> >
> > Immediately after this failed prepare, we see:
> >
> > 2018-03-18 08:59:52,941-0400 INFO (jsonrpc/7) [vdsm.api] START
> > createVolume(sdUUID=u'424e809d-b7ad-4ed5-b6a1-5426d373a5d2',
> > spUUID=u'8509c64f-cdd7-4713-9e09-a79d90ba26ed',
> > imgUUID=u'718090ba-b36e-45cc-bcd6-597c16a766b9', size=u'1048576',
> > volFormat=5, preallocate=1, diskType=2,
> > volUUID=u'23ee843b-20e0-4afe-98dc-e165334ac710',
> > desc=u'hosted-engine.lockspace',
> > srcImgUUID=u'00000000-0000-0000-0000-000000000000',
> > srcVolUUID=u'00000000-0000-0000-0000-000000000000', initialSize=None)
> > from=::1,36020, task_id=c4c73314-ee10-419a-8265-b8b87763e99f (api:46)
> >
> > So that volume really did not exit when it was prepared, which means it
> is a
> > client
> > error, not vdsm error.
> >
> > The caller (hosted engine agent?)
>
> The caller is ovirt-system-tests. The code for
> add_secondary_storage_domains is duplicated among most of the suites,
> didn't check variations. See e.g.:
>
>
> https://github.com/oVirt/ovirt-system-tests/blob/master/basic-suite-master/test-scenarios/002_bootstrap.py#L626
>
> It's not calling vdsm directly, but uses the engine api (through the
> python sdk). Not sure if it's, as you say, "a bug in the client" (the
> engine), or we use the sdk improperly, or something else. Can someone
> please advice? Patches to ovirt-system-tests are welcome :-),
> especially patches to stress-test this, as it seems it's a sensitive
> timing issue - usually does not fail.
>

Smells like bug in the code calling the sdk. Engine talks with vdsm for
years and it is unlikely that it will have such timing bugs.

The owner of this test should check that the test is valid.

The specific suite that failed above is he-basic-suite-4.1 . I think
> that the code there is identical with basic-suite-4.1. Both are
> different from basic-suite-master. No idea if the diff is due to diff
> between the versions, or it's simply a bug fixed in the master suite
> and not fixed in 4.1 suites.
>
> > should use Volume.getInfo to check if a
> > volume
> > exits. Non existing volume is considered expected error in this call and
> > will be
> > logged as with INFO log level.
> >
> > See
> >
> https://github.com/oVirt/vdsm/blob/f286cf7825578d8997ca6abc0801a823dee247ef/lib/vdsm/storage/hsm.py#L3075
> >
> >>
> >> [1]
> >>
> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/4537/
> >> [2]
> >>
> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/4537/consoleFull
> >> [3]
> >>
> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/4537/artifact/exported-artifacts/he-basic-suite-4.1__logs/test_logs/he-basic-suite-4.1/post-002_bootstrap.py/lago-he-basic-suite-4-1-host0/_var_log/vdsm/vdsm.log
> >>
> >> >
> >> >>
> >> >>>
> >> >>> -------------------------------------
> >> >>> Changes Since Last Success:
> >> >>> -------------------------------------
> >> >>> Changes for Build #218
> >> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >> >>>
> >> >>>
> >> >>> Changes for Build #219
> >> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >> >>>
> >> >>>
> >> >>> Changes for Build #220
> >> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >> >>>
> >> >>>
> >> >>> Changes for Build #221
> >> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >> >>>
> >> >>>
> >> >>> Changes for Build #222
> >> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >> >>>
> >> >>>
> >> >>> Changes for Build #223
> >> >>> [Martin Sivak] Cleanup of test code - next() replaced with any()
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> -----------------
> >> >>> Failed Tests:
> >> >>> -----------------
> >> >>> 1 tests failed.
> >> >>> FAILED:  002_bootstrap.he_check_ha_agent
> >> >>>
> >> >>> Error Message:
> >> >>> False is not true
> >> >>> -------------------- >> begin captured logging <<
> --------------------
> >> >>> lago.ssh: DEBUG: start task:cb75d7f3-e333-47f1-b3e9-76ff9c0b4b21:Get
> >> >>> ssh client for lago-he-basic-suite-4-1-host0:
> >> >>> lago.ssh: DEBUG: end task:cb75d7f3-e333-47f1-b3e9-76ff9c0b4b21:Get
> ssh
> >> >>> client for lago-he-basic-suite-4-1-host0:
> >> >>> lago.ssh: DEBUG: Running 7e67a956 on lago-he-basic-suite-4-1-host0:
> >> >>> grep Permission denied /var/log/ovirt-hosted-engine-ha/agent.log
> >> >>> lago.ssh: DEBUG: Command 7e67a956 on lago-he-basic-suite-4-1-host0
> >> >>> returned with 2
> >> >>> lago.ssh: DEBUG: Command 7e67a956 on lago-he-basic-suite-4-1-host0
> >> >>> output:
> >> >>>  /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >> >>> Permission denied: '/var/run/ovirt-hosted-engine-ha/vm.conf'
> >> >>>
> >> >>> lago.ssh: DEBUG: Command 7e67a956 on lago-he-basic-suite-4-1-host0
> >> >>> errors:
> >> >>>  grep: denied: No such file or directory
> >> >>>
> >> >>> --------------------- >> end captured logging <<
> ---------------------
> >> >>>
> >> >>> Stack Trace:
> >> >>>   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
> >> >>>     testMethod()
> >> >>>   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in
> >> >>> runTest
> >> >>>     self.test(*self.arg)
> >> >>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
> >> >>> 129, in wrapped_test
> >> >>>     test()
> >> >>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
> >> >>> 59, in wrapper
> >> >>>     return func(get_test_prefix(), *args, **kwargs)
> >> >>>   File
> >> >>>
> "/home/jenkins/workspace/ovirt-system-tests_he-basic-suite-4.1/ovirt-system-tests/he-basic-suite-4.1/test-scenarios/002_bootstrap.py",
> >> >>> line 1036, in he_check_ha_agent
> >> >>>     'Permission denied' not in result.out
> >> >>>   File "/usr/lib64/python2.7/unittest/case.py", line 462, in
> >> >>> assertTrue
> >> >>>     raise self.failureException(msg)
> >> >>> "False is not true\n-------------------- >> begin captured logging
> <<
> >> >>> --------------------\nlago.ssh: DEBUG: start
> >> >>> task:cb75d7f3-e333-47f1-b3e9-76ff9c0b4b21:Get ssh client for
> >> >>> lago-he-basic-suite-4-1-host0:\nlago.ssh: DEBUG: end
> >> >>> task:cb75d7f3-e333-47f1-b3e9-76ff9c0b4b21:Get ssh client for
> >> >>> lago-he-basic-suite-4-1-host0:\nlago.ssh: DEBUG: Running 7e67a956 on
> >> >>> lago-he-basic-suite-4-1-host0: grep Permission denied
> >> >>> /var/log/ovirt-hosted-engine-ha/agent.log\nlago.ssh: DEBUG: Command
> 7e67a956
> >> >>> on lago-he-basic-suite-4-1-host0 returned with 2\nlago.ssh: DEBUG:
> Command
> >> >>> 7e67a956 on lago-he-basic-suite-4-1-host0 output:\n
> >> >>> /var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> Permission
> >> >>> denied:
> >> >>>
> '/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> >> >>> [Errno 13] Permission denied:
> >> >>>
> '/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> >> >>> [Errno 13] Permission denied:
> '/var/run/ovirt-hosted-engine-ha/vm.conf
> >>  '\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError: [Errno 13]
> >> Permission denied:
> >>
> '/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> >> [Errno 13] Permission denied:
> >>
> '/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> >> [Errno 13] Permission denied:
> >>
> '/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> >> [Errno 13] Permission denied:
> >>
> '/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> >> [Errno 13] Permission denied:
> >>
> '/var/run/ovirt-hosted-engine-ha/vm.conf'\n/var/log/ovirt-hosted-engine-ha/agent.log:IOError:
> >> [Errno 13] Permission denied:
> >> '/var/run/ovirt-hosted-engine-ha/vm.conf'\n\nlago.ssh: DEBUG: Command
> >> 7e67a956 on lago-he-basic-suite-4-1-host0  errors:\n grep: denied: No
> such
> >> file or directory\n\n--------------------- >> end captured logging <<
> >> ---------------------"
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Didi
> >> >
> >> >
> >> >
> >> > --
> >> > Didi
> >>
> >>
> >>
> >> --
> >> Didi
> >> _______________________________________________
> >> Infra mailing list
> >> Infra at ovirt.org
> >> http://lists.ovirt.org/mailman/listinfo/infra
>
>
>
> --
> Didi
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ovirt.org/pipermail/infra/attachments/20180319/dc8521bf/attachment-0001.html>


More information about the Infra mailing list