and its only failing on one host as well.
On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
Hmmm, that would suggest, that 'vdsm-tool' command is not
found on the
host. There was no work done around 'vdsm-tool's 'vdsm-id' recently.
On 3/25/19 2:08 PM, Dafna Ron wrote:
Actually, I can see in engine that it is trying to start vdsm installation
on the host:
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter]
(default task-1) [] Entered SsoRestApiAuthFilter
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter]
(default task-1) [] SsoRestApiAuthFilter authenticating with sso
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter]
(default task-1) [] SsoRestApiAuthFilter authenticating using BEARER header
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter]
(default task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER header
2019-03-25 07:48:33,123-04 DEBUG
[org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) []
Entered SsoRestApiNegotiationFilter
2019-03-25 07:48:33,124-04 DEBUG
[org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) []
SsoRestApiNegotiationFilter Not performing Negotiate Auth
2019-03-25 07:48:33,143-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call
string is [{call getclusterforuserbyclustername(?, ?, ?)}]
2019-03-25 07:48:33,143-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure
[GetClusterForUserByClusterName] compiled
2019-03-25 07:48:33,145-04 DEBUG
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName,
NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed:
7ms
2019-03-25 07:48:33,161-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [] Compiled stored procedure. Call string is [{call
getvmstaticbyname(?)}]
2019-03-25 07:48:33,161-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
2019-03-25 07:48:33,206-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call
string is [{call getvdsbyname(?)}]
2019-03-25 07:48:33,206-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure
[GetVdsByName] compiled
2019-03-25 07:48:33,223-04 DEBUG
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params:
[lago-basic-suite-4-3-host-1], timeElapsed: 20ms
2019-03-25 07:48:33,225-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call
string is [{call getvdsbyhostname(?)}]
2019-03-25 07:48:33,225-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure
[GetVdsByHostName] compiled
2019-03-25 07:48:33,252-04 DEBUG
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params:
[lago-basic-suite-4-3-host-1], timeElapsed: 28ms
2019-03-25 07:48:33,254-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call
string is [{call getstorage_poolsbyclusterid(?)}]
2019-03-25 07:48:33,254-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure
[Getstorage_poolsByClusterId] compiled
2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting
'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected:
'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating:
'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated:
'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default
task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool
vdsm-id'
2019-03-25 07:48:33,902-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request
on host: Command returned failure code 127 during SSH session
'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand
internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type:
ClusterAction group CREATE_HOST with role type ADMIN
On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <dron(a)redhat.com> wrote:
> we have another failing patch with the same test:
>
http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
>
> its obviously not related to the patch but there is something that is
> causing these failures randomly.
> From what I can see in the current failed job, you are correct and host-1
> and engine does not even try to deploy host-1.
>
> i can see host-1 is getting error 127 in lago for ntpd and I can see
> network manager errors in the host messages log
>
> In the host messages log I can see several messages that I suspect may
> cause issues in communication between engine and host:
>
> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2:
> _sasldb_getkeyhandle has failed
> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting
> entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting
> entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting
> entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>
>
> ces/12)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
> bondscan-UMJa2S is being created...
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
> bondscan-UMJa2S is being deleted...
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info>
> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device
> (/org/freedesktop/NetworkManager/Devices/13)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
> net_device) (unregistering): Released all slaves
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
> bondscan-liwvMR is being created...
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option fail_over_mac: invalid value (3)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option ad_select: invalid value (3)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option lacp_rate: mode dependency failed, not supported in mode
> balance-rr(0)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option arp_validate: invalid value (7)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option xmit_hash_policy: invalid value (5)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info>
> [1553514613.8002] manager: (bondscan-liwvMR): new Bond device
> (/org/freedesktop/NetworkManager/Devices/14)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option primary_reselect: invalid value (3)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option arp_all_targets: invalid value (2)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
> bondscan-liwvMR is being deleted...
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
> net_device) (unregistering): Released all slaves
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
> bondscan-liwvMR is being created...
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option fail_over_mac: invalid value (3)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option ad_select: invalid value (3)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option lacp_rate: mode dependency failed, not supported in mode
> active-backup(1)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option arp_validate: invalid value (7)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info>
> [1553514613.8429] manager: (bondscan-liwvMR): new Bond device
> (/org/freedesktop/NetworkManager/Devices/15)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option xmit_hash_policy: invalid value (5)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option primary_reselect: invalid value (3)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR:
> option arp_all_targets: invalid value (2)
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding:
> bondscan-liwvMR is being deleted...
> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered
> net_device) (unregistering): Released all slaves
>
>
> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobczyk(a)redhat.com>
> wrote:
>
>> For the failed job, the engine didn't even try to deploy on host-1:
>>
>>
>>
https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/...
>>
>> Martin, do you know what could be the reason for that?
>>
>> I can see in the logs for both successful and unsuccessful
>> basic-suite-4.3 runs, that there is no 'ntpdate' on host-1:
>>
>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on
lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on
lago-basic-suite-4-3-host-1 returned with 127
>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on
lago-basic-suite-4-3-host-1 errors:
>> bash: ntpdate: command not found
>>
>> On host-0 everything is ok:
>>
>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on
lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on
lago-basic-suite-4-3-host-0 returned with 0
>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on
lago-basic-suite-4-3-host-0 output:
>> 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150
sec
>>
>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>
>> Still fails, now on a different component. ( ovirt-web-ui-extentions )
>>
>>
https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>
>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <danken(a)redhat.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobczyk(a)redhat.com>
>>> wrote:
>>>
>>>> Dafna,
>>>>
>>>> in 'verify_add_hosts' we specifically wait for single host to be
up
>>>> with a timeout:
>>>>
>>>> 144 up_hosts = hosts_service.list(search='datacenter={} AND
status=up'.format(DC_NAME))
>>>> 145 if len(up_hosts):
>>>> 146 return True
>>>>
>>>> The log files say, that it took ~50 secs for one of the hosts to be up
>>>> (seems reasonable) and no timeout is being reported.
>>>> Just after running 'verify_add_hosts', we run
>>>> 'add_master_storage_domain', which calls '_hosts_in_dc'
function.
>>>> That function does the exact same check, but it fails:
>>>>
>>>> 113 hosts = hosts_service.list(search='datacenter={} AND
status=up'.format(dc_name))
>>>> 114 if hosts:
>>>> 115 if random_host:
>>>> 116 return random.choice(hosts)
>>>>
>>>> I don't think it is relevant to our current failure; but I consider
>>> random_host=True as a bad practice. As if we do not have enough moving
>>> parts, we are adding intentional randomness. Reproducibility is far more
>>> important than coverage - particularly for a shared system test like OST.
>>>
>>>> 117 else:
>>>> 118 return sorted(hosts, key=lambda host: host.name)
>>>> 119 raise RuntimeError('Could not find hosts that are up in DC
%s' % dc_name)
>>>>
>>>>
>>>> I'm also not able to reproduce this issue locally on my server. The
>>>> investigation continues...
>>>>
>>>
>>> I think that it would be fair to take the filtering by host state out
>>> of Engine and into the test, where we can easily log the current status of
>>> each host. Then we'd have better understanding on the next failure.
>>>
>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>>
>>>> Hi,
>>>>
>>>> sure, I'm on it - it's weird though, I did ran 4.3 basic suite
for
>>>> this patch manually and everything was ok.
>>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>>
>>>> Hi,
>>>>
>>>> We are failing branch 4.3 for test:
>>>> 002_bootstrap.add_master_storage_domain
>>>>
>>>> It seems that in one of the hosts, the vdsm is not starting
>>>> there is nothing in vdsm.log or in supervdsm.log
>>>>
>>>> CQ identified this patch as the suspected root cause:
>>>>
>>>>
https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for
>>>> flow id
>>>>
>>>> Milan, Marcin, can you please have a look?
>>>>
>>>> full logs:
>>>>
>>>>
>>>>
http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/b...
>>>>
>>>> the only error I can see is about host not being up (makes sense as
>>>> vdsm is not running)
>>>>
>>>> Stacktrace
>>>>
>>>> File "/usr/lib64/python2.7/unittest/case.py", line 369, in
run
>>>> testMethod()
>>>> File "/usr/lib/python2.7/site-packages/nose/case.py", line
197, in runTest
>>>> self.test(*self.arg)
>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py",
line 142, in wrapped_test
>>>> test()
>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py",
line 60, in wrapper
>>>> return func(get_test_prefix(), *args, **kwargs)
>>>> File
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
line 417, in add_master_storage_domain
>>>> add_iscsi_storage_domain(prefix)
>>>> File
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
line 561, in add_iscsi_storage_domain
>>>> host=_random_host_from_dc(api, DC_NAME),
>>>> File
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
line 122, in _random_host_from_dc
>>>> return _hosts_in_dc(api, dc_name, True)
>>>> File
"/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py",
line 119, in _hosts_in_dc
>>>> raise RuntimeError('Could not find hosts that are up in DC
%s' % dc_name)
>>>> 'Could not find hosts that are up in DC test-dc\n--------------------
>> begin captured logging << --------------------\nlago.ssh: DEBUG: start
task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for
lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: end
task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for
lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on
lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: Command c07b5ee2 on
lago-basic-suite-4-3-engine returned with 0\nlago.ssh: DEBUG: Command c07b5ee2 on
lago-basic-suite-4-3-engine output:\n
3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------
>>>>
>>>> _______________________________________________
>>>> Devel mailing list -- devel(a)ovirt.org
>>>> To unsubscribe send an email to devel-leave(a)ovirt.org
>>>> Privacy Statement:
https://www.ovirt.org/site/privacy-policy/
>>>> oVirt Code of Conduct:
>>>>
https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives:
>>>>
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/J4NCHXTK5ZY...
>>>>
>>> _______________________________________________
>>> Devel mailing list -- devel(a)ovirt.org
>>> To unsubscribe send an email to devel-leave(a)ovirt.org
>>> Privacy Statement:
https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct:
>>>
https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>>
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/ULS4OKU2YZF...
>>>
>>
>>
>> --
>>
>> Eyal edri
>>
>>
>> MANAGER
>>
>> RHV/CNV DevOps
>>
>> EMEA VIRTUALIZATION R&D
>>
>>
>> Red Hat EMEA <
https://www.redhat.com/>
>> <
https://red.ht/sig> TRIED. TESTED. TRUSTED.
>> <
https://redhat.com/trusted>
>> phone: +972-9-7692018
>> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>>
>> _______________________________________________
>> Devel mailing list -- devel(a)ovirt.org
>> To unsubscribe send an email to devel-leave(a)ovirt.org
>> Privacy Statement:
https://www.ovirt.org/site/privacy-policy/
>> oVirt Code of Conduct:
>>
https://www.ovirt.org/community/about/community-guidelines/
>> List Archives:
>>
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/XOP6ZCUIDUV...
>>
>