Ok, maybe I've found something. I was comparing two sets of log files -
one for a basic-suite-4.3 run that succeeded and one for a run that did not.
The deployment procedure for host-1 does not complete in the unsuccessful
case. The last log entry is:
2019-03-22 17:54:49,973-04 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy)
[6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host
lago-basic-suite-4-3-host-1. Starting vdsm.
In the successful case we can see:
2019-03-25 06:16:31,789-04 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy)
[3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host
lago-basic-suite-4-3-host-1. Starting vdsm.
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy)
[3add0cac] Got: **%EventEnd STAGE closeup METHOD
otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start
(odeploycons.packages.vdsm.started)
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy)
[3add0cac] Got: **%EventStart STAGE closeup METHOD
otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy)
[3add0cac] Got: ***L:INFO Starting ovirt-vmconsole-host-sshd
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy)
[3add0cac] nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd
2019-03-25 06:16:38,261-04 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy)
[3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host
lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.
2019-03-25 06:16:39,479-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy)
[3add0cac] Got: **%EventEnd STAGE closeup METHOD
otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
...
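
To spot where the two runs diverge quickly, a small hypothetical helper
(not part of OST) can print the last '(VdsDeploy)' entry of each
engine.log, so a successful and a failed run can be compared at a glance:

import sys

# Hypothetical helper (not part of OST): print the last '(VdsDeploy)'
# line of each engine.log given on the command line.
def last_vds_deploy_line(path):
    last = None
    with open(path) as log:
        for line in log:
            if '(VdsDeploy)' in line:
                last = line.rstrip()
    return last

if __name__ == '__main__':
    for path in sys.argv[1:]:
        print('%s: %s' % (path, last_vds_deploy_line(path)))
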
It looks like the ovirt-host-deploy plugin did not respond?
Sandro, are you aware of any issues with the ovirt-host-deploy plugin that
might have caused this?
Dafna, can we temporarily blacklist the latest version of
ovirt-host-deploy to see if it fixes CQ?
Marcin
On 3/25/19 2:45 PM, Dafna Ron wrote:
yes. I suspect this is the problem.
In the host's yum log I can see the times at which the vdsm packages
were installed:
[dron@dron post-002_bootstrap.py]$ less lago-basic-suite-4-3-host-1/_var_log/yum.log | grep vdsm
Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
Mar 25 07:48:52 Installed: vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:00 Installed: vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:42 Installed: vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:57 Installed: vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
Mar 25 07:49:58 Installed: vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:58 Installed: vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch
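
To line these times up with the engine's timeline, a small hypothetical
helper (not existing tooling; the year is assumed, since yum.log
timestamps don't carry one) can parse the install times for comparison
with the engine's 07:48:33 vdsm-id failure quoted below:

from datetime import datetime

# Hypothetical helper: collect yum.log install times so they can be
# compared with timestamps from engine.log. yum.log lines look like:
#   Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
def install_times(yum_log_path, prefix='vdsm', year=2019):
    times = {}
    with open(yum_log_path) as log:
        for line in log:
            stamp, sep, package = line.partition(' Installed: ')
            if sep and package.strip().startswith(prefix):
                times[package.strip()] = datetime.strptime(
                    '%d %s' % (year, stamp), '%Y %b %d %H:%M:%S')
    return times
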
and in the engine log you can see that the engine had already failed
the installation:
2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand
internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type:
ClusterAction group CREATE_HOST with role type ADMIN
On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <dron@redhat.com> wrote:
I am also adding Didi.
Shouldn't this be logged in host-deploy?
On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
Indeed, in my failed job I can see that 'vdsm-python' is
installed *after* the engine's attempt to run 'vdsm-tool':
lago-basic-suite-4-3-host-1/_var_log/yum.log
403:Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
1580:2019-03-22 17:53:03,793-04 DEBUG
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1)
[d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool vdsm-id'
1581:2019-03-22 17:53:04,043-04 WARN
[org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1)
[d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id request on host: Command
returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
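
Exit code 127 is the shell's "command not found", which fits: 'vdsm-tool'
comes with the vdsm-python package, which had not been installed yet at
17:53. A minimal sketch of that probe (assuming passwordless SSH to the
host, as the engine has):

import subprocess

# Sketch of the engine's vdsm-id probe over SSH (assumes passwordless
# SSH access). A return code of 127 means the remote shell could not
# find vdsm-tool, i.e. vdsm-python was not installed yet.
def probe_vdsm_id(host):
    rc = subprocess.call(
        ['ssh', 'root@%s' % host, '/usr/bin/vdsm-tool', 'vdsm-id'])
    if rc == 127:
        print('vdsm-tool not found on %s - packages still installing?' % host)
    return rc
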
On 3/25/19 2:22 PM, Dafna Ron wrote:
> and it's only failing on one host as well.
>
>
> On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk
> <msobczyk@redhat.com> wrote:
>
> Hmmm, that would suggest that the 'vdsm-tool' command is not
> found on the host. There was no recent work done around
> 'vdsm-tool's 'vdsm-id'.
>
> On 3/25/19 2:08 PM, Dafna Ron wrote:
>> Actually, I can see in the engine log that it is trying
>> to start vdsm installation on the host:
>>
>> 2019-03-25 07:48:33,123-04 DEBUG
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] Entered
SsoRestApiAuthFilter
>> 2019-03-25 07:48:33,123-04 DEBUG
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) []
SsoRestApiAuthFilter authenticating with sso
>> 2019-03-25 07:48:33,123-04 DEBUG
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) []
SsoRestApiAuthFilter authenticating using BEARER header
>> 2019-03-25 07:48:33,123-04 DEBUG
[org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) []
SsoRestApiAuthFilter successfully authenticated using BEARER header
>> 2019-03-25 07:48:33,123-04 DEBUG
[org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) []
Entered SsoRestApiNegotiationFilter
>> 2019-03-25 07:48:33,124-04 DEBUG
[org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) []
SsoRestApiNegotiationFilter Not performing Negotiate Auth
>> 2019-03-25 07:48:33,143-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call
string is [{call getclusterforuserbyclustername(?, ?, ?)}]
>> 2019-03-25 07:48:33,143-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure
[GetClusterForUserByClusterName] compiled
>> 2019-03-25 07:48:33,145-04 DEBUG
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName,
NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed:
7ms
>> 2019-03-25 07:48:33,161-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [] Compiled stored procedure. Call string is [{call
getvmstaticbyname(?)}]
>> 2019-03-25 07:48:33,161-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
>> 2019-03-25 07:48:33,206-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call
string is [{call getvdsbyname(?)}]
>> 2019-03-25 07:48:33,206-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure
[GetVdsByName] compiled
>> 2019-03-25 07:48:33,223-04 DEBUG
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params:
[lago-basic-suite-4-3-host-1], timeElapsed: 20ms
>> 2019-03-25 07:48:33,225-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call
string is [{call getvdsbyhostname(?)}]
>> 2019-03-25 07:48:33,225-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure
[GetVdsByHostName] compiled
>> 2019-03-25 07:48:33,252-04 DEBUG
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params:
[lago-basic-suite-4-3-host-1], timeElapsed: 28ms
>> 2019-03-25 07:48:33,254-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call
string is [{call getstorage_poolsbyclusterid(?)}]
>> 2019-03-25 07:48:33,254-04 DEBUG
[org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
(default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure
[Getstorage_poolsByClusterId] compiled
>> 2019-03-25 07:48:33,262-04 DEBUG
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] Connecting
'root@lago-basic-suite-4-3-host-1'
>> 2019-03-25 07:48:33,461-04 DEBUG
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] Connected:
'root@lago-basic-suite-4-3-host-1'
>> 2019-03-25 07:48:33,467-04 DEBUG
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating:
'root@lago-basic-suite-4-3-host-1'
>> 2019-03-25 07:48:33,599-04 DEBUG
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated:
'root@lago-basic-suite-4-3-host-1'
>> 2019-03-25 07:48:33,600-04 DEBUG
[org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id'
>> 2019-03-25 07:48:33,902-04 WARN
[org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on host: Command
returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>> 2019-03-25 07:48:33,920-04 INFO
[org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1)
[6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false.
Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group
CREATE_HOST with role type ADMIN
>>
>>
>>
>> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron
>> <dron@redhat.com> wrote:
>>
>> we have another failing patch with the same test:
>>
http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
>>
>> It's obviously not related to the patch, but there is
>> something that is causing these failures randomly.
>> From what I can see in the current failed job, you
>> are correct: the engine does not even try to deploy
>> host-1.
>>
>> I can see host-1 is getting error 127 in lago for
>> ntpdate, and I can see NetworkManager errors in the
>> host's messages log.
>>
>> In the host's messages log I can see several messages
>> that I suspect may cause issues in communication
>> between the engine and the host:
>>
>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1
>> sasldblistusers2: _sasldb_getkeyhandle has failed
>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1
>> saslpasswd2: error deleting entry from sasldb:
>> BDB0073 DB_NOTFOUND: No matching key/data pair found
>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1
>> saslpasswd2: error deleting entry from sasldb:
>> BDB0073 DB_NOTFOUND: No matching key/data pair found
>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1
>> saslpasswd2: error deleting entry from sasldb:
>> BDB0073 DB_NOTFOUND: No matching key/data pair found
>>
>>
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bonding: bondscan-UMJa2S is being created...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bonding: bondscan-UMJa2S is being deleted...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1
>> NetworkManager[2658]: <info> [1553514613.7774]
>> manager: (bondscan-UMJa2S): new Bond device
>> (/org/freedesktop/NetworkManager/Devices/13)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> (unregistered net_device) (unregistering): Released
>> all slaves
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bonding: bondscan-liwvMR is being created...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option fail_over_mac: invalid value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option ad_select: invalid value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option lacp_rate: mode dependency
>> failed, not supported in mode balance-rr(0)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option arp_validate: invalid value (7)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option xmit_hash_policy: invalid
>> value (5)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1
>> NetworkManager[2658]: <info> [1553514613.8002]
>> manager: (bondscan-liwvMR): new Bond device
>> (/org/freedesktop/NetworkManager/Devices/14)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option primary_reselect: invalid
>> value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option arp_all_targets: invalid
>> value (2)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bonding: bondscan-liwvMR is being deleted...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> (unregistered net_device) (unregistering): Released
>> all slaves
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bonding: bondscan-liwvMR is being created...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option fail_over_mac: invalid value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option ad_select: invalid value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option lacp_rate: mode dependency
>> failed, not supported in mode active-backup(1)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option arp_validate: invalid value (7)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1
>> NetworkManager[2658]: <info> [1553514613.8429]
>> manager: (bondscan-liwvMR): new Bond device
>> (/org/freedesktop/NetworkManager/Devices/15)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option xmit_hash_policy: invalid
>> value (5)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option primary_reselect: invalid
>> value (3)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bondscan-liwvMR: option arp_all_targets: invalid
>> value (2)
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> bonding: bondscan-liwvMR is being deleted...
>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel:
>> (unregistered net_device) (unregistering): Released
>> all slaves
>>
>>
>> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk
>> <msobczyk@redhat.com> wrote:
>>
>> For the failed job, the engine didn't even try
>> to deploy on host-1:
>>
>>
https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/...
>>
>> Martin, do you know what could be the reason for
>> that?
>>
>> I can see in the logs of both the successful and
>> the unsuccessful basic-suite-4.3 runs that there is
>> no 'ntpdate' on host-1:
>>
>> 2019-03-25
10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on
lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
>> 2019-03-25
10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on
lago-basic-suite-4-3-host-1 returned with 127
>> 2019-03-25
10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on
lago-basic-suite-4-3-host-1 errors:
>> bash: ntpdate: command not found
>>
>> On host-0 everything is ok:
>>
>> 2019-03-25
10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on
lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
>> 2019-03-25
10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on
lago-basic-suite-4-3-host-0 returned with 0
>> 2019-03-25
10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on
lago-basic-suite-4-3-host-0 output:
>> 25 Mar 06:14:53 ntpdate[6646]: adjust time server
192.168.202.2 offset 0.017150 sec
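>>
>> A hypothetical guard for the suite (a Python 3 sketch, not current
>> OST/lago code) would be to check for the binary up front instead of
>> surfacing a bare exit status 127 from the shell:
>>
>> import shutil
>> import subprocess
>>
>> # Hypothetical guard, not current OST/lago code: fail with a clear
>> # message when ntpdate is missing instead of a bare exit status 127.
>> def sync_time(server):
>>     if shutil.which('ntpdate') is None:
>>         raise RuntimeError('ntpdate is not installed on this host')
>>     return subprocess.call(['ntpdate', '-4', server])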
>>
>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>> Still fails, now on a different component
>>> (ovirt-web-ui-extentions):
>>>
>>>
https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>>
>>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg
>>> <danken@redhat.com> wrote:
>>>
>>>
>>>
>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin
>>> Sobczyk <msobczyk@redhat.com> wrote:
>>>
>>> Dafna,
>>>
>>> in 'verify_add_hosts' we specifically
>>> wait for a single host to be up, with a
>>> timeout:
>>>
>>> up_hosts = hosts_service.list(
>>>     search='datacenter={} AND status=up'.format(DC_NAME))
>>> if len(up_hosts):
>>>     return True
>>>
>>> The log files say that it took ~50
>>> seconds for one of the hosts to be up
>>> (which seems reasonable) and no timeout
>>> is reported.
>>> Just after running 'verify_add_hosts',
>>> we run 'add_master_storage_domain',
>>> which calls the '_hosts_in_dc' function.
>>> That function does the exact same
>>> check, but it fails:
>>>
>>> hosts = hosts_service.list(
>>>     search='datacenter={} AND status=up'.format(dc_name))
>>> if hosts:
>>>     if random_host:
>>>         return random.choice(hosts)
>>>
>>> I don't think it is relevant to our current
>>> failure, but I consider random_host=True
>>> a bad practice. As if we do not have enough
>>> moving parts, we are adding intentional
>>> randomness. Reproducibility is far more
>>> important than coverage - particularly for
>>> a shared system test like OST.
>>>
>>>     else:
>>>         return sorted(hosts, key=lambda host: host.name)
>>> raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>
>>> I'm also not able to reproduce this
>>> issue locally on my server. The
>>> investigation continues...
>>>
>>>
>>> I think that it would be fair to take the
>>> filtering by host state out of Engine and
>>> into the test, where we can easily log the
>>> current status of each host. Then we'd have
>>> a better understanding of the next failure.
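>>>
>>> A rough sketch of that change (hypothetical, not the current test
>>> code; it assumes the ovirtsdk4 types module and a module-level
>>> logger, and drops the random choice along the way):
>>>
>>> import logging
>>> import ovirtsdk4.types as types
>>>
>>> LOGGER = logging.getLogger(__name__)
>>>
>>> def _hosts_in_dc(api, dc_name):
>>>     # Fetch every host in the DC and filter by status here, so the
>>>     # current status of each host gets logged on every check.
>>>     hosts_service = api.system_service().hosts_service()
>>>     all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
>>>     for host in all_hosts:
>>>         LOGGER.debug('host %s is %s', host.name, host.status)
>>>     up_hosts = [h for h in all_hosts if h.status == types.HostStatus.UP]
>>>     if up_hosts:
>>>         # Deterministic order instead of random.choice(hosts).
>>>         return sorted(up_hosts, key=lambda host: host.name)
>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)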
>>>
>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>>
>>>> Hi,
>>>>
>>>> sure, I'm on it - it's weird though, I
>>>> did run the 4.3 basic suite for this patch
>>>> manually and everything was OK.
>>>>
>>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>>> Hi,
>>>>>
>>>>> We are failing branch 4.3 for test:
>>>>> 002_bootstrap.add_master_storage_domain
>>>>>
>>>>> It seems that on one of the hosts
>>>>> vdsm is not starting; there is nothing
>>>>> in vdsm.log or in supervdsm.log.
>>>>>
>>>>> CQ identified this patch as the
>>>>> suspected root cause:
>>>>>
>>>>>
https://gerrit.ovirt.org/#/c/98748/ -
>>>>> vdsm: client: Add support for flow id
>>>>>
>>>>> Milan, Marcin, can you please have a
>>>>> look?
>>>>>
>>>>> full logs:
>>>>>
>>>>>
http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/b...
>>>>>
>>>>> the only error I can see is about the
>>>>> host not being up (which makes sense,
>>>>> as vdsm is not running)
>>>>>
>>>>>
>>>>> Stacktrace
>>>>>
>>>>> File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>>>   testMethod()
>>>>> File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>>>>   self.test(*self.arg)
>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
>>>>>   test()
>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
>>>>>   return func(get_test_prefix(), *args, **kwargs)
>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
>>>>>   add_iscsi_storage_domain(prefix)
>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
>>>>>   host=_random_host_from_dc(api, DC_NAME),
>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
>>>>>   return _hosts_in_dc(api, dc_name, True)
>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
>>>>>   raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>> 'Could not find hosts that are up in DC test-dc
>>>>> -------------------- >> begin captured logging << --------------------
>>>>> lago.ssh: DEBUG: starttask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:
>>>>> lago.ssh: DEBUG: endtask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:
>>>>> lago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt
>>>>> lago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0
>>>>> lago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:
>>>>>   3600140516f88cafa71243648ea218995
>>>>>   360014053e28f60001764fed9978ec4b3
>>>>>   360014059edc777770114a6484891dcf1
>>>>>   36001405d93d8585a50d43a4ad0bd8d19
>>>>>   36001405e31361631de14bcf87d43e55a
>>>>> -----------
>>>
>>>
>>>
>>> --
>>> Eyal edri
>>>
>>> MANAGER
>>> RHV/CNV DevOps
>>> EMEA VIRTUALIZATION R&D
>>>
>>> Red Hat EMEA <https://www.redhat.com/>
>>> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
>>>
>>> phone: +972-9-7692018
>>> irc: eedri (on #tlv #rhev-dev #rhev-integ)