Hi,
just to keep everyone up to date - Ravi wrote a patch that will most probably
resolve this issue.
Dafna, please keep an eye on the results of the upcoming CQ pipelines and
give us feedback on whether the patch helped.
Thanks, Marcin
On 3/29/19 3:44 PM, Dafna Ron wrote:
Ok, thanks.
On Fri, Mar 29, 2019 at 11:20 AM Martin Perina <mperina@redhat.com> wrote:
On Fri, 29 Mar 2019, 11:54 Dafna Ron, <dron@redhat.com> wrote:
Thanks Martin.
I will follow once merged and report the next failure so we
can have a better view.
I'm not going to merge it, but I will try to execute OST with it
until failure.
On Fri, Mar 29, 2019 at 10:07 AM Martin Perina <mperina@redhat.com> wrote:
On Fri, Mar 29, 2019 at 11:04 AM Dafna Ron <dron@redhat.com> wrote:
Thank you Marcin :) I am also adding Didi to see if he
knows of any changes.
Anton, who from the engine/API side can help take it from here?
On Fri, Mar 29, 2019 at 9:37 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:
Hi,
so the thing boils down to this:
https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/471/artifact/...
This is the 'httpd' log file from the most recent job that failed
'add_master_storage_domain'.
At the end of the log file, we can see:
192.168.202.1 - - [28/Mar/2019:11:05:38 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11845
192.168.202.1 - - [28/Mar/2019:11:05:41 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11847
192.168.202.1 - - [28/Mar/2019:11:05:44 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11847
192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 5959
192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 7826
192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api HTTP/1.1" 200 5112
192.168.202.1 - - [28/Mar/2019:11:05:48 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 65
192.168.202.2 - - [28/Mar/2019:11:05:50 -0400] "POST /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 223
The first batch of 'status=installing or status=initializing' requests and
the first 'status=up' request come from 'verify_add_hosts'. The second
'status=up' request is the one that comes from the '_hosts_in_dc' call in
'add_master_storage_domain'. They are executed one second apart, yet the
results are not the same - 7826 bytes vs 65 bytes.
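
These two calls are easy to replay outside of OST while an engine is still up;
a minimal sketch with ovirt-engine-sdk4, where the URL and credentials are
placeholders rather than the suite's actual values:

    # Replays the two consecutive 'status=up' searches from the httpd log above,
    # one second apart, to check whether the API can return inconsistent results.
    # Sketch only: the engine URL and credentials below are placeholders.
    import time
    import ovirtsdk4 as sdk

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder
        username='admin@internal',
        password='secret',  # placeholder
        insecure=True,  # OST engines use a self-signed certificate
    )
    hosts_service = connection.system_service().hosts_service()
    first = hosts_service.list(search='datacenter=test-dc AND status=up')
    time.sleep(1)
    second = hosts_service.list(search='datacenter=test-dc AND status=up')
    print([h.name for h in first])
    print([h.name for h in second])
    connection.close()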
I went through every line of 'engine.log' from that 1-second interval and
couldn't find anything that would indicate the host going down.
I asked Ravi for help yesterday and found out that these requests translate
into queries against our postgres db, and that the results are somehow cached.
I see 2 options here - either something really changed the db contents during
that 1 second, or there's a bug in our API/cache that returns the wrong info.
I think we can eliminate the first option by adding collection of the
postgres db to our OST suites and looking at its history after a failure.
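
Roughly, the collection could start as simple as dumping host statuses on the
engine VM right after a failure - a sketch that assumes the standard 'engine'
database and its 'vds' view with 'vds_name'/'status' columns; adjust if the
schema differs:

    # Dumps host names and statuses straight from the engine's postgres db,
    # bypassing the REST API and whatever caching sits in front of it.
    # The db name 'engine' and the 'vds' view are assumptions about the schema.
    import subprocess

    query = 'SELECT vds_name, status FROM vds;'
    print(subprocess.check_output(
        ['su', '-', 'postgres', '-c', 'psql -d engine -c "{0}"'.format(query)]))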
I don't really have a solution for the second possibility - any help from
someone who knows the engine's code better would be appreciated.
I've created https://gerrit.ovirt.org/99027 to add logging of the hosts and
statuses returned by those search queries, so we should have more data on
what exactly is returned.
On 3/26/19 5:37 PM, Dafna Ron wrote:
> We never remove any packages from tested :)
> I am actually suspecting a problem that is caused by something that is
> shared by all projects, like the API.
> We currently have 3 different tests that are randomly failing because of
> engine/host communication, and I am starting to think they are all connected.
>
>
> On Tue, Mar 26, 2019 at 2:03 PM Eyal Edri <eedri@redhat.com> wrote:
>
>
>
> On Tue, Mar 26, 2019 at 3:56 PM Sandro Bonazzola <sbonazzo@redhat.com> wrote:
>
>
>
> On Tue, Mar 26, 2019 at 2:42 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
>
> Ok, maybe I've found something. I was comparing two sets of log files - one
> for a basic-suite-4.3 run that succeeded and one for a run that did not.
>
> The deployment procedure for host-1 is not complete in the unsuccessful
> case. The last log entry is:
>
> 2019-03-22 17:54:49,973-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
>
> In the successful case we can see:
>
> 2019-03-25 06:16:31,789-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start (odeploycons.packages.vdsm.started)
> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventStart STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: ***L:INFO Starting ovirt-vmconsole-host-sshd
> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd
> 2019-03-25 06:16:38,261-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.
> 2019-03-25 06:16:39,479-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
> ...
>
> It looks like the ovirt-host-deploy plugin did not respond?
>
> The timing of:
>
> https://gerrit.ovirt.org/#/c/98762/
>
> kinda aligns with my patch, which was merged for vdsm-4.3 and was a suspect:
>
> https://gerrit.ovirt.org/#/c/98748/
>
> Sandro, are you aware of any issues in the ovirt-host-deploy plugin that
> might have caused these?
>
>
> I don't really see how a patch touching the automation configuration for
> STDCI v2 can be the cause of this, but please go ahead and blacklist the
> last build.
> It should differ from the previous one only by the timestamp of the rpm :-)
>
Ehhhh, yeah, I'm so tired of looking at log files that I'm starting to see
patterns everywhere...
>
> If you want to verify that a specific version of an RPM works, you can use
> the manual job for it.
>
> Dafna, can we temporarily blacklist
> the latest version of
> ovirt-host-deploy to see if it fixes CQ?
>
> Marcin
>
> On 3/25/19 2:45 PM, Dafna Ron wrote:
>> yes. I suspect this is the problem.
>> In the host's yum log I can see the times the vdsm packages were installed:
>> [dron@dron post-002_bootstrap.py]$ less lago-basic-suite-4-3-host-1/_var_log/yum.log | grep vdsm
>> Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:48:52 Installed: vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:00 Installed: vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
>> Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:42 Installed: vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:57 Installed: vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
>> Mar 25 07:49:58 Installed: vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
>> Mar 25 07:49:58 Installed: vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch
>>
>> and in the engine you can see that the engine had already failed the installation:
>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>>
>> On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <dron@redhat.com> wrote:
>>
>> I am also adding Didi - should this not be logged in host-deploy?
>>
>>
>> On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
>>
>> Indeed, in my failed job I can see that 'vdsm-python' is installed *after*
>> the engine's attempt to run 'vdsm-tool':
>>
>> lago-basic-suite-4-3-host-1/_var_log/yum.log
>> 403:Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>>
>> lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
>> 1580:2019-03-22 17:53:03,793-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool vdsm-id'
>> 1581:2019-03-22 17:53:04,043-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
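
The ordering is easy to confirm mechanically once both files are pulled from
the job artifacts; a sketch using the two lines quoted above, with the year
hardcoded because yum.log omits it:

    # Confirms the race: the engine ran 'vdsm-tool vdsm-id' before yum finished
    # installing vdsm-python on the host. Both logs use the host's local time
    # (-0400), so naive datetimes compare fine.
    from datetime import datetime

    yum_line = 'Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch'
    installed_at = datetime.strptime('2019 ' + yum_line[:15], '%Y %b %d %H:%M:%S')

    engine_line = "2019-03-22 17:53:03,793-04 DEBUG ... Executing: '/usr/bin/vdsm-tool vdsm-id'"
    attempted_at = datetime.strptime(engine_line[:19], '%Y-%m-%d %H:%M:%S')

    assert attempted_at < installed_at
    print('vdsm-tool was attempted %s before vdsm-python landed'
          % (installed_at - attempted_at))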
>>
>>
>> On 3/25/19 2:22 PM, Dafna
>> Ron wrote:
>>> and it's only failing on one host as well.
>>>
>>>
>>> On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
>>>
>>> Hmmm, that would suggest that the 'vdsm-tool' command is not found on the
>>> host (failure code 127 is the shell's 'command not found'). There was no
>>> work done around 'vdsm-tool's 'vdsm-id' recently.
>>>
>>> On 3/25/19 2:08 PM, Dafna Ron wrote:
>>>> Actually, I can see in the engine that it is trying to start the vdsm
>>>> installation on the host:
>>>>
>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] Entered SsoRestApiAuthFilter
>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating with sso
>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating using BEARER header
>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER header
>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] Entered SsoRestApiNegotiationFilter
>>>> 2019-03-25 07:48:33,124-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
>>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
>>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetClusterForUserByClusterName] compiled
>>>> 2019-03-25 07:48:33,145-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: 7ms
>>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] Compiled stored procedure. Call string is [{call getvmstaticbyname(?)}]
>>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
>>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyname(?)}]
>>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByName] compiled
>>>> 2019-03-25 07:48:33,223-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms
>>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyhostname(?)}]
>>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByHostName] compiled
>>>> 2019-03-25 07:48:33,252-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
>>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
>>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [Getstorage_poolsByClusterId] compiled
>>>> 2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 'root@lago-basic-suite-4-3-host-1'
>>>> 2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 'root@lago-basic-suite-4-3-host-1'
>>>> 2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 'root@lago-basic-suite-4-3-host-1'
>>>> 2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 'root@lago-basic-suite-4-3-host-1'
>>>> 2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id'
>>>> 2019-03-25 07:48:33,902-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>>>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>>>>
>>>>
>>>>
>>>> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <dron@redhat.com> wrote:
>>>>
>>>> we have another failing patch with the same test:
>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
>>>>
>>>> It's obviously not related to the patch, but there is something that is
>>>> causing these failures randomly. From what I can see in the current
>>>> failed job, you are correct about host-1 - the engine does not even try
>>>> to deploy it.
>>>>
>>>> I can see host-1 getting error 127 in lago for ntpdate, and I can see
>>>> NetworkManager errors in the host's messages log.
>>>>
>>>> In the host's messages log I can see several messages that I suspect may
>>>> cause issues in communication between the engine and the host:
>>>>
>>>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2: _sasldb_getkeyhandle has failed
>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>>
>>>> ces/12)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being created...
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being deleted...
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device (/org/freedesktop/NetworkManager/Devices/13)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode balance-rr(0)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8002] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode active-backup(1)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8429] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>>
>>>>
>>>> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:
>>>>
>>>> For the failed job, the engine didn't even try to deploy on host-1:
>>>>
>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/...
>>>>
>>>> Martin, do you know what could be the reason for that?
>>>>
>>>> I can see in the logs for both successful and unsuccessful basic-suite-4.3
>>>> runs that there is no 'ntpdate' on host-1:
>>>>
>>>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
>>>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
>>>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 errors:
>>>> bash: ntpdate: command not found
>>>>
>>>> On host-0 everything is ok:
>>>>
>>>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
>>>> 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
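
A cheap guard against this class of problem would be probing for required
commands on each host before the suite depends on them; a hedged sketch with
plain ssh - the host names are from this run, but the helper and its wiring
are illustrative, not lago's actual API:

    # Fails fast with a readable message when a required command is missing on
    # a host, instead of surfacing later as a bare exit code 127 in lago logs.
    import subprocess

    def command_exists(host, command):
        # 'command -v' exits non-zero when the shell cannot resolve the command.
        return subprocess.call(['ssh', 'root@' + host, 'command -v ' + command]) == 0

    for host in ('lago-basic-suite-4-3-host-0', 'lago-basic-suite-4-3-host-1'):
        if not command_exists(host, 'ntpdate'):
            raise RuntimeError('%s is missing ntpdate' % host)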
>>>>
>>>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>>>> Still fails, now on a different component (ovirt-web-ui-extentions):
>>>>>
>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>>>>
>>>>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <danken@redhat.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
>>>>>
>>>>> Dafna,
>>>>>
>>>>> in 'verify_add_hosts' we specifically wait for a single host to be up,
>>>>> with a timeout:
>>>>>
>>>>> 144     up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
>>>>> 145     if len(up_hosts):
>>>>> 146         return True
>>>>>
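
For reference, that wait is just a poll of the same search until it returns
something or a deadline passes; a generic sketch of the shape - the function
name and timeout are illustrative, the suite actually uses its own testlib
helpers:

    # Generic shape of "wait until at least one host in the DC is up, or time out".
    # 'hosts_service' is the ovirt-engine-sdk4 hosts service the suite already holds.
    import time

    def wait_for_up_host(hosts_service, dc_name, timeout=600, interval=3):
        deadline = time.time() + timeout
        while time.time() < deadline:
            up_hosts = hosts_service.list(
                search='datacenter={} AND status=up'.format(dc_name))
            if up_hosts:
                return up_hosts
            time.sleep(interval)
        raise RuntimeError('no host came up in DC %s within %ss' % (dc_name, timeout))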
>>>>> The log files say that it took ~50 secs for one of the hosts to be up
>>>>> (seems reasonable) and no timeout is being reported. Just after running
>>>>> 'verify_add_hosts', we run 'add_master_storage_domain', which calls the
>>>>> '_hosts_in_dc' function. That function does the exact same check, but
>>>>> it fails:
>>>>>
>>>>> 113     hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
>>>>> 114     if hosts:
>>>>> 115         if random_host:
>>>>> 116             return random.choice(hosts)
>>>>>
>>>>> I don't think it is relevant to our current failure, but I consider
>>>>> random_host=True a bad practice. As if we do not have enough moving
>>>>> parts, we are adding intentional randomness. Reproducibility is far more
>>>>> important than coverage - particularly for a shared system test like OST.
>>>>>
>>>>> 117         else:
>>>>> 118             return sorted(hosts, key=lambda host: host.name)
>>>>> 119     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>
>>>>> I'm also not able to reproduce this issue locally on my server. The
>>>>> investigation continues...
>>>>>
>>>>> I think that it would be fair to take the filtering by host state out of
>>>>> Engine and into the test, where we can easily log the current status of
>>>>> each host. Then we'd have a better understanding on the next failure.
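
A sketch of what that could look like in '_hosts_in_dc' - query without the
status filter, log what the API reports for every host, then filter and pick
deterministically; the logger and exact shape are illustrative, not the
suite's current code:

    # Variant of _hosts_in_dc that logs each host's reported status and never
    # uses random.choice(). Illustrative only - not the current 002_bootstrap.py.
    import logging

    from ovirtsdk4 import types

    LOGGER = logging.getLogger(__name__)

    def _hosts_in_dc(api, dc_name):
        hosts_service = api.system_service().hosts_service()
        all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
        for host in all_hosts:
            LOGGER.info('host %s status: %s', host.name, host.status)
        up_hosts = [h for h in all_hosts if h.status == types.HostStatus.UP]
        if not up_hosts:
            raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
        # Deterministic: always the first host by name.
        return sorted(up_hosts, key=lambda host: host.name)[0]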
>>>>>
>>>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> sure, I'm on it - it's weird though, I did run the 4.3 basic suite for
>>>>>> this patch manually and everything was ok.
>>>>>>
>>>>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are failing branch 4.3 for test: 002_bootstrap.add_master_storage_domain
>>>>>>>
>>>>>>> It seems that in one of the hosts vdsm is not starting; there is
>>>>>>> nothing in vdsm.log or in supervdsm.log.
>>>>>>>
>>>>>>> CQ identified this patch as the suspected root cause:
>>>>>>>
>>>>>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for flow id
>>>>>>>
>>>>>>> Milan, Marcin, can you please have a look?
>>>>>>>
>>>>>>> full logs:
>>>>>>>
>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/b...
>>>>>>>
>>>>>>> the only error I can see is about the host not being up (makes sense,
>>>>>>> as vdsm is not running)
>>>>>>>
>>>>>>> Stacktrace
>>>>>>>
>>>>>>>   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>>>>>     testMethod()
>>>>>>>   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>>>>>>     self.test(*self.arg)
>>>>>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
>>>>>>>     test()
>>>>>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
>>>>>>>     return func(get_test_prefix(), *args, **kwargs)
>>>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
>>>>>>>     add_iscsi_storage_domain(prefix)
>>>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
>>>>>>>     host=_random_host_from_dc(api, DC_NAME),
>>>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
>>>>>>>     return _hosts_in_dc(api, dc_name, True)
>>>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
>>>>>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>> 'Could not find hosts that are up in DC test-dc\n-------------------- >> begin captured logging << --------------------\nlago.ssh: DEBUG: starttask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: endtask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:\n 3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------
>>>>> --
>>>>> Eyal Edri
>>>>> MANAGER, RHV/CNV DevOps, EMEA VIRTUALIZATION R&D
>>>>> Red Hat EMEA <https://www.redhat.com/>
>>>>> phone: +972-9-7692018
>>>>> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>
>
> --
> SANDRO BONAZZOLA
> MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
> Red Hat EMEA <https://www.redhat.com/>
> sbonazzo@redhat.com
>