This is the 'httpd' log file from the most recent job that failed
'add_master_storage_domain'.
At the end of the log file we can see:
192.168.202.1 - - [28/Mar/2019:11:05:38 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11845
192.168.202.1 - - [28/Mar/2019:11:05:41 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11847
192.168.202.1 - - [28/Mar/2019:11:05:44 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11847
192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 5959
192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 7826
192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api HTTP/1.1" 200 5112
192.168.202.1 - - [28/Mar/2019:11:05:48 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 65
192.168.202.2 - - [28/Mar/2019:11:05:50 -0400] "POST /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 223
The first batch of 'status=installing or status=initializing' requests
and the first 'status=up' request come from 'verify_add_hosts'.
The second 'status=up' request is the one issued by the '_hosts_in_dc'
call in 'add_master_storage_domain'.
The two 'status=up' requests are executed just 1 second apart, yet the
results are not the same - 7826 bytes vs 65 bytes.
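For reference, here is a minimal sketch of how those two 'status=up' searches
could be replayed by hand with the Python SDK, logging what each attempt
returns (the engine URL and credentials below are placeholders, not values
taken from this job):

    import time
    import ovirtsdk4 as sdk

    # Placeholder connection details - point these at the engine under test.
    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        insecure=True,
    )
    hosts_service = connection.system_service().hosts_service()

    # Issue the same search twice, ~1 second apart, like the two test steps do.
    for attempt in range(2):
        up_hosts = hosts_service.list(search='datacenter=test-dc AND status=up')
        print('attempt %d: %s' % (attempt, [(h.name, str(h.status)) for h in up_hosts]))
        time.sleep(1)

    connection.close()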
I went through every line of 'engine.log' from that 1-second interval and
couldn't find anything that would indicate the host going down.
I asked Ravi for help yesterday and found out that the results of these
requests come from queries to our Postgres DB and are somehow cached.
I see 2 options here - either something really changed the DB contents
during that 1 second, or there's a bug in our API/cache that returns the
wrong info.
I think we can eliminate the first option by adding collection of the
Postgres DB to our OST suites and looking at its history after a failure.
I don't really have a solution for the second possibility - any help
from someone who knows the engine's code better would be appreciated.
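As a rough illustration of the first option, the post-failure collection
could dump the engine DB with something like the snippet below (assuming the
default 'engine' database readable by the 'postgres' system user; the output
path and function name are made up for the example):

    import subprocess

    def collect_engine_db(output_path='/var/log/ost-artifacts/engine-db.dump'):
        """Dump the engine DB on the engine VM so its contents can be inspected after a failure."""
        subprocess.check_call([
            'sudo', '-u', 'postgres',
            'pg_dump', '--format=custom', '--file', output_path, 'engine',
        ])

    if __name__ == '__main__':
        collect_engine_db()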
On 3/26/19 5:37 PM, Dafna Ron wrote:
We never remove any packages from 'tested' :)
I am actually suspecting a problem that is caused by something shared by
all projects, like the API.
We currently have 3 different tests that are randomly failing because of
engine/host communication, and I am starting to think they are all
connected.
On Tue, Mar 26, 2019 at 2:03 PM Eyal Edri <eedri(a)redhat.com
<mailto:eedri@redhat.com>> wrote:
On Tue, Mar 26, 2019 at 3:56 PM Sandro Bonazzola
<sbonazzo(a)redhat.com <mailto:sbonazzo@redhat.com>> wrote:
Il giorno mar 26 mar 2019 alle ore 14:42 Marcin Sobczyk
<msobczyk(a)redhat.com <mailto:msobczyk@redhat.com>> ha scritto:
Ok, maybe I've found something. I was comparing two sets
of log files - one for a basic-suite-4.3 run that succeeded
and one that did not.
The deployment procedure for host-1 is not complete in the
unsuccessful case. The last log entry is:
2019-03-22 17:54:49,973-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
In the successful case we can see:
2019-03-25 06:16:31,789-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start (odeploycons.packages.vdsm.started)
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventStart STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: ***L:INFO Starting ovirt-vmconsole-host-sshd
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd
2019-03-25 06:16:38,261-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.
2019-03-25 06:16:39,479-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
...
It looks like the ovirt-host-deploy plugin did not respond?
The timing of:
https://gerrit.ovirt.org/#/c/98762/
kinda aligns with my patch, which was merged for vdsm-4.3
and was a suspect:
https://gerrit.ovirt.org/#/c/98748/
Sandro, are you aware of any issues in the ovirt-host-deploy
plugin that might have caused these?
I don't really see how a patch touching the automation
configuration for STDCI v2 can be the cause of this, but please
go ahead and blacklist the last build.
It should differ from the previous one only by the timestamp of
the RPM :-)
Ehhhh, yeah, I'm so tired of looking at log files I'm starting to see
patterns everywhere...
If you want to verify that a specific version of an RPM works, you can
use the manual job for it.
Dafna, can we temporarily blacklist the latest version of
ovirt-host-deploy to see if it fixes CQ?
Marcin
On 3/25/19 2:45 PM, Dafna Ron wrote:
> Yes, I suspect this is the problem.
> On the host, I can see in the yum log the times at which the vdsm
> packages were installed:
> [dron@dron post-002_bootstrap.py]$ less lago-basic-suite-4-3-host-1/_var_log/yum.log |grep vdsm
> Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:48:52 Installed: vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:00 Installed: vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
> Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:42 Installed: vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:57 Installed: vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
> Mar 25 07:49:58 Installed: vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
> Mar 25 07:49:58 Installed: vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch
>
> and in the engine log you can see that the engine has already failed the
> installation:
> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>
> On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron
> <dron(a)redhat.com <mailto:dron@redhat.com>> wrote:
>
> I am also adding didi
> should this not be logged in host-deploy?
>
>
> On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk
> <msobczyk(a)redhat.com <mailto:msobczyk@redhat.com>>
wrote:
>
> Indeed, in my failed job I can see that
> 'vdsm-python' is installed *after* the engine's
> attempt to run 'vdsm-tool':
>
> lago-basic-suite-4-3-host-1/_var_log/yum.log
> 403:Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>
> lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
> 1580:2019-03-22 17:53:03,793-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool vdsm-id'
> 1581:2019-03-22 17:53:04,043-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>
> On 3/25/19 2:22 PM, Dafna Ron wrote:
>> And it's only failing on one host as well.
>>
>>
>> On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk
>> <msobczyk(a)redhat.com
>> <mailto:msobczyk@redhat.com>> wrote:
>>
>> Hmmm, that would suggest that the 'vdsm-tool'
>> command is not found on the host. There has been
>> no recent work around 'vdsm-tool's 'vdsm-id'.
>>
>> On 3/25/19 2:08 PM, Dafna Ron wrote:
>>> Actually, I can see in the engine log that it is
>>> trying to start the vdsm installation on the host:
>>>
>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] Entered SsoRestApiAuthFilter
>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating with sso
>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating using BEARER header
>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER header
>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] Entered SsoRestApiNegotiationFilter
>>> 2019-03-25 07:48:33,124-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetClusterForUserByClusterName] compiled
>>> 2019-03-25 07:48:33,145-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: 7ms
>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] Compiled stored procedure. Call string is [{call getvmstaticbyname(?)}]
>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyname(?)}]
>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByName] compiled
>>> 2019-03-25 07:48:33,223-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms
>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyhostname(?)}]
>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByHostName] compiled
>>> 2019-03-25 07:48:33,252-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [Getstorage_poolsByClusterId] compiled
>>> 2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 'root@lago-basic-suite-4-3-host-1'
>>> 2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 'root@lago-basic-suite-4-3-host-1'
>>> 2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 'root@lago-basic-suite-4-3-host-1'
>>> 2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 'root@lago-basic-suite-4-3-host-1'
>>> 2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id'
>>> 2019-03-25 07:48:33,902-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>>>
>>>
>>>
>>> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron
>>> <dron(a)redhat.com
<mailto:dron@redhat.com>>
>>> wrote:
>>>
>>> we have another failing patch with the same test:
>>>
>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
>>>
>>> It's obviously not related to the patch, but there is something that is
>>> causing these failures randomly.
>>> From what I can see in the current failed job, you are correct and the
>>> engine does not even try to deploy host-1.
>>>
>>> I can see host-1 is getting error 127 in lago for ntpd, and I can see
>>> NetworkManager errors in the host's messages log.
>>>
>>> In the host's messages log I can see several messages that I suspect may
>>> cause issues in communication between the engine and the host:
>>>
>>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2: _sasldb_getkeyhandle has failed
>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>
>>>
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being created...
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being deleted...
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device (/org/freedesktop/NetworkManager/Devices/13)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode balance-rr(0)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8002] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode active-backup(1)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8429] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>
>>>
>>> On Mon, Mar 25, 2019 at 11:24 AM Marcin
>>> Sobczyk <msobczyk(a)redhat.com
>>> <mailto:msobczyk@redhat.com>> wrote:
>>>
>>> For the failed job, the engine didn't even try to deploy on host-1:
>>>
>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/...
>>>
>>> Martin, do you know what could be the reason for that?
>>>
>>> I can see in the logs for both the successful and unsuccessful
>>> basic-suite-4.3 runs that there is no 'ntpdate' on host-1:
>>>
>>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
>>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
>>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 errors:
>>> bash: ntpdate: command not found
>>>
>>> On host-0 everything is ok:
>>>
>>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
>>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
>>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
>>> 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
>>>
>>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>>> Still fails, now on a different component (ovirt-web-ui-extentions):
>>>>
>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>>>
>>>> On Fri, Mar 22, 2019 at 3:59 PM
>>>> Dan Kenigsberg <danken(a)redhat.com
>>>> <mailto:danken@redhat.com>>
wrote:
>>>>
>>>>
>>>>
>>>> On Fri, Mar 22, 2019 at 3:21
>>>> PM Marcin Sobczyk
>>>> <msobczyk(a)redhat.com
>>>>
<mailto:msobczyk@redhat.com>>
>>>> wrote:
>>>>
>>>> Dafna,
>>>>
>>>> in 'verify_add_hosts' we specifically wait for a single host to be up,
>>>> with a timeout:
>>>>
>>>> up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
>>>> if len(up_hosts):
>>>>     return True
>>>>
>>>> The log files say that it took ~50 secs for one of the hosts to be up
>>>> (seems reasonable) and no timeout is being reported.
>>>> Just after running 'verify_add_hosts' we run 'add_master_storage_domain',
>>>> which calls the '_hosts_in_dc' function.
>>>> That function does the exact same check, but it fails:
>>>>
>>>> hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
>>>> if hosts:
>>>>     if random_host:
>>>>         return random.choice(hosts)
>>>>
>>>> I don't think it is relevant to our current failure, but I consider
>>>> random_host=True a bad practice. As if we do not have enough moving
>>>> parts, we are adding intentional randomness. Reproducibility is far
>>>> more important than coverage - particularly for a shared system test
>>>> like OST.
>>>>
>>>>     else:
>>>>         return sorted(hosts, key=lambda host: host.name)
>>>> raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>
>>>> I'm also not able to
>>>> reproduce this issue
>>>> locally on my server. The
>>>> investigation continues...
>>>>
>>>>
>>>> I think that it would be fair
>>>> to take the filtering by host
>>>> state out of Engine and into
>>>> the test, where we can easily
>>>> log the current status of each
>>>> host. Then we'd have better
>>>> understanding on the next failure.
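A rough sketch of what that could look like in the test, assuming the same
ovirtsdk4 connection object the suite already builds (the helper name is
hypothetical):

    import logging

    from ovirtsdk4 import types

    LOGGER = logging.getLogger(__name__)

    def _up_hosts_in_dc(api, dc_name):
        # Search only by datacenter and do the 'up' filtering client-side,
        # logging every host's status so the next failure shows exactly what
        # the engine reported.
        hosts_service = api.system_service().hosts_service()
        all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
        for host in all_hosts:
            LOGGER.debug('host %s reported status %s', host.name, host.status)
        return sorted(
            (h for h in all_hosts if h.status == types.HostStatus.UP),
            key=lambda h: h.name,
        )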
>>>>
>>>> On 3/22/19 1:17 PM, Marcin
>>>> Sobczyk wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> sure, I'm on it - it's weird though, I did run the 4.3 basic suite
>>>>> for this patch manually and everything was OK.
>>>>>
>>>>> On 3/22/19 1:05 PM, Dafna
>>>>> Ron wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We are failing branch 4.3 for test:
>>>>>> 002_bootstrap.add_master_storage_domain
>>>>>>
>>>>>> It seems that on one of the hosts vdsm is not starting; there is
>>>>>> nothing in vdsm.log or in supervdsm.log.
>>>>>>
>>>>>> CQ identified this patch as the suspected root cause:
>>>>>>
>>>>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for flow id
>>>>>>
>>>>>> Milan, Marcin, can you please have a look?
>>>>>>
>>>>>> Full logs:
>>>>>>
>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/b...
>>>>>>
>>>>>> The only error I can see is about the host not being up (which makes
>>>>>> sense, as vdsm is not running).
>>>>>>
>>>>>>
>>>>>> Stacktrace
>>>>>>
>>>>>> File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>>>>     testMethod()
>>>>>> File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>>>>>     self.test(*self.arg)
>>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
>>>>>>     test()
>>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
>>>>>>     return func(get_test_prefix(), *args, **kwargs)
>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
>>>>>>     add_iscsi_storage_domain(prefix)
>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
>>>>>>     host=_random_host_from_dc(api, DC_NAME),
>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
>>>>>>     return _hosts_in_dc(api, dc_name, True)
>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
>>>>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>> 'Could not find hosts that are up in DC test-dc\n-------------------- >> begin captured logging << --------------------\nlago.ssh: DEBUG: starttask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: endtask:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:\n 3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------
>>>>
>>>
_______________________________________________
Devel mailing list -- devel(a)ovirt.org <mailto:devel@ovirt.org>
To unsubscribe send an email to devel-leave(a)ovirt.org <mailto:devel-leave@ovirt.org>
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/ABB6UAM37JO...