Ok, thanks.
On Fri, Mar 29, 2019 at 11:20 AM Martin Perina <mperina(a)redhat.com> wrote:
On Fri, 29 Mar 2019, 11:54 Dafna Ron, <dron(a)redhat.com> wrote:
> Thanks Martin.
> I will follow once merged and report the next failure so we can have a
> better view.
>
I'm not going to merge it, but I will try to execute OST with it until
failure.
>
> On Fri, Mar 29, 2019 at 10:07 AM Martin Perina <mperina(a)redhat.com>
> wrote:
>
>>
>>
>> On Fri, Mar 29, 2019 at 11:04 AM Dafna Ron <dron(a)redhat.com> wrote:
>>
>>> Thank you Marcin :) I am also adding Didi to see if he knows of any
>>> changes.
>>> Anton, who from Engine/api side can help to take it from here?
>>>
>>>
>>> On Fri, Mar 29, 2019 at 9:37 AM Marcin Sobczyk <msobczyk(a)redhat.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> so the thing boils down to this:
>>>>
>>>>
>>>>
>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/471/artifact/...
>>>> This is the 'httpd' log file from the most recent job that failed 'add_master_storage_domain'.
>>>> At the end of the log file, we can see:
>>>>
>>>> 192.168.202.1 - - [28/Mar/2019:11:05:38 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11845
>>>>
>>>> 192.168.202.1 - - [28/Mar/2019:11:05:41 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11847
>>>>
>>>> 192.168.202.1 - - [28/Mar/2019:11:05:44 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11847
>>>>
>>>> 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 5959
>>>>
>>>> 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 7826
>>>>
>>>> 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api HTTP/1.1" 200 5112
>>>>
>>>> 192.168.202.1 - - [28/Mar/2019:11:05:48 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 65
>>>>
>>>> 192.168.202.2 - - [28/Mar/2019:11:05:50 -0400] "POST /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 223
>>>>
>>>> The first batch of 'status=installing or status=initializing' requests and the first 'status=up' request come from 'verify_add_hosts'.
>>>> The second 'status=up' request is the one that comes from the '_hosts_in_dc' call in 'add_master_storage_domain'.
>>>> They are executed with a 1-second time difference, yet the results are not the same - 7826 bytes vs 65 bytes.
>>>> I went through every line of 'engine.log' from that 1-second interval and couldn't find anything that would indicate the host is going down.
>>>> I asked Ravi for help yesterday and found out that the results of these requests come from queries to our postgres db and are somehow cached.
>>>> I see 2 options here - either something really changed the db contents during that 1 second, or there's a bug in our API/cache that returns the wrong info.
>>>> I think we can eliminate the first option by adding collection of the postgres db to our OST suites and looking at its history after a failure - see the sketch below.
>>>> I don't really have a solution for the second possibility - any help from someone who knows the engine's code better would be appreciated.
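>>>> A minimal sketch of that collection step (the helper name and the pg_dump-over-SSH approach here are illustrative assumptions, not our actual OST helpers):
>>>>
>>>> import subprocess
>>>>
>>>> def collect_engine_db(engine_host, dest='engine-db-dump.sql.gz'):
>>>>     """Dump the engine's postgres db so a failed run leaves its
>>>>     contents around for post-mortem inspection."""
>>>>     # pg_dump runs as the postgres user on the engine VM; the dump is
>>>>     # compressed on the wire and written next to the other collected logs.
>>>>     cmd = ['ssh', 'root@{}'.format(engine_host),
>>>>            'su - postgres -c "pg_dump engine" | gzip']
>>>>     with open(dest, 'wb') as f:
>>>>         subprocess.check_call(cmd, stdout=f)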
>>>>
>>>
>> I've created https://gerrit.ovirt.org/99027 to add logging of the hosts and statuses returned by those search queries, so we should have more data on what exactly is returned.
>>
>>
>>>> On 3/26/19 5:37 PM, Dafna Ron wrote:
>>>>
>>>> We never remove any packages from tested :)
>>>> I am actually suspecting a problem that is caused by something that is shared by all projects, like the api.
>>>> We currently have 3 different tests that are randomly failing because of engine/host communication, and I am starting to think they are all connected.
>>>>
>>>>
>>>> On Tue, Mar 26, 2019 at 2:03 PM Eyal Edri <eedri(a)redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 26, 2019 at 3:56 PM Sandro Bonazzola <sbonazzo(a)redhat.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 26, 2019 at 2:42 PM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
>>>>>>
>>>>>>> Ok, maybe I've found something. I was comparing two sets of log files - one for a basic-suite-4.3 run that succeeded and one that did not.
>>>>>>>
>>>>>>> The deployment procedure for host-1 is not complete in the unsuccessful case. The last log entry is:
>>>>>>>
>>>>>>> 2019-03-22 17:54:49,973-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
>>>>>>>
>>>>>>> In the successful case we can see:
>>>>>>>
>>>>>>> 2019-03-25 06:16:31,789-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
>>>>>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start (odeploycons.packages.vdsm.started)
>>>>>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventStart STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
>>>>>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: ***L:INFO Starting ovirt-vmconsole-host-sshd
>>>>>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd
>>>>>>> 2019-03-25 06:16:38,261-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.
>>>>>>> 2019-03-25 06:16:39,479-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
>>>>>>> ...
>>>>>>> ...
>>>>>>>
>>>>>>>
>>>>>>> It looks like the ovirt-host-deploy plugin did not respond?
>>>>>>>
>>>>>>> The timing of:
>>>>>>>
>>>>>>> https://gerrit.ovirt.org/#/c/98762/
>>>>>>>
>>>>>>> kinda aligns with my patch which was merged for vdsm-4.3 and was a suspect:
>>>>>>>
>>>>>>> https://gerrit.ovirt.org/#/c/98748/
>>>>>>>
>>>>>>> Sandro, are you aware of any issues in the ovirt-host-deploy plugin that might have caused these?
>>>>>>>
>>>>>>
>>>>>> I don't really see how a patch touching the automation configuration for STDCI v2 can be the cause of this, but please go ahead and blacklist the last build.
>>>>>> It should differ from the previous one only by the timestamp of the rpm :-)
>>>>>>
>>>>>
>>>> Ehhhh, yeah, I'm so tired of looking at log files I'm starting to see patterns everywhere...
>>>>
>>>>
>>>>
>>>>> If you want to verify that a specific version of an RPM works, you can use the manual job for it.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>> Dafna, can we temporarily blacklist the latest version of
>>>>>>> ovirt-host-deploy to see if it fixes CQ?
>>>>>>>
>>>>>>> Marcin
>>>>>>> On 3/25/19 2:45 PM, Dafna Ron wrote:
>>>>>>>
>>>>>>> Yes, I suspect this is the problem.
>>>>>>> In the host's yum log I can see the times the vdsm packages were installed:
>>>>>>> [dron@dron post-002_bootstrap.py]$ less lago-basic-suite-4-3-host-1/_var_log/yum.log | grep vdsm
>>>>>>> Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:48:52 Installed: vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:49:00 Installed: vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
>>>>>>> Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:49:42 Installed: vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:49:57 Installed: vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
>>>>>>> Mar 25 07:49:58 Installed: vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
>>>>>>> Mar 25 07:49:58 Installed: vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch
>>>>>>>
>>>>>>> and in the engine log you can see that the engine had already failed the installation:
>>>>>>>
>>>>>>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <dron(a)redhat.com> wrote:
>>>>>>>
>>>>>>>> I am also adding Didi.
>>>>>>>> Should this not be logged in host-deploy?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <
>>>>>>>> msobczyk(a)redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Indeed, in my failed job I can see that 'vdsm-python' is installed *after* the engine's attempt to run 'vdsm-tool':
>>>>>>>>>
>>>>>>>>> lago-basic-suite-4-3-host-1/_var_log/yum.log
>>>>>>>>> 403:Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>>>>>>>>>
>>>>>>>>> lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
>>>>>>>>> 1580:2019-03-22 17:53:03,793-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool vdsm-id'
>>>>>>>>> 1581:2019-03-22 17:53:04,043-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/25/19 2:22 PM, Dafna Ron wrote:
>>>>>>>>>
>>>>>>>>> And it's only failing on one host as well.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <
>>>>>>>>> msobczyk(a)redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hmmm, that would suggest that the 'vdsm-tool' command is not found on the host. There was no recent work done around 'vdsm-tool's 'vdsm-id'.
>>>>>>>>>> On 3/25/19 2:08 PM, Dafna Ron wrote:
>>>>>>>>>>
>>>>>>>>>> Actually, I can see in the engine log that it is trying to start the vdsm installation on the host:
>>>>>>>>>>
>>>>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] Entered SsoRestApiAuthFilter
>>>>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating with sso
>>>>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating using BEARER header
>>>>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER header
>>>>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] Entered SsoRestApiNegotiationFilter
>>>>>>>>>> 2019-03-25 07:48:33,124-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
>>>>>>>>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
>>>>>>>>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetClusterForUserByClusterName] compiled
>>>>>>>>>> 2019-03-25 07:48:33,145-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: 7ms
>>>>>>>>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] Compiled stored procedure. Call string is [{call getvmstaticbyname(?)}]
>>>>>>>>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
>>>>>>>>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyname(?)}]
>>>>>>>>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByName] compiled
>>>>>>>>>> 2019-03-25 07:48:33,223-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms
>>>>>>>>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyhostname(?)}]
>>>>>>>>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByHostName] compiled
>>>>>>>>>> 2019-03-25 07:48:33,252-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
>>>>>>>>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
>>>>>>>>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [Getstorage_poolsByClusterId] compiled
>>>>>>>>>> 2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 'root@lago-basic-suite-4-3-host-1'
>>>>>>>>>> 2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 'root@lago-basic-suite-4-3-host-1'
>>>>>>>>>> 2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 'root@lago-basic-suite-4-3-host-1'
>>>>>>>>>> 2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 'root@lago-basic-suite-4-3-host-1'
>>>>>>>>>> 2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id'
>>>>>>>>>> 2019-03-25 07:48:33,902-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>>>>>>>>>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <dron(a)redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have another failing patch with the same test:
>>>>>>>>>>>
>>>>>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/
>>>>>>>>>>>
>>>>>>>>>>> It's obviously not related to the patch, but there is something that is causing these failures randomly.
>>>>>>>>>>> From what I can see in the current failed job, you are correct, and the engine does not even try to deploy host-1.
>>>>>>>>>>>
>>>>>>>>>>> I can see host-1 is getting error 127 in lago for ntpdate, and I can see NetworkManager errors in the host's messages log.
>>>>>>>>>>>
>>>>>>>>>>> In the host's messages log I can see several messages that I suspect may cause issues in communication between the engine and the host:
>>>>>>>>>>>
>>>>>>>>>>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2: _sasldb_getkeyhandle has failed
>>>>>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>>>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>>>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being created...
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being deleted...
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device (/org/freedesktop/NetworkManager/Devices/13)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode balance-rr(0)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8002] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode active-backup(1)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info> [1553514613.8429] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
>>>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> For the failed job, the engine didn't even try to deploy on host-1:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/...
>>>>>>>>>>>>
>>>>>>>>>>>> Martin, do you know what could be the reason for that?
>>>>>>>>>>>>
>>>>>>>>>>>> I can see in the logs for both successful and unsuccessful basic-suite-4.3 runs that there is no 'ntpdate' on host-1:
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
>>>>>>>>>>>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
>>>>>>>>>>>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 errors:
>>>>>>>>>>>> bash: ntpdate: command not found
>>>>>>>>>>>>
>>>>>>>>>>>> On host-0 everything is ok:
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
>>>>>>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
>>>>>>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
>>>>>>>>>>>> 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Still fails, now on a different component (ovirt-web-ui-extentions).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <danken(a)redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dafna,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> in 'verify_add_hosts' we specifically wait for a single host to be up, with a timeout:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 144     up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
>>>>>>>>>>>>>> 145     if len(up_hosts):
>>>>>>>>>>>>>> 146         return True
>>>>>>>>>>>>>>
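>>>>>>>>>>>>>> That check is just the predicate - the suite's polling helper keeps re-evaluating it until a deadline passes. Roughly this shape (a sketch of the mechanism, not the actual testlib code):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> import time
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> def assert_true_within(func, timeout):
>>>>>>>>>>>>>>     # Re-evaluate the predicate until it returns truthy or time runs out.
>>>>>>>>>>>>>>     deadline = time.time() + timeout
>>>>>>>>>>>>>>     while time.time() < deadline:
>>>>>>>>>>>>>>         if func():
>>>>>>>>>>>>>>             return
>>>>>>>>>>>>>>         time.sleep(3)
>>>>>>>>>>>>>>     raise AssertionError('timed out waiting for %s' % func)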
>>>>>>>>>>>>>> The log files say that it took ~50 secs for one of the hosts to be up (seems reasonable) and no timeout is being reported.
>>>>>>>>>>>>>> Just after running 'verify_add_hosts', we run 'add_master_storage_domain', which calls the '_hosts_in_dc' function.
>>>>>>>>>>>>>> That function does the exact same check, but it fails:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 113     hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
>>>>>>>>>>>>>> 114     if hosts:
>>>>>>>>>>>>>> 115         if random_host:
>>>>>>>>>>>>>> 116             return random.choice(hosts)
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think it is relevant to our current failure, but I consider random_host=True a bad practice. As if we do not have enough moving parts, we are adding intentional randomness. Reproducibility is far more important than coverage - particularly for a shared system test like OST.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 117     else:
>>>>>>>>>>>>>> 118         return sorted(hosts, key=lambda host: host.name)
>>>>>>>>>>>>>> 119     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm also not able to reproduce this issue locally on my server. The investigation continues...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think that it would be fair to take the filtering by host state out of Engine and into the test, where we can easily log the current status of each host; see the sketch below. Then we'd have a better understanding on the next failure.
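>>>>>>>>>>>>> A rough sketch of that idea, written against the ovirtsdk4 calls the test already uses (the module-level logger is an assumption of this sketch, not the test's actual logging setup):
>>>>>>>>>>>>>
>>>>>>>>>>>>> import logging
>>>>>>>>>>>>> from ovirtsdk4 import types
>>>>>>>>>>>>>
>>>>>>>>>>>>> LOGGER = logging.getLogger(__name__)
>>>>>>>>>>>>>
>>>>>>>>>>>>> def _hosts_in_dc(api, dc_name):
>>>>>>>>>>>>>     # Fetch all hosts in the DC and filter client-side, so that a
>>>>>>>>>>>>>     # failure leaves a log line with every host's actual status.
>>>>>>>>>>>>>     hosts_service = api.system_service().hosts_service()
>>>>>>>>>>>>>     all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
>>>>>>>>>>>>>     for host in all_hosts:
>>>>>>>>>>>>>         LOGGER.info('host %s is in status %s', host.name, host.status)
>>>>>>>>>>>>>     up_hosts = [h for h in all_hosts if h.status == types.HostStatus.UP]
>>>>>>>>>>>>>     if not up_hosts:
>>>>>>>>>>>>>         raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>>>>>>>>     # Deterministic order instead of random.choice(), for reproducibility.
>>>>>>>>>>>>>     return sorted(up_hosts, key=lambda host: host.name)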
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sure, I'm on it - it's weird though, I did run the 4.3 basic suite for this patch manually and everything was ok.
>>>>>>>>>>>>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are failing branch 4.3 for test: 002_bootstrap.add_master_storage_domain
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems that on one of the hosts vdsm is not starting; there is nothing in vdsm.log or in supervdsm.log.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> CQ identified this patch as the suspected root cause:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for flow id
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Milan, Marcin, can you please have a look?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> full logs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/b...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The only error I can see is about the host not being up (makes sense, as vdsm is not running).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Stacktrace
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>>>>>>>>>>>>     testMethod()
>>>>>>>>>>>>>> File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>>>>>>>>>>>>>     self.test(*self.arg)
>>>>>>>>>>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
>>>>>>>>>>>>>>     test()
>>>>>>>>>>>>>> File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
>>>>>>>>>>>>>>     return func(get_test_prefix(), *args, **kwargs)
>>>>>>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
>>>>>>>>>>>>>>     add_iscsi_storage_domain(prefix)
>>>>>>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
>>>>>>>>>>>>>>     host=_random_host_from_dc(api, DC_NAME),
>>>>>>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
>>>>>>>>>>>>>>     return _hosts_in_dc(api, dc_name, True)
>>>>>>>>>>>>>> File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
>>>>>>>>>>>>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>>>>>>>>> 'Could not find hosts that are up in DC test-dc\n-------------------- >> begin captured logging << --------------------\nlago.ssh: DEBUG: start task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: end task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:\n 3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
_______________________________________________
>>>>>>>>>>>>>> Devel mailing list --
devel(a)ovirt.org
>>>>>>>>>>>>>> To unsubscribe send an email to
devel-leave(a)ovirt.org
>>>>>>>>>>>>>> Privacy Statement:
>>>>>>>>>>>>>>
https://www.ovirt.org/site/privacy-policy/
>>>>>>>>>>>>>> oVirt Code of Conduct:
>>>>>>>>>>>>>>
https://www.ovirt.org/community/about/community-guidelines/
>>>>>>>>>>>>>> List Archives:
>>>>>>>>>>>>>>
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/J4NCHXTK5ZY...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> Eyal edri
>>>>>>>>>>>>
>>>>>>>>>>>> MANAGER
>>>>>>>>>>>>
>>>>>>>>>>>> RHV/CNV DevOps
>>>>>>>>>>>>
>>>>>>>>>>>> EMEA VIRTUALIZATION R&D
>>>>>>>>>>>>
>>>>>>>>>>>> Red Hat EMEA <https://www.redhat.com/>
>>>>>>>>>>>> <https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
>>>>>>>>>>>> phone: +972-9-7692018
>>>>>>>>>>>> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> SANDRO BONAZZOLA
>>>>>>
>>>>>> MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
>>>>>>
>>>>>> Red Hat EMEA <https://www.redhat.com/>
>>>>>>
>>>>>> sbonazzo(a)redhat.com <https://red.ht/sig>
>>>>>>
>>>>>
>>>>>
>>>>