Dana, are you still seeing it? 

On Sun, Apr 21, 2019 at 12:08 PM Dana Elfassy <delfassy@redhat.com> wrote:
Hi,
Are there any updates on this issue?
I'm getting this failure as well:
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/4582/console

Thanks,
Dana

On Mon, Apr 8, 2019 at 4:18 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:

Hi,

yep, saw that too... I will get back to it as soon as I'm done with patches I'm currently working on.

Thanks, Marcin

On 4/8/19 2:58 PM, Dafna Ron wrote:
Hi all,

I see that the patch was merged 4 days ago but we still see random failures. 

Thanks,
Dafna




On Thu, Apr 4, 2019 at 10:53 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:

Hi,

just to keep everyone up to date - Ravi wrote a patch that most probably will resolve this issue:

https://gerrit.ovirt.org/#/c/99124/

Dafna, please keep an eye on the results of the upcoming CQ pipelines and give us feedback on whether the patch helped.

Thanks, Marcin

On 3/29/19 3:44 PM, Dafna Ron wrote:
Ok, thanks.

On Fri, Mar 29, 2019 at 11:20 AM Martin Perina <mperina@redhat.com> wrote:


On Fri, 29 Mar 2019, 11:54 Dafna Ron, <dron@redhat.com> wrote:
Thanks Martin.
I will follow up once it's merged and report the next failure so we can have a better view.

I'm not going to merge it, but I will try to execute OST with it until failure. 



On Fri, Mar 29, 2019 at 10:07 AM Martin Perina <mperina@redhat.com> wrote:


On Fri, Mar 29, 2019 at 11:04 AM Dafna Ron <dron@redhat.com> wrote:
Thank you Marcin :) I am also adding Didi to see if he knows of any changes.
Anton, who on the Engine/API side can help take it from here?


On Fri, Mar 29, 2019 at 9:37 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:

Hi,

so the thing boils down to this:

https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/471/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_etc_httpd/logs/ssl_access_log/*view*/

This is the 'httpd' log file from the most recent job that failed 'add_master_storage_domain'.
At the end of the log file, we can see:
192.168.202.1 - - [28/Mar/2019:11:05:38 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11845
192.168.202.1 - - [28/Mar/2019:11:05:41 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11847
192.168.202.1 - - [28/Mar/2019:11:05:44 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11847
192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 5959
192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 7826
192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api HTTP/1.1" 200 5112
192.168.202.1 - - [28/Mar/2019:11:05:48 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 65
192.168.202.2 - - [28/Mar/2019:11:05:50 -0400] "POST /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 223
The first batch of 'status=installing or status=initializing' requests and the first 'status=up' request come from 'verify_add_hosts'.
The second 'status=up' request is the one that comes from the '_hosts_in_dc' call in 'add_master_storage_domain'.
They are executed one second apart, yet the results are not the same - 7826 bytes vs 65 bytes.
I went through every line of 'engine.log' from that 1-second interval and couldn't find anything that would indicate the host is going down.
I asked Ravi for help yesterday and found out that the results of these requests come from queries to our Postgres DB and are somehow cached.
I see 2 options here - either something really changed the DB contents during that 1 second, or there's a bug in our API/cache that returns the wrong info.
I think we can eliminate the first option by adding collection of the Postgres DB to our OST suites and looking at its history after a failure.
I don't really have a solution for the second possibility - any help from someone who knows the engine's code better would be appreciated.
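
Purely for illustration, here is a minimal sketch of the kind of logging that would help tell the two options apart: dump both the unfiltered host list and the 'status=up' result via the same SDK search the suite already uses. The helper name and log messages below are assumptions, not existing OST code:

import logging

LOGGER = logging.getLogger(__name__)

def log_dc_host_statuses(connection, dc_name):
    # Same hosts_service/search API that 002_bootstrap.py already uses;
    # only the helper name and the log lines are made up here.
    hosts_service = connection.system_service().hosts_service()
    all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
    for host in all_hosts:
        LOGGER.info('host %s: status=%s', host.name, host.status)
    up_hosts = hosts_service.list(
        search='datacenter={} AND status=up'.format(dc_name))
    LOGGER.info('hosts reported up: %s', sorted(h.name for h in up_hosts))
    return up_hosts

If the 'up' list is empty while the unfiltered list still shows the host as UP, that would point at the API/cache rather than a real DB change.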

I've created https://gerrit.ovirt.org/99027 to add logging of the hosts and statuses returned by those search queries, so we should have more data on what exactly is returned.


On 3/26/19 5:37 PM, Dafna Ron wrote:
We never remove any packages from 'tested' :)
I actually suspect a problem caused by something that is shared by all projects, like the API.
We currently have 3 different tests that are randomly failing because of engine/host communication, and I am starting to think they are all connected.


On Tue, Mar 26, 2019 at 2:03 PM Eyal Edri <eedri@redhat.com> wrote:


On Tue, Mar 26, 2019 at 3:56 PM Sandro Bonazzola <sbonazzo@redhat.com> wrote:


Il giorno mar 26 mar 2019 alle ore 14:42 Marcin Sobczyk <msobczyk@redhat.com> ha scritto:

Ok, maybe I've found something. I was comparing two sets of log files - one for a basic-suite-4.3 run that succeeded and one that did not.

The deployment procedure for host-1 does not complete in the unsuccessful case. The last log entry is:

2019-03-22 17:54:49,973-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.

In the successful case we can see:

2019-03-25 06:16:31,789-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.                                                                                                                          
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start (odeploycons.packages.vdsm.started)                                                                                                                  
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventStart STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)                                                                                                                                        
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: ***L:INFO Starting ovirt-vmconsole-host-sshd                        
2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd                   
2019-03-25 06:16:38,261-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.                                                                                                     
2019-03-25 06:16:39,479-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
...

It looks like the ovirt-host-deploy plugin did not respond?

The timing of:

https://gerrit.ovirt.org/#/c/98762/

kinda aligns with my patch which was merged for vdsm-4.3 and was a suspect:

https://gerrit.ovirt.org/#/c/98748/

Sandro, are you aware of any issues with the ovirt-host-deploy plugin that might have caused this?


I don't really see how a patch touching the automation configuration for STDCI v2 can be the cause of this, but please go ahead and blacklist the last build.
It should differ from the previous one only by the timestamp of the RPM :-)


Ehhhh, yeah, I'm so tired of looking at log files I'm starting to see patterns everywhere...



If you want to verify that a specific version of an RPM works, you can use the manual job for it.
 
 

Dafna, can we temporarily blacklist the latest version of ovirt-host-deploy to see if it fixes CQ?

Marcin

On 3/25/19 2:45 PM, Dafna Ron wrote:
Yes, I suspect this is the problem.
On the host I can see in the yum log the times the vdsm packages were installed:
[dron@dron post-002_bootstrap.py]$ less lago-basic-suite-4-3-host-1/_var_log/yum.log |grep vdsm
Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
Mar 25 07:48:52 Installed: vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:00 Installed: vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:42 Installed: vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:57 Installed: vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
Mar 25 07:49:58 Installed: vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
Mar 25 07:49:58 Installed: vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch

And in the engine log you can see that the engine had already failed the installation:
2019-03-25 07:48:33,920-04 INFO  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected :  ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN

On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <dron@redhat.com> wrote:
I am also adding Didi.
Shouldn't this be logged in host-deploy?


On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:

Indeed, in my failed job I can see that 'vdsm-python' is installed *after* the engine's attempt to run 'vdsm-tool':

lago-basic-suite-4-3-host-1/_var_log/yum.log
403:Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
1580:2019-03-22 17:53:03,793-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool vdsm-id'
1581:2019-03-22 17:53:04,043-04 WARN  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
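
If that race is the culprit (exit code 127 means the shell did not find '/usr/bin/vdsm-tool'), one way to confirm or guard against it in the suite would be to wait for the binary to show up on the host before the add-host step. This is only a sketch - 'run_on_host' stands in for whatever SSH helper the suite uses and is assumed to return an object with a '.code' attribute:

import time

def wait_for_vdsm_tool(run_on_host, timeout=300, interval=5):
    # 'command -v' exits 0 only when the binary is on PATH; the engine's
    # SSH attempt above returned 127 because vdsm-python was not yet installed.
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = run_on_host(['command', '-v', 'vdsm-tool'])
        if result.code == 0:
            return
        time.sleep(interval)
    raise RuntimeError('vdsm-tool did not appear on the host within %ss' % timeout)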


On 3/25/19 2:22 PM, Dafna Ron wrote:
And it's only failing on one host as well.


On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:

Hmmm, that would suggest that the 'vdsm-tool' command is not found on the host. There was no recent work done around 'vdsm-tool's 'vdsm-id'.

On 3/25/19 2:08 PM, Dafna Ron wrote:
Actually, I can see in the engine log that it is trying to start the vdsm installation on the host:

2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] Entered SsoRestApiAuthFilter
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating with sso
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating using BEARER header
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER header
2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] Entered SsoRestApiNegotiationFilter
2019-03-25 07:48:33,124-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth
2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}]
2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetClusterForUserByClusterName] compiled
2019-03-25 07:48:33,145-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: 7ms
2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] Compiled stored procedure. Call string is [{call getvmstaticbyname(?)}]
2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled
2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyname(?)}]
2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByName] compiled
2019-03-25 07:48:33,223-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms
2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyhostname(?)}]
2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByHostName] compiled
2019-03-25 07:48:33,252-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms
2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getstorage_poolsbyclusterid(?)}]
2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [Getstorage_poolsByClusterId] compiled
2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id'
2019-03-25 07:48:33,902-04 WARN  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
2019-03-25 07:48:33,920-04 INFO  [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected :  ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN



On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <dron@redhat.com> wrote:
We have another failing patch with the same test:

It's obviously not related to the patch, but something is causing these failures randomly.
From what I can see in the current failed job, you are correct and the engine does not even try to deploy host-1.

I can see host-1 is getting error 127 in lago for ntpdate, and I can see NetworkManager errors in the host's messages log.

In the host's messages log I can see several messages that I suspect may cause issues in communication between the engine and the host:

Mar 25 07:50:09 lago-basic-suite-4-3-host-1 sasldblistusers2: _sasldb_getkeyhandle has failed
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found


Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-UMJa2S is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info>  [1553514613.7774] manager: (bondscan-UMJa2S): new Bond device (/org/freedesktop/NetworkManager/Devices/13)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode balance-rr(0)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info>  [1553514613.8002] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/14)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being created...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option fail_over_mac: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option ad_select: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in mode active-backup(1)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_validate: invalid value (7)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 NetworkManager[2658]: <info>  [1553514613.8429] manager: (bondscan-liwvMR): new Bond device (/org/freedesktop/NetworkManager/Devices/15)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option xmit_hash_policy: invalid value (5)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option primary_reselect: invalid value (3)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bondscan-liwvMR: option arp_all_targets: invalid value (2)
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: bondscan-liwvMR is being deleted...
Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: (unregistered net_device) (unregistering): Released all slaves


On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:

For the failed job, the engine didn't even try to deploy on host-1:

https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/

Martin, do you know what could be the reason for that?

I can see in the logs for both successful and unsuccessful basic-suite-4.3 runs that there is no 'ntpdate' on host-1:

2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1  errors:
 bash: ntpdate: command not found

On host-0 everything is ok:

2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
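
Regardless of whether this is related to the bootstrap failure, the 127 above is bash's "command not found", so the time-sync step could fail loudly (or fall back) when 'ntpdate' is missing instead of leaving the host silently unsynced. A sketch only - 'vm.ssh' mirrors how the commands above are run, but the exact helper and its return object are assumptions:

def sync_time_with_engine(vm, engine_name='lago-basic-suite-4-3-engine'):
    # Probe for the binary first; a 127 from 'ntpdate' itself is otherwise
    # easy to miss among the other deploy logs.
    probe = vm.ssh(['command', '-v', 'ntpdate'])
    if probe.code != 0:
        raise RuntimeError('ntpdate is not installed on %s' % vm.name())
    result = vm.ssh(['ntpdate', '-4', engine_name])
    if result.code != 0:
        raise RuntimeError('ntpdate failed on %s: %s' % (vm.name(), result.err))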
On 3/25/19 10:13 AM, Eyal Edri wrote:
Still fails, now on a different component (ovirt-web-ui-extentions).


On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <danken@redhat.com> wrote:


On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:

Dafna,

in 'verify_add_hosts' we specifically wait, with a timeout, for a single host to be up:

up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
if len(up_hosts):
    return True
The log files say that it took ~50 secs for one of the hosts to be up (seems reasonable) and no timeout is reported.
Just after running 'verify_add_hosts', we run 'add_master_storage_domain', which calls the '_hosts_in_dc' function.
That function does the exact same check, but it fails:
hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
if hosts:
    if random_host:
        return random.choice(hosts)
    else:
        return sorted(hosts, key=lambda host: host.name)
raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)

I don't think it is relevant to our current failure; but I consider random_host=True as a bad practice. As if we do not have enough moving parts, we are adding intentional randomness. Reproducibility is far more important than coverage - particularly for a shared system test like OST.

I'm also not able to reproduce this issue locally on my server. The investigation continues...


I think it would be fair to take the filtering by host state out of the Engine and into the test, where we can easily log the current status of each host. Then we'd have a better understanding on the next failure.
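
A rough sketch of what that could look like, reusing the hosts_service search from the excerpts above but doing the status filtering (and the logging of each host's reported state) in the test itself - the names and log lines here are illustrative, not a merged change:

import logging
import ovirtsdk4.types as types

LOGGER = logging.getLogger(__name__)

def _up_hosts_in_dc(hosts_service, dc_name):
    # Fetch every host in the DC without a status filter, so a failure
    # leaves a record of what the engine reported for each host.
    hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
    for host in hosts:
        LOGGER.info('DC %s: host %s reported as %s', dc_name, host.name, host.status)
    up_hosts = [h for h in hosts if h.status == types.HostStatus.UP]
    if not up_hosts:
        raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
    return sorted(up_hosts, key=lambda host: host.name)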

On 3/22/19 1:17 PM, Marcin Sobczyk wrote:

Hi,

sure, I'm on it - it's weird though, I did run the 4.3 basic suite for this patch manually and everything was OK.

On 3/22/19 1:05 PM, Dafna Ron wrote:
Hi,

We are failing branch 4.3 for test: 002_bootstrap.add_master_storage_domain

It seems that on one of the hosts vdsm is not starting;
there is nothing in vdsm.log or in supervdsm.log.

CQ identified this patch as the suspected root cause:

https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for flow id

Milan, Marcin, can you please have a look?

full logs:


The only error I can see is about the host not being up (which makes sense, as vdsm is not running).

Stacktrace

  File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
    testMethod()
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
    test()
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
    return func(get_test_prefix(), *args, **kwargs)
  File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
    add_iscsi_storage_domain(prefix)
  File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
    host=_random_host_from_dc(api, DC_NAME),
  File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
    return _hosts_in_dc(api, dc_name, True)
  File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
    raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
'Could not find hosts that are up in DC test-dc\n-------------------- >> begin captured logging << --------------------\nlago.ssh: DEBUG: start task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: end task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:\n 3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------






_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/JVE4CPW2T53KE6UORJ44EZA7IR4JJKF2/