
Dana, are you still seeing it?
On Sun, Apr 21, 2019 at 12:08 PM Dana Elfassy <delfassy@redhat.com> wrote:
Hi,
Are there any updates on this issue? I'm getting this failure as well:
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
Thanks, Dana
On Mon, Apr 8, 2019 at 4:18 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
Hi,
yep, saw that too... I will get back to it as soon as I'm done with the patches I'm currently working on.
Thanks, Marcin
On 4/8/19 2:58 PM, Dafna Ron wrote:
Hi all,
I see that the patch was merged 4 days ago but we still see random failures.
Thanks, Dafna
On Thu, Apr 4, 2019 at 10:53 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:
Hi,
just to keep everyone up to date - Ravi wrote a patch that most probably will resolve this issue:
https://gerrit.ovirt.org/#/c/99124/
Dafna, please keep an eye on the results of the upcoming CQ pipelines and give us feedback on whether the patch helped or not.
Thanks, Marcin
On 3/29/19 3:44 PM, Dafna Ron wrote:
Ok, thanks.
On Fri, Mar 29, 2019 at 11:20 AM Martin Perina <mperina@redhat.com> wrote:
On Fri, 29 Mar 2019, 11:54 Dafna Ron, <dron@redhat.com> wrote:
Thanks Martin. I will follow it once it's merged and report the next failure so we can have a better view.
I'm not going to merge it, but I will try to execute OST with it until failure.
On Fri, Mar 29, 2019 at 10:07 AM Martin Perina <mperina@redhat.com> wrote:
On Fri, Mar 29, 2019 at 11:04 AM Dafna Ron <dron@redhat.com> wrote:
> Thank you Marcin :) I am also adding Didi to see if he knows of any changes.
> Anton, who from the Engine/api side can help to take it from here?
>
> On Fri, Mar 29, 2019 at 9:37 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:
>
>> Hi,
>>
>> so the thing boils down to this:
>>
>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/471/artifact/bas...
>>
>> This is the 'httpd' log file from the most recent job that failed 'add_master_storage_domain'.
>> At the end of the log file, we can see:
>>
>> 192.168.202.1 - - [28/Mar/2019:11:05:38 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11845
>> 192.168.202.1 - - [28/Mar/2019:11:05:41 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11847
>> 192.168.202.1 - - [28/Mar/2019:11:05:44 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 11847
>> 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dinstalling+or+status%3Dinitializing HTTP/1.1" 200 5959
>> 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 7826
>> 192.168.202.1 - - [28/Mar/2019:11:05:47 -0400] "GET /ovirt-engine/api HTTP/1.1" 200 5112
>> 192.168.202.1 - - [28/Mar/2019:11:05:48 -0400] "GET /ovirt-engine/api/hosts?search=datacenter%3Dtest-dc+AND+status%3Dup HTTP/1.1" 200 65
>> 192.168.202.2 - - [28/Mar/2019:11:05:50 -0400] "POST /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 223
>>
>> The first batch of 'status=installing or status=initializing' requests and the first 'status=up' request come from 'verify_add_hosts'.
>> The second 'status=up' request is the one that comes from the '_hosts_in_dc' call in 'add_master_storage_domain'.
>> They are executed one second apart, yet the results are not the same - 7826 bytes vs 65 bytes.
>> I went through every line of 'engine.log' from that one-second interval and couldn't find anything that would indicate the host going down.
>> I asked Ravi for help yesterday and found out that the results of these requests are queries to our postgres db and they are somehow cached.
>> I see two options here - either something really changed the db contents during that one second, or there's a bug in our API/cache that returns the wrong info.
>> I think we can eliminate the first option by adding collection of the postgres db to our OST suites and looking at its history after a failure.
>> I don't really have a solution for the second possibility - any help from someone who knows the engine's code better would be appreciated.
>
> I've created https://gerrit.ovirt.org/99027 to add logging of the hosts and statuses returned by those search queries, so we should have more data on what exactly is returned.
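Until that change lands, the same data can be pulled ad hoc with the Python SDK. A minimal sketch (assuming ovirtsdk4 and placeholder engine URL/credentials rather than the real OST configuration) that re-runs the two searches above and logs each returned host's name and status:

import logging

import ovirtsdk4 as sdk

logging.basicConfig(level=logging.DEBUG)

# Placeholder connection details - not taken from the OST configuration.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
try:
    hosts_service = connection.system_service().hosts_service()
    for search in ('datacenter=test-dc AND status=up',
                   'datacenter=test-dc AND status=installing or status=initializing'):
        hosts = hosts_service.list(search=search)
        logging.debug('search %r returned %d host(s)', search, len(hosts))
        for host in hosts:
            # host.status is a HostStatus enum value reported by the engine.
            logging.debug('  %s: %s', host.name, host.status)
finally:
    connection.close()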
>> On 3/26/19 5:37 PM, Dafna Ron wrote:
>>
>> We never remove any packages from tested :)
>> I am actually suspecting a problem that is caused by something that is shared by all projects, like the api.
>> We currently have 3 different tests that are randomly failing because of engine/host communication, and I am starting to think they are all connected.
>>
>> On Tue, Mar 26, 2019 at 2:03 PM Eyal Edri <eedri@redhat.com> wrote:
>>>
>>> On Tue, Mar 26, 2019 at 3:56 PM Sandro Bonazzola <sbonazzo@redhat.com> wrote:
>>>>
>>>> On Tue, Mar 26, 2019 at 14:42 Marcin Sobczyk <msobczyk@redhat.com> wrote:
>>>>>
>>>>> Ok, maybe I've found something. I was comparing two sets of log files - one for a basic-suite-4.3 run that succeeded and one that did not.
>>>>>
>>>>> The deployment procedure for host-1 is not complete in the unsuccessful case. The last log entry is:
>>>>>
>>>>> 2019-03-22 17:54:49,973-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [6f2e097b] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
>>>>>
>>>>> In the successful case we can see:
>>>>>
>>>>> 2019-03-25 06:16:31,789-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting vdsm.
>>>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start (odeploycons.packages.vdsm.started)
>>>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventStart STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
>>>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: ***L:INFO Starting ovirt-vmconsole-host-sshd
>>>>> 2019-03-25 06:16:38,257-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] nextEvent: Log INFO Starting ovirt-vmconsole-host-sshd
>>>>> 2019-03-25 06:16:38,261-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (VdsDeploy) [3add0cac] EVENT_ID: VDS_INSTALL_IN_PROGRESS(509), Installing Host lago-basic-suite-4-3-host-1. Starting ovirt-vmconsole-host-sshd.
>>>>> 2019-03-25 06:16:39,479-04 DEBUG [org.ovirt.otopi.dialog.MachineDialogParser] (VdsDeploy) [3add0cac] Got: **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vmconsole.packages.Plugin._start (None)
>>>>> ...
>>>>>
>>>>> It looks like the ovirt-host-deploy plugin did not respond?
>>>>>
>>>>> The timing of:
>>>>>
>>>>> https://gerrit.ovirt.org/#/c/98762/
>>>>>
>>>>> kinda aligns with my patch which was merged for vdsm-4.3 and was a suspect:
>>>>>
>>>>> https://gerrit.ovirt.org/#/c/98748/
>>>>>
>>>>> Sandro, are you aware of any issues in the ovirt-host-deploy plugin that might have caused these?
>>>>
>>>> I don't really see how a patch touching the automation configuration for STDCI v2 can be the cause of this, but please go ahead and blacklist the last build.
>>>> It should differ from the previous one only by the timestamp of the rpm :-)
>>>
>> Ehhhh, yeah, I'm so tired of looking at log files I'm starting to see patterns everywhere...
>>
>>> If you want to verify a specific version of an RPM works, you can use the manual job for it.
>>>
>>>>
>>>>> Dafna, can we temporarily blacklist the latest version of ovirt-host-deploy to see if it fixes CQ?
>>>>>
>>>>> Marcin
>>>>>
>>>>> On 3/25/19 2:45 PM, Dafna Ron wrote:
>>>>>
>>>>> yes. I suspect this is the problem.
>>>>> In the host I can see in the yum log the times the vdsm packages are installed:
>>>>>
>>>>> [dron@dron post-002_bootstrap.py]$ less lago-basic-suite-4-3-host-1/_var_log/yum.log |grep vdsm
>>>>> Mar 25 07:48:49 Installed: vdsm-api-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:48:52 Installed: vdsm-yajsonrpc-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:49:00 Installed: vdsm-common-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:49:00 Installed: vdsm-hook-vhostmd-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:49:38 Installed: vdsm-network-4.30.11-23.git276b602.el7.x86_64
>>>>> Mar 25 07:49:38 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:49:39 Installed: vdsm-client-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:49:39 Installed: vdsm-jsonrpc-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:49:39 Installed: vdsm-http-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:49:42 Installed: vdsm-hook-openstacknet-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:49:57 Installed: vdsm-hook-vmfex-dev-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:49:58 Installed: vdsm-4.30.11-23.git276b602.el7.x86_64
>>>>> Mar 25 07:49:58 Installed: vdsm-hook-ethtool-options-4.30.11-23.git276b602.el7.noarch
>>>>> Mar 25 07:49:58 Installed: vdsm-hook-fcoe-4.30.11-23.git276b602.el7.noarch
>>>>>
>>>>> and in the engine you can see that the engine already failed installation:
>>>>>
>>>>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN
>>>>>
>>>>> On Mon, Mar 25, 2019 at 1:35 PM Dafna Ron <dron@redhat.com> wrote:
>>>>>>
>>>>>> I am also adding didi
>>>>>> should this not be logged in host-deploy?
>>>>>>
>>>>>> On Mon, Mar 25, 2019 at 1:33 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
>>>>>>>
>>>>>>> Indeed, in my failed job I can see that 'vdsm-python' is installed *after* the engine's attempt to run 'vdsm-tool':
>>>>>>>
>>>>>>> lago-basic-suite-4-3-host-1/_var_log/yum.log
>>>>>>> 403:Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch
>>>>>>>
>>>>>>> lago-basic-suite-4-3-engine/_var_log/ovirt-engine/engine.log
>>>>>>> 1580:2019-03-22 17:53:03,793-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Executing: '/usr/bin/vdsm-tool vdsm-id'
>>>>>>> 1581:2019-03-22 17:53:04,043-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1'
>>>>>>>
>>>>>>> On 3/25/19 2:22 PM, Dafna Ron wrote:
>>>>>>>
>>>>>>> and it's only failing on one host as well.
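A tiny, hypothetical helper (not part of OST or the engine) that makes the comparison above mechanical: it parses the yum.log timestamp for vdsm-python and the engine.log timestamp of the 'vdsm-tool vdsm-id' attempt, and reports whether the package arrived only after the engine tried to use it, which is exactly what exit code 127 ("command not found") suggests. The two log lines are the ones quoted above; everything else is illustrative.

from datetime import datetime

YUM_LINE = 'Mar 22 17:54:12 Installed: vdsm-python-4.30.11-23.git276b602.el7.noarch'
ENGINE_LINE = ("2019-03-22 17:53:03,793-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] "
               "(default task-1) [d86d9b56-2564-44a9-a8d4-01a4d7f5a529] "
               "Executing: '/usr/bin/vdsm-tool vdsm-id'")

def yum_timestamp(line, year=2019):
    # yum.log omits the year, so it has to be supplied.
    return datetime.strptime('%d %s' % (year, line[:15]), '%Y %b %d %H:%M:%S')

def engine_timestamp(line):
    # Drop the ',793-04' millisecond/offset suffix for a simple comparison.
    return datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S')

if yum_timestamp(YUM_LINE) > engine_timestamp(ENGINE_LINE):
    print('vdsm-python was installed only after the engine ran vdsm-tool '
          '-> exit code 127 is expected')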
>>>>>>> >>>>>>> >>>>>>> On Mon, Mar 25, 2019 at 1:17 PM Marcin Sobczyk < >>>>>>> msobczyk@redhat.com> wrote: >>>>>>> >>>>>>>> Hmmm, that would suggest, that 'vdsm-tool' command is not >>>>>>>> found on the host. There was no work done around 'vdsm-tool's 'vdsm-id' >>>>>>>> recently. >>>>>>>> On 3/25/19 2:08 PM, Dafna Ron wrote: >>>>>>>> >>>>>>>> Actually, I can see in engine that it is trying to start vdsm >>>>>>>> installation on the host: >>>>>>>> >>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] Entered SsoRestApiAuthFilter >>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating with sso >>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter authenticating using BEARER header >>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiAuthFilter] (default task-1) [] SsoRestApiAuthFilter successfully authenticated using BEARER header >>>>>>>> 2019-03-25 07:48:33,123-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] Entered SsoRestApiNegotiationFilter >>>>>>>> 2019-03-25 07:48:33,124-04 DEBUG [org.ovirt.engine.core.aaa.filters.SsoRestApiNegotiationFilter] (default task-1) [] SsoRestApiNegotiationFilter Not performing Negotiate Auth >>>>>>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getclusterforuserbyclustername(?, ?, ?)}] >>>>>>>> 2019-03-25 07:48:33,143-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetClusterForUserByClusterName] compiled >>>>>>>> 2019-03-25 07:48:33,145-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: runQuery, params: [GetClusterByName, NameQueryParameters:{refresh='false', filtered='false'}], timeElapsed: 7ms >>>>>>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] Compiled stored procedure. Call string is [{call getvmstaticbyname(?)}] >>>>>>>> 2019-03-25 07:48:33,161-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [] SqlCall for procedure [GetVmStaticByName] compiled >>>>>>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. 
Call string is [{call getvdsbyname(?)}] >>>>>>>> 2019-03-25 07:48:33,206-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByName] compiled >>>>>>>> 2019-03-25 07:48:33,223-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getByName, params: [lago-basic-suite-4-3-host-1], timeElapsed: 20ms >>>>>>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getvdsbyhostname(?)}] >>>>>>>> 2019-03-25 07:48:33,225-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [GetVdsByHostName] compiled >>>>>>>> 2019-03-25 07:48:33,252-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] method: getAllForHostname, params: [lago-basic-suite-4-3-host-1], timeElapsed: 28ms >>>>>>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Compiled stored procedure. Call string is [{call getstorage_poolsbyclusterid(?)}] >>>>>>>> 2019-03-25 07:48:33,254-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] SqlCall for procedure [Getstorage_poolsByClusterId] compiled >>>>>>>> 2019-03-25 07:48:33,262-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connecting 'root@lago-basic-suite-4-3-host-1' >>>>>>>> 2019-03-25 07:48:33,461-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Connected: 'root@lago-basic-suite-4-3-host-1' >>>>>>>> 2019-03-25 07:48:33,467-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticating: 'root@lago-basic-suite-4-3-host-1' >>>>>>>> 2019-03-25 07:48:33,599-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Authenticated: 'root@lago-basic-suite-4-3-host-1' >>>>>>>> 2019-03-25 07:48:33,600-04 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Executing: '/usr/bin/vdsm-tool vdsm-id' >>>>>>>> 2019-03-25 07:48:33,902-04 WARN [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Failed to initiate vdsm-id request on host: Command returned failure code 127 during SSH session 'root@lago-basic-suite-4-3-host-1' >>>>>>>> 2019-03-25 07:48:33,920-04 INFO [org.ovirt.engine.core.bll.hostdeploy.AddVdsCommand] (default task-1) [6baa2a84-b322-42fd-855d-b25272a19a23] Running command: AddVdsCommand internal: false. 
Entities affected : ID: 319fcdf5-5941-46b1-a75c-d335bf500f82 Type: ClusterAction group CREATE_HOST with role type ADMIN >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Mar 25, 2019 at 12:53 PM Dafna Ron <dron@redhat.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> we have another failing patch with the same test: >>>>>>>>> >>>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/360/ >>>>>>>>> >>>>>>>>> its obviously not related to the patch but there is >>>>>>>>> something that is causing these failures randomly. >>>>>>>>> From what I can see in the current failed job, you are >>>>>>>>> correct and host-1 and engine does not even try to deploy host-1. >>>>>>>>> >>>>>>>>> i can see host-1 is getting error 127 in lago for ntpd and I >>>>>>>>> can see network manager errors in the host messages log >>>>>>>>> >>>>>>>>> In the host messages log I can see several messages that I >>>>>>>>> suspect may cause issues in communication between engine and host: >>>>>>>>> >>>>>>>>> Mar 25 07:50:09 lago-basic-suite-4-3-host-1 >>>>>>>>> sasldblistusers2: _sasldb_getkeyhandle has failed >>>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: >>>>>>>>> error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data >>>>>>>>> pair found >>>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: >>>>>>>>> error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data >>>>>>>>> pair found >>>>>>>>> Mar 25 07:50:10 lago-basic-suite-4-3-host-1 saslpasswd2: >>>>>>>>> error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data >>>>>>>>> pair found >>>>>>>>> >>>>>>>>> >>>>>>>>> ces/12) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>>> bondscan-UMJa2S is being created... >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>>> bondscan-UMJa2S is being deleted... >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 >>>>>>>>> NetworkManager[2658]: <info> [1553514613.7774] manager: (bondscan-UMJa2S): >>>>>>>>> new Bond device (/org/freedesktop/NetworkManager/Devices/13) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> (unregistered net_device) (unregistering): Released all slaves >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>>> bondscan-liwvMR is being created... 
>>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option fail_over_mac: invalid value (3) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option ad_select: invalid value (3) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in >>>>>>>>> mode balance-rr(0) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option arp_validate: invalid value (7) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option xmit_hash_policy: invalid value (5) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 >>>>>>>>> NetworkManager[2658]: <info> [1553514613.8002] manager: (bondscan-liwvMR): >>>>>>>>> new Bond device (/org/freedesktop/NetworkManager/Devices/14) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option primary_reselect: invalid value (3) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option arp_all_targets: invalid value (2) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>>> bondscan-liwvMR is being deleted... >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> (unregistered net_device) (unregistering): Released all slaves >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>>> bondscan-liwvMR is being created... >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option fail_over_mac: invalid value (3) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option ad_select: invalid value (3) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option lacp_rate: mode dependency failed, not supported in >>>>>>>>> mode active-backup(1) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option arp_validate: invalid value (7) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 >>>>>>>>> NetworkManager[2658]: <info> [1553514613.8429] manager: (bondscan-liwvMR): >>>>>>>>> new Bond device (/org/freedesktop/NetworkManager/Devices/15) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option xmit_hash_policy: invalid value (5) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option primary_reselect: invalid value (3) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> bondscan-liwvMR: option arp_all_targets: invalid value (2) >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: bonding: >>>>>>>>> bondscan-liwvMR is being deleted... >>>>>>>>> Mar 25 07:50:13 lago-basic-suite-4-3-host-1 kernel: >>>>>>>>> (unregistered net_device) (unregistering): Released all slaves >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Mar 25, 2019 at 11:24 AM Marcin Sobczyk < >>>>>>>>> msobczyk@redhat.com> wrote: >>>>>>>>> >>>>>>>>>> For the failed job, the engine didn't even try to deploy on >>>>>>>>>> host-1: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/artifact/bas... >>>>>>>>>> >>>>>>>>>> Martin, do you know what could be the reason for that? 
>>>>>>>>>>
>>>>>>>>>> I can see in the logs for both the successful and unsuccessful basic-suite-4.3 runs that there is no 'ntpdate' on host-1:
>>>>>>>>>>
>>>>>>>>>> 2019-03-25 10:14:46,350::ssh.py::ssh::58::lago.ssh::DEBUG::Running d0c49b54 on lago-basic-suite-4-3-host-1: ntpdate -4 lago-basic-suite-4-3-engine
>>>>>>>>>> 2019-03-25 10:14:46,383::ssh.py::ssh::81::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 returned with 127
>>>>>>>>>> 2019-03-25 10:14:46,384::ssh.py::ssh::96::lago.ssh::DEBUG::Command d0c49b54 on lago-basic-suite-4-3-host-1 errors:
>>>>>>>>>> bash: ntpdate: command not found
>>>>>>>>>>
>>>>>>>>>> On host-0 everything is ok:
>>>>>>>>>>
>>>>>>>>>> 2019-03-25 10:14:46,917::ssh.py::ssh::58::lago.ssh::DEBUG::Running d11b2a64 on lago-basic-suite-4-3-host-0: ntpdate -4 lago-basic-suite-4-3-engine
>>>>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::81::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 returned with 0
>>>>>>>>>> 2019-03-25 10:14:53,088::ssh.py::ssh::89::lago.ssh::DEBUG::Command d11b2a64 on lago-basic-suite-4-3-host-0 output:
>>>>>>>>>> 25 Mar 06:14:53 ntpdate[6646]: adjust time server 192.168.202.2 offset 0.017150 sec
>>>>>>>>>>
>>>>>>>>>> On 3/25/19 10:13 AM, Eyal Edri wrote:
>>>>>>>>>>
>>>>>>>>>> Still fails, now on a different component (ovirt-web-ui-extentions).
>>>>>>>>>>
>>>>>>>>>> https://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/339/
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 22, 2019 at 3:59 PM Dan Kenigsberg <danken@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 22, 2019 at 3:21 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dafna,
>>>>>>>>>>>>
>>>>>>>>>>>> in 'verify_add_hosts' we specifically wait for a single host to be up, with a timeout:
>>>>>>>>>>>>
>>>>>>>>>>>> 144     up_hosts = hosts_service.list(search='datacenter={} AND status=up'.format(DC_NAME))
>>>>>>>>>>>> 145     if len(up_hosts):
>>>>>>>>>>>> 146         return True
>>>>>>>>>>>>
>>>>>>>>>>>> The log files say that it took ~50 secs for one of the hosts to be up (seems reasonable) and no timeout is reported.
>>>>>>>>>>>> Just after running 'verify_add_hosts', we run 'add_master_storage_domain', which calls the '_hosts_in_dc' function.
>>>>>>>>>>>> That function does the exact same check, but it fails:
>>>>>>>>>>>>
>>>>>>>>>>>> 113     hosts = hosts_service.list(search='datacenter={} AND status=up'.format(dc_name))
>>>>>>>>>>>> 114     if hosts:
>>>>>>>>>>>> 115         if random_host:
>>>>>>>>>>>> 116             return random.choice(hosts)
>>>>>>>>>>>
>>>>>>>>>>> I don't think it is relevant to our current failure, but I consider random_host=True a bad practice. As if we do not have enough moving parts, we are adding intentional randomness. Reproducibility is far more important than coverage - particularly for a shared system test like OST.
>>>>>>>>>>>
>>>>>>>>>>>> 117         else:
>>>>>>>>>>>> 118             return sorted(hosts, key=lambda host: host.name)
>>>>>>>>>>>> 119     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>>>>>>>
>>>>>>>>>>>> I'm also not able to reproduce this issue locally on my server. The investigation continues...
>>>>>>>>>>>
>>>>>>>>>>> I think that it would be fair to take the filtering by host state out of Engine and into the test, where we can easily log the current status of each host. Then we'd have a better understanding on the next failure.
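One way to realize what Dan suggests: a minimal sketch (not the actual OST code) of a '_hosts_in_dc'-style helper that asks the engine only for 'datacenter=<dc>', logs the status of every host it gets back, and does the 'up' filtering on the client side; it also returns a deterministically sorted list instead of a random host, in line with the reproducibility concern above. The helper name and the assumption that 'api' is an ovirtsdk4 Connection mirror the existing test but are not taken from any merged patch.

import logging

from ovirtsdk4 import types


def _hosts_in_dc(api, dc_name):
    """Sketch: list every host in the DC, log its status, filter client-side."""
    hosts_service = api.system_service().hosts_service()
    all_hosts = hosts_service.list(search='datacenter={}'.format(dc_name))
    for host in all_hosts:
        # Logging each host's status here is the point of the change:
        # on the next failure we see exactly what the engine reported.
        logging.info('host %s is in status %s', host.name, host.status)
    up_hosts = [host for host in all_hosts if host.status == types.HostStatus.UP]
    if up_hosts:
        # Deterministic order instead of random.choice(), per the
        # reproducibility concern raised above.
        return sorted(up_hosts, key=lambda host: host.name)
    raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)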
>>>>>>>>>>>
>>>>>>>>>>> On 3/22/19 1:17 PM, Marcin Sobczyk wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> sure, I'm on it - it's weird though, I did run the 4.3 basic suite for this patch manually and everything was ok.
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/22/19 1:05 PM, Dafna Ron wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> We are failing branch 4.3 for test: 002_bootstrap.add_master_storage_domain
>>>>>>>>>>>>
>>>>>>>>>>>> It seems that on one of the hosts vdsm is not starting; there is nothing in vdsm.log or in supervdsm.log.
>>>>>>>>>>>>
>>>>>>>>>>>> CQ identified this patch as the suspected root cause:
>>>>>>>>>>>>
>>>>>>>>>>>> https://gerrit.ovirt.org/#/c/98748/ - vdsm: client: Add support for flow id
>>>>>>>>>>>>
>>>>>>>>>>>> Milan, Marcin, can you please have a look?
>>>>>>>>>>>>
>>>>>>>>>>>> full logs:
>>>>>>>>>>>>
>>>>>>>>>>>> http://jenkins.ovirt.org/job/ovirt-4.3_change-queue-tester/326/artifact/basi...
>>>>>>>>>>>>
>>>>>>>>>>>> The only error I can see is about the host not being up (makes sense, as vdsm is not running).
>>>>>>>>>>>>
>>>>>>>>>>>> Stacktrace
>>>>>>>>>>>>
>>>>>>>>>>>>   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>>>>>>>>>>>     testMethod()
>>>>>>>>>>>>   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
>>>>>>>>>>>>     self.test(*self.arg)
>>>>>>>>>>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
>>>>>>>>>>>>     test()
>>>>>>>>>>>>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
>>>>>>>>>>>>     return func(get_test_prefix(), *args, **kwargs)
>>>>>>>>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 417, in add_master_storage_domain
>>>>>>>>>>>>     add_iscsi_storage_domain(prefix)
>>>>>>>>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 561, in add_iscsi_storage_domain
>>>>>>>>>>>>     host=_random_host_from_dc(api, DC_NAME),
>>>>>>>>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 122, in _random_host_from_dc
>>>>>>>>>>>>     return _hosts_in_dc(api, dc_name, True)
>>>>>>>>>>>>   File "/home/jenkins/workspace/ovirt-4.3_change-queue-tester/ovirt-system-tests/basic-suite-4.3/test-scenarios/002_bootstrap.py", line 119, in _hosts_in_dc
>>>>>>>>>>>>     raise RuntimeError('Could not find hosts that are up in DC %s' % dc_name)
>>>>>>>>>>>> 'Could not find hosts that are up in DC test-dc\n-------------------- >> begin captured logging << --------------------\nlago.ssh: DEBUG: start task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: end task:937bdea7-a2a3-47ad-9383-36647ea37ddf:Get ssh client for lago-basic-suite-4-3-engine:\nlago.ssh: DEBUG: Running c07b5ee2 on lago-basic-suite-4-3-engine: cat /root/multipath.txt\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine returned with 0\nlago.ssh: DEBUG: Command c07b5ee2 on lago-basic-suite-4-3-engine output:\n 3600140516f88cafa71243648ea218995\n360014053e28f60001764fed9978ec4b3\n360014059edc777770114a6484891dcf1\n36001405d93d8585a50d43a4ad0bd8d19\n36001405e31361631de14bcf87d43e55a\n\n-----------