On 24 July 2018 at 19:42, Greg Sheremeta <gshereme@redhat.com> wrote:


On Tue, Jul 24, 2018 at 10:53 AM Andrej Krejcir <akrejcir@redhat.com> wrote:


On Mon, 23 Jul 2018 at 15:03, Martin Perina <mperina@redhat.com> wrote:


On Mon, Jul 23, 2018 at 1:32 PM, Dafna Ron <dron@redhat.com> wrote:
Hi,

the issue seems to be that host-1 stopped responding, and I can see some fluentd errors which we should look at.

Jira opened to track this issue: https://ovirt-jira.atlassian.net/browse/OVIRT-2363

Martin, I also added you to the Jira - can you please have a look?

Errors from the host-1 messages log:
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.275347714068506
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 fluent.warn: {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached forwarding server 'lago-basic-suite-master-engine' host=\"lago-basic-suite-master-engine\" port=24224 phi=16.275347714068506"}
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.70444149784817
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15 -0400 fluent.warn: {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached forwarding server 'lago-basic-suite-master-engine' host=\"lago-basic-suite-master-engine\" port=24224 phi=16.70444149784817"}
Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command Invoked with warn=False executable=None _uses_shell=False _raw_params=systemctl is-active 'collectd' removes=None argv=None creates=None chdir=None stdin=None
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New session 29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session 29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session 29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed session 29.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: failed to flush the buffer. error_class="RuntimeError" error="no nodes are available" plugin_id="object:151a620"
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: retry count exceededs limit.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in `write_objects'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in `write_chunk'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [error]: throwing away old logs.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes are available","plugin_id":"object:151a620","message":"failed to flush the buffer. error_class=\"RuntimeError\" error=\"no nodes are available\" plugin_id=\"object:151a620\""}
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.warn: {"message":"retry count exceededs limit."}
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.error: {"message":"throwing away old logs."}
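For anyone triaging similar logs, here is a minimal sketch of how the fluentd warnings above could be summarized. The sample lines are copied from the log above; the function name, the regular expressions, and the shape of the summary are my own assumptions and not part of any fluentd or OST tooling.

```python
import re

# Three lines copied from the host-1 messages log quoted above.
SAMPLE = """
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.275347714068506
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: failed to flush the buffer. error_class="RuntimeError" error="no nodes are available" plugin_id="object:151a620"
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [error]: throwing away old logs.
"""

# Hypothetical patterns, written only against the message shapes seen above.
DETACHED = re.compile(r"\[warn\]: detached forwarding server '([^']+)'.* phi=([0-9.]+)")
FLUSH_FAILED = re.compile(r'\[warn\]: failed to flush the buffer\. .*error="([^"]+)"')
DROPPED = "[error]: throwing away old logs."


def summarize(lines):
    """Collect detached servers, flush errors and whether logs were dropped."""
    summary = {"detached": [], "flush_errors": [], "dropped_logs": False}
    for line in lines:
        match = DETACHED.search(line)
        if match:
            # (server name, phi value reported by fluentd)
            summary["detached"].append((match.group(1), float(match.group(2))))
            continue
        match = FLUSH_FAILED.search(line)
        if match:
            summary["flush_errors"].append(match.group(1))
            continue
        if DROPPED in line:
            summary["dropped_logs"] = True
    return summary


result = summarize(SAMPLE.splitlines())
print(result)
```

On the sample this reports that the engine was detached as a forwarding target, that a flush failed with "no nodes are available", and that fluentd dropped buffered logs. It says nothing about why the engine became unreachable.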



Thanks.
Dafna

Hi,

I can see in vdsm.log that it received a termination signal (signal 15, SIGTERM):

2018-07-23 05:24:26,735-0400 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:68)

And in /var/log/messages I found that mom was killed:

Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM instance configured for VDSM purposes...

...

Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service stop-sigterm timed out. Killing.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service: main process exited, code=killed, status=9/KILL
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM instance configured for VDSM purposes.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit mom-vdsm.service entered failed state.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service failed.
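As a quick check of the timing in the lines above, here is a small sketch that measures how long systemd waited between asking mom to stop and killing it. The two log lines are copied from above; the helper names are hypothetical, and the year is an assumption taken from the vdsm.log timestamp, since syslog timestamps omit it.

```python
from datetime import datetime

# Two lines copied from /var/log/messages as quoted above.
LINES = """\
Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM instance configured for VDSM purposes...
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service stop-sigterm timed out. Killing.
""".splitlines()


def syslog_time(line, year=2018):
    """Parse the 15-character syslog timestamp; the year is assumed."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def seconds_between(lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first with end_marker."""
    start = next(syslog_time(l) for l in lines if start_marker in l)
    end = next(syslog_time(l) for l in lines if end_marker in l)
    return (end - start).total_seconds()


elapsed = seconds_between(LINES, "Stopping MOM instance", "stop-sigterm timed out")
print(elapsed)  # 10.0
```

So mom did not exit within 10 seconds of the stop request and was then killed with signal 9. Whether that 10 seconds is a configured stop timeout for the unit is not something the log itself tells us.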

So Didi/Shirly/Martin, can the fluentd errors be related to the mom shutdown? And could this be the cause of the VDSM shutdown?

Hi,

Mom is not related to fluentd, and a mom shutdown should not cause a vdsm shutdown.

The service dependency between vdsmd and mom-vdsm is weak (using Wants=mom-vdsm.service).
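To illustrate what "weak" means here: with Wants=, systemd pulls the other unit in when this one starts, but a failure or stop of the wanted unit does not take the wanting unit down, unlike Requires=. Below is a rough sketch that classifies a dependency from unit file text. The unit fragment is a hypothetical example built around the one directive mentioned above, not a copy of the shipped vdsmd.service, and the function names are mine.

```python
def unit_dependencies(unit_text):
    """Collect Wants= and Requires= entries from systemd unit file text."""
    deps = {"Wants": [], "Requires": []}
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and section headers such as [Unit].
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in deps:
            # A directive may list several space-separated units.
            deps[key].extend(value.split())
    return deps


def is_weak_dependency(unit_text, other):
    """True if `other` is only wanted, not required."""
    deps = unit_dependencies(unit_text)
    return other in deps["Wants"] and other not in deps["Requires"]


# Hypothetical fragment; only the Wants= line comes from this thread.
EXAMPLE_UNIT = """\
[Unit]
Description=Example service (hypothetical)
Wants=mom-vdsm.service
"""

print(is_weak_dependency(EXAMPLE_UNIT, "mom-vdsm.service"))  # True
```

On a real host, `systemctl cat vdsmd.service` would show the actual directives to feed into something like this.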

Looking at /var/log/messages, both the mom-vdsm and vdsmd services were restarted:

Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM instance configured for VDSM purposes...
...
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM instance configured for VDSM purposes.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit mom-vdsm.service entered failed state.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service failed.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopping Virtual Desktop Server Manager...
...
Jul 23 05:24:29 lago-basic-suite-master-host-1 systemd: Stopped Virtual Desktop Server Manager.
...
Jul 23 05:25:26 lago-basic-suite-master-host-1 systemd: Starting Virtual Desktop Server Manager...
...
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started Virtual Desktop Server Manager.
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started MOM instance configured for VDSM purposes.
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Starting MOM instance configured for VDSM purposes...
...
Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Started MOM instance configured for VDSM purposes.
Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Starting MOM instance configured for VDSM purposes...
 


The error in 008_basic_ui_sanity.py.junit.xml probably means that the docker executable was not found on the machine running the test. Can it be the cause of the failure?

<error type="exceptions.OSError" 
       message="[Errno 2] No such file or directory 
       -------------------- >> begin captured stdout << ---------------------
       executing shell: docker ps 
       --------------------- >> end captured stdout << ---------------

File "/usr/lib64/python2.7/unittest/case.py", line 369, in run testMethod() 
File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) 
File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test test() 
File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 169, in start_grid _docker_cleanup() 
File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 136, in _docker_cleanup _shell(["docker", "ps"]) 
File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 119, in _shell stderr=subprocess.PIPE) 
File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__ errread, errwrite) 
File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child raise child_exception [Errno 2] No such file or directory
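The traceback above is what subprocess produces when the executable itself cannot be found: the OSError with Errno 2 refers to the program, not to an argument. A minimal sketch of a helper that would make this failure mode explicit is below. It is not the test's own `_shell` helper; `run_if_available` is a hypothetical name, and it uses Python 3's `shutil.which`, whereas the suite in the traceback runs on Python 2.7.

```python
import errno
import shutil
import subprocess


def run_if_available(argv):
    """Run argv, but fail with a clear message if the executable is missing."""
    if shutil.which(argv[0]) is None:
        raise RuntimeError(f"required executable not found on PATH: {argv[0]}")
    return subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False
    )


def missing_executable_errno(argv):
    """Return the errno raised when argv[0] does not exist, or None if it ran."""
    try:
        subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as exc:
        return exc.errno
    return None


# A deliberately nonexistent command name, used only for illustration.
FAKE = ["no-such-executable-for-this-sketch", "ps"]
print(missing_executable_errno(FAKE) == errno.ENOENT)  # True
```

Calling `run_if_available(["docker", "ps"])` on a machine without docker would then raise an error that names docker directly instead of a bare "No such file or directory".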


Yep, looks like docker isn't installed. And yes, that would fail it. Any recent changes? I know Gal is working on some containerization of this [1], but I don't know what's been merged.

It seems that there was a short period last week when docker was not available in CentOS. Our mirrors server should protect against this type of issue, but it had problems of its own (it basically ran out of disk space), so the jobs failed over to the upstream CentOS repos and the docker installation in mock failed.
 


[1] Change I5af15dce: Adjust UI test to run inside STDCI container | https://gerrit.ovirt.org/#/c/93074/

Andrej




On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <jenkins@ovirt.org> wrote:
Change 92882,9 (ovirt-engine) is probably the reason behind recent system test
failures in the "ovirt-master" change queue and needs to be fixed.

This change has been removed from the testing queue. Artifacts built from this
change will not be released until it is fixed.

For further details about the change see:
https://gerrit.ovirt.org/#/c/92882/9

For failed test results see:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
_______________________________________________
Infra mailing list -- infra@ovirt.org
To unsubscribe send an email to infra-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/message/6LYYXSGM4LQSRVSYY3IJEIE64LW27TJM/




--
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.


--

GREG SHEREMETA

SENIOR SOFTWARE ENGINEER - TEAM LEAD - RHV UX

Red Hat NA

gshereme@redhat.com    IRC: gshereme


_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/W6BR572DZKYDD6F7E2OBX2725FLLEMXW/




--
Barak Korren
RHV DevOps team , RHCE, RHCi
Red Hat EMEA