On Mon, Jul 23, 2018 at 1:32 PM, Dafna Ron <dron@redhat.com> wrote:
Hi,

The issue seems to be that host-1 stopped responding, and I can see some fluentd errors which we should look at.

Jira opened to track this issue: https://ovirt-jira.atlassian.net/browse/OVIRT-2363

Martin, I also added you to the Jira - can you please have a look?

Errors from the host-1 messages log:
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.275347714068506
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 fluent.warn: {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached forwarding server 'lago-basic-suite-master-engine' host=\"lago-basic-suite-master-engine\" port=24224 phi=16.275347714068506"}
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.70444149784817
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15 -0400 fluent.warn: {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached forwarding server 'lago-basic-suite-master-engine' host=\"lago-basic-suite-master-engine\" port=24224 phi=16.70444149784817"}
Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command Invoked with warn=False executable=None _uses_shell=False _raw_params=systemctl is-active 'collectd' removes=None argv=None creates=None chdir=None stdin=None
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New session 29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session 29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session 29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed session 29.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: failed to flush the buffer. error_class="RuntimeError" error="no nodes are available" plugin_id="object:151a620"
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: retry count exceededs limit.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in `write_objects'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in `write_chunk'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [error]: throwing away old logs.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes are available","plugin_id":"object:151a620","message":"failed to flush the buffer. error_class=\"RuntimeError\" error=\"no nodes are available\" plugin_id=\"object:151a620\""}
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.warn: {"message":"retry count exceededs limit."}
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.error: {"message":"throwing away old logs."}



Thanks.
Dafna

Hi,

I can see in vdsm.log that it received a termination signal (signal 15, SIGTERM):

2018-07-23 05:24:26,735-0400 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsmd:68)

And in /var/log/messages I found that mom was killed:

Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM instance configured for VDSM purposes...

...

Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service stop-sigterm timed out. Killing.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service: main process exited, code=killed, status=9/KILL
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM instance configured for VDSM purposes.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit mom-vdsm.service entered failed state.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service failed.

So Didi/Shirly/Martin, can the fluentd error be related to the mom shutdown? And could this be the cause of the VDSM shutdown?
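
To put the timing of the two events side by side, here is a minimal Python sketch. It is only an illustration, not something that was run as part of this investigation. The sample lines are copied from the log excerpts in this thread; the helper names (parse_syslog_time, first_match) and the hard-coded year are assumptions made for the example, and the month parsing assumes an English/C locale.

```python
from datetime import datetime

# Sample lines copied from the /var/log/messages excerpts quoted in this thread.
LINES = [
    "Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: "
    "2018-07-23 05:09:27 -0400 [error]: throwing away old logs.",
    "Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: "
    "Stopping MOM instance configured for VDSM purposes...",
    "Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: "
    "mom-vdsm.service stop-sigterm timed out. Killing.",
]


def parse_syslog_time(line, year=2018):
    """Parse the leading 'Mon DD HH:MM:SS' syslog stamp (hypothetical helper).

    Syslog stamps carry no year, so it is passed in explicitly (assumed 2018).
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def first_match(lines, needle):
    """Return the timestamp of the first line containing needle, or None."""
    for line in lines:
        if needle in line:
            return parse_syslog_time(line)
    return None


fluentd_drop = first_match(LINES, "throwing away old logs")
mom_stop = first_match(LINES, "Stopping MOM instance")
gap = mom_stop - fluentd_drop

print(f"fluentd dropped its buffer at {fluentd_drop.time()}")
print(f"systemd began stopping MOM at {mom_stop.time()}")
print(f"gap between the two events: {gap}")
```

With these sample lines the gap is 0:14:49, so fluentd dropped its buffer about 15 minutes before systemd began stopping MOM. On its own that neither establishes nor rules out a causal link; it only narrows down which part of the logs to look at.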




On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <jenkins@ovirt.org> wrote:
Change 92882,9 (ovirt-engine) is probably the reason behind recent system test
failures in the "ovirt-master" change queue and needs to be fixed.

This change has been removed from the testing queue. Artifacts built from this
change will not be released until it is fixed.

For further details about the change see:
https://gerrit.ovirt.org/#/c/92882/9

For failed test results see:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/message/6LYYXSGM4LQSRVSYY3IJEIE64LW27TJM/




--
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.