On Mon, Jul 23, 2018 at 1:32 PM, Dafna Ron <dron(a)redhat.com> wrote:
Hi,
The issue seems to be that host-1 stopped responding, and I can see some
fluentd errors which we should look at.
Jira opened to track this issue:
https://ovirt-jira.atlassian.net/browse/OVIRT-2363
Martin, I also added you to the Jira - can you please have a look?
error from node-1 messages log:
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.275347714068506
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14 -0400 fluent.warn: {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached forwarding server 'lago-basic-suite-master-engine' host=\"lago-basic-suite-master-engine\" port=24224 phi=16.275347714068506"}
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15 -0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine" port=24224 phi=16.70444149784817
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15 -0400 fluent.warn: {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached forwarding server 'lago-basic-suite-master-engine' host=\"lago-basic-suite-master-engine\" port=24224 phi=16.70444149784817"}
Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command Invoked with warn=False executable=None _uses_shell=False _raw_params=systemctl is-active 'collectd' removes=None argv=None creates=None chdir=None stdin=None
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New session 29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session 29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session 29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed session 29.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: failed to flush the buffer. error_class="RuntimeError" error="no nodes are available" plugin_id="object:151a620"
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: retry count exceededs limit.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in `write_objects'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in `write_chunk'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [warn]: /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 [error]: throwing away old logs.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes are available","plugin_id":"object:151a620","message":"failed to flush the buffer. error_class=\"RuntimeError\" error=\"no nodes are available\" plugin_id=\"object:151a620\""}
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.warn: {"message":"retry count exceededs limit."}
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27 -0400 fluent.error: {"message":"throwing away old logs."}
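To make the "detached forwarding server" warnings easier to read, here is a minimal Python sketch that pulls the server name and phi value out of such lines. The two sample lines are copied from the log above. The threshold of 16 is an assumption: as far as I recall it is the documented default for `phi_threshold` in the out_forward plugin, and I have not checked what this host's fluentd configuration actually sets. The names `PHI_RE`, `ASSUMED_PHI_THRESHOLD` and `detached_servers` are made up for illustration.

```python
import re

# Two "detached forwarding server" warnings copied from the messages log above.
LOG_LINES = [
    "Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 "
    "05:09:14 -0400 [warn]: detached forwarding server "
    "'lago-basic-suite-master-engine' "
    "host=\"lago-basic-suite-master-engine\" port=24224 "
    "phi=16.275347714068506",
    "Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 "
    "05:09:15 -0400 [warn]: detached forwarding server "
    "'lago-basic-suite-master-engine' "
    "host=\"lago-basic-suite-master-engine\" port=24224 "
    "phi=16.70444149784817",
]

# Hypothetical helper regex: captures the quoted server name and the phi value.
PHI_RE = re.compile(
    r"detached forwarding server '(?P<server>[^']+)'.*?phi=(?P<phi>[0-9.]+)"
)

# ASSUMPTION: phi_threshold left at 16; not verified against this host's config.
ASSUMED_PHI_THRESHOLD = 16.0


def detached_servers(lines):
    """Return (server, phi) pairs for every 'detached forwarding server' line."""
    results = []
    for line in lines:
        match = PHI_RE.search(line)
        if match:
            results.append((match.group("server"), float(match.group("phi"))))
    return results


if __name__ == "__main__":
    for server, phi in detached_servers(LOG_LINES):
        over = phi > ASSUMED_PHI_THRESHOLD
        print(f"{server}: phi={phi} above assumed threshold: {over}")
```

Both reported phi values are above the assumed threshold, which would be consistent with fluentd on host-1 deciding that the engine was unreachable and detaching it.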
Thanks.
Dafna
Hi,
I can see in vdsm.log that it received a kill signal:
2018-07-23 05:24:26,735-0400 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:68)
And in /var/log/messages I found that mom was killed:
Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM instance configured for VDSM purposes...
...
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service stop-sigterm timed out. Killing.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service: main process exited, code=killed, status=9/KILL
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM instance configured for VDSM purposes.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit mom-vdsm.service entered failed state.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service failed.
So Didi/Shirly/Martin, can the fluentd error be related to the mom shutdown? And
could this be the cause of the VDSM shutdown?
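To put the timing of these events side by side, here is a small Python sketch that computes the gaps between the timestamps quoted in this thread. The timestamps are taken from the logs above (the year is taken from the fluentd and vdsm lines, and the vdsm timestamp is truncated to whole seconds). The event labels and the helper `seconds_between` are made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the log excerpts in this thread (host-1 local time).
# The dictionary keys are hypothetical labels, not names from any tool.
EVENTS = {
    "fluentd_throws_away_logs": "2018-07-23 05:09:27",
    "mom_stop_requested": "2018-07-23 05:24:16",
    "mom_killed": "2018-07-23 05:24:26",
    "vdsm_sigterm": "2018-07-23 05:24:26",
}


def seconds_between(earlier, later):
    """Whole seconds elapsed from event `earlier` to event `later`."""
    start = datetime.strptime(EVENTS[earlier], FMT)
    end = datetime.strptime(EVENTS[later], FMT)
    return int((end - start).total_seconds())


if __name__ == "__main__":
    print(seconds_between("fluentd_throws_away_logs", "mom_stop_requested"))
    print(seconds_between("mom_stop_requested", "mom_killed"))
    print(seconds_between("mom_stop_requested", "vdsm_sigterm"))
```

By these timestamps the fluentd errors come roughly 15 minutes before systemd starts stopping mom, while the mom kill and the VDSM signal 15 both land 10 seconds after the stop request. That gap alone does not settle whether the fluentd errors are related.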
On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <jenkins(a)ovirt.org> wrote:
> Change 92882,9 (ovirt-engine) is probably the reason behind recent system test
> failures in the "ovirt-master" change queue and needs to be fixed.
>
> This change had been removed from the testing queue. Artifacts build from this
> change will not be released until it is fixed.
>
> For further details about the change see:
> https://gerrit.ovirt.org/#/c/92882/9
>
> For failed test results see:
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
> _______________________________________________
> Infra mailing list -- infra(a)ovirt.org
> To unsubscribe send an email to infra-leave(a)ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/infra(a)ovirt.org/message/6LYYXSGM4LQSRVSYY3IJEIE64LW27TJM/
>
--
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.