On Mon, 23 Jul 2018 at 15:03, Martin Perina <mperina(a)redhat.com> wrote:
On Mon, Jul 23, 2018 at 1:32 PM, Dafna Ron <dron(a)redhat.com> wrote:
> Hi,
>
> the issue seems to be that host-1 stopped responding, and I can see some
> fluentd errors which we should look at.
>
> Jira opened to track this issue:
> https://ovirt-jira.atlassian.net/browse/OVIRT-2363
>
> Martin, I also added you to the Jira - can you please have a look?
>
> Errors from the host-1 messages log:
> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:14 -0400 [warn]: detached forwarding server
> 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine"
> port=24224 phi=16.275347714068506
> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd:
> ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:14 -0400 fluent.warn:
> {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached
> forwarding server 'lago-basic-suite-master-engine'
> host=\"lago-basic-suite-master-engine\" port=24224
> phi=16.275347714068506"}
> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:15 -0400 [warn]: detached forwarding server
> 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine"
> port=24224 phi=16.70444149784817
> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd:
> ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:15 -0400 fluent.warn:
> {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached
> forwarding server 'lago-basic-suite-master-engine'
> host=\"lago-basic-suite-master-engine\" port=24224
> phi=16.70444149784817"}
> Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command
> Invoked with warn=False executable=None _uses_shell=False
> _raw_params=systemctl is-active 'collectd' removes=None argv=None
> creates=None chdir=None stdin=None
> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New
> session 29 of user root.
> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session
> 29 of user root.
> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session
> 29 of user root.
> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed
> session 29.
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 [warn]: failed to flush the buffer.
> error_class="RuntimeError" error="no nodes are available"
> plugin_id="object:151a620"
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 [warn]: retry count exceededs limit.
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 [warn]:
> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in
> `write_objects'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 [warn]:
> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 [warn]:
> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in
> `write_chunk'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 [warn]:
> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 [warn]:
> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 [warn]:
> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 [error]: throwing away old logs.
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes
> are available","plugin_id":"object:151a620","message":"failed to flush the
> buffer. error_class=\"RuntimeError\" error=\"no nodes are available\"
> plugin_id=\"object:151a620\""}
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 fluent.warn: {"message":"retry count exceededs limit."}
> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
> 05:09:27 -0400 fluent.error: {"message":"throwing away old logs."}
>
>
>
> Thanks.
> Dafna
>
Hi,
I can see in vdsm.log that it received a kill signal:
2018-07-23 05:24:26,735-0400 INFO (MainThread) [vds] Received signal 15,
shutting down (vdsmd:68)
And in /var/log/messages I found that mom was killed:
Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM
instance configured for VDSM purposes...
...
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
stop-sigterm timed out. Killing.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service:
main process exited, code=killed, status=9/KILL
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM
instance configured for VDSM purposes.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit
mom-vdsm.service entered failed state.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
failed.
So, Didi/Shirly/Martin, can the fluentd error be related to the mom
shutdown? And could this be the cause of the VDSM shutdown?
Hi,
Mom is not related to fluentd, and a mom shutdown should not cause a vdsm
shutdown.
The service dependency between vdsmd and mom-vdsm is weak (using
Wants=mom-vdsm.service).
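A rough way to double-check which kind of dependency is declared, as a
minimal Python 3 sketch: it classifies a unit in text shaped like the output
of `systemctl show vdsmd.service -p Requires -p Wants`. The function name and
the sample text are illustrative assumptions, not output captured from host-1.

```python
def dependency_kind(show_output, unit):
    """Return the first of Requires/BindsTo/Wants that lists `unit`, or None.

    `show_output` is expected to look like `systemctl show <unit> -p ...`
    output: one KEY=VALUE pair per line, unit names separated by spaces.
    """
    props = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.split()
    # Check the stronger dependency types first, the weak Wants= last.
    for key in ("Requires", "BindsTo", "Wants"):
        if unit in props.get(key, []):
            return key
    return None


# Hypothetical sample, written by hand for illustration only.
SAMPLE = "Requires=system.slice sysinit.target\nWants=mom-vdsm.service\n"
print(dependency_kind(SAMPLE, "mom-vdsm.service"))
```

With Wants=, systemd pulls the wanted unit in when the wanting unit starts,
but a failure of the wanted unit does not stop or fail the wanting one, which
is why a mom-vdsm failure alone should not take vdsmd down.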
Looking at /var/log/messages, both the mom-vdsm and vdsmd services were
restarted:
Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM
instance configured for VDSM purposes...
...
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM
instance configured for VDSM purposes.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit
mom-vdsm.service entered failed state.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
failed.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopping Virtual
Desktop Server Manager...
...
Jul 23 05:24:29 lago-basic-suite-master-host-1 systemd: Stopped Virtual
Desktop Server Manager.
...
Jul 23 05:25:26 lago-basic-suite-master-host-1 systemd: Starting Virtual
Desktop Server Manager...
...
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started Virtual
Desktop Server Manager.
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started MOM
instance configured for VDSM purposes.
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Starting MOM
instance configured for VDSM purposes...
...
Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Started MOM
instance configured for VDSM purposes.
Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Starting MOM
instance configured for VDSM purposes...
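To line up the stop/start order without scrolling through the whole file,
something like the following Python 3 sketch could pull the systemd
start/stop events out of /var/log/messages. The regular expression and the
function name are my own assumptions, based only on the line format quoted
above, with each event on a single unwrapped line.

```python
import re

# Matches lines such as:
#   Jul 23 05:24:29 <host> systemd: Stopped Virtual Desktop Server Manager.
EVENT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s+\S+\s+systemd:\s+"
    r"(?P<action>Stopping|Stopped|Starting|Started)\s+"
    r"(?P<desc>.+?)(?:\.\.\.|\.)?$"
)


def service_events(lines):
    """Return (timestamp, action, description) for each systemd start/stop line."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(
                (match.group("stamp"), match.group("action"), match.group("desc"))
            )
    return events


# Lines taken from the excerpt above, joined back into single lines.
SAMPLE_LINES = [
    "Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: "
    "Stopping MOM instance configured for VDSM purposes...",
    "Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: "
    "mom-vdsm.service failed.",
    "Jul 23 05:24:29 lago-basic-suite-master-host-1 systemd: "
    "Stopped Virtual Desktop Server Manager.",
]

for event in service_events(SAMPLE_LINES):
    print(event)
```

Lines that do not describe a start or stop, such as the "mom-vdsm.service
failed." one, are skipped on purpose, so the result is only the ordering of
the service transitions.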
The error in 008_basic_ui_sanity.py.junit.xml probably means that the
docker executable was not found on the machine running the test. Could that
be the cause of the failure?
<error type="exceptions.OSError"
message="[Errno 2] No such file or directory
-------------------- >> begin captured stdout << ---------------------
executing shell: docker ps
--------------------- >> end captured stdout << ---------------
  File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
    testMethod()
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test
    test()
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 169, in start_grid
    _docker_cleanup()
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 136, in _docker_cleanup
    _shell(["docker", "ps"])
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 119, in _shell
    stderr=subprocess.PIPE)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
[Errno 2] No such file or directory
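
If the missing executable turns out to be the problem, the test helper could
check for it before spawning anything. A minimal sketch, assuming Python 3
(shutil.which does not exist in the Python 2.7 shown in the traceback); the
name run_if_available is hypothetical and this is not the actual _shell
helper from 008_basic_ui_sanity.py:

```python
import shutil
import subprocess


def run_if_available(argv):
    """Run argv and return (returncode, stdout, stderr).

    Returns None instead of raising OSError when the executable named by
    argv[0] cannot be found on PATH.
    """
    if shutil.which(argv[0]) is None:
        return None
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.returncode, proc.stdout, proc.stderr


result = run_if_available(["docker", "ps"])
if result is None:
    print("docker executable not found on PATH")
```

That would turn the bare "[Errno 2] No such file or directory" into a message
that names the missing tool.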
Andrej
>
>
> On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <jenkins(a)ovirt.org>
> wrote:
>
>> Change 92882,9 (ovirt-engine) is probably the reason behind recent
>> system test
>> failures in the "ovirt-master" change queue and needs to be fixed.
>>
>> This change had been removed from the testing queue. Artifacts build
>> from this
>> change will not be released until it is fixed.
>>
>> For further details about the change see:
>> https://gerrit.ovirt.org/#/c/92882/9
>>
>> For failed test results see:
>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
>> _______________________________________________
>> Infra mailing list -- infra(a)ovirt.org
>> To unsubscribe send an email to infra-leave(a)ovirt.org
>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>> oVirt Code of Conduct:
>> https://www.ovirt.org/community/about/community-guidelines/
>> List Archives:
>> https://lists.ovirt.org/archives/list/infra@ovirt.org/message/6LYYXSGM4LQ...
>>
>
>
--
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.