On Tue, Jul 24, 2018 at 10:53 AM Andrej Krejcir <akrejcir(a)redhat.com> wrote:
On Mon, 23 Jul 2018 at 15:03, Martin Perina <mperina(a)redhat.com> wrote:
>
>
> On Mon, Jul 23, 2018 at 1:32 PM, Dafna Ron <dron(a)redhat.com> wrote:
>
>> Hi,
>>
>> the issue seems to be that host-1 stopped responding, and I can see some
>> fluentd errors which we should look at.
>>
>> Jira opened to track this issue:
>>
>> https://ovirt-jira.atlassian.net/browse/OVIRT-2363
>>
>> Martin, I also added you to the Jira - can you please have a look?
>>
>> error from node-1 messages log:
>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:14 -0400 [warn]: detached forwarding server
>> 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine"
>> port=24224 phi=16.275347714068506
>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd:
>> ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
>> Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:14 -0400 fluent.warn:
>> {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached
>> forwarding server 'lago-basic-suite-master-engine'
>> host=\"lago-basic-suite-master-engine\" port=24224
>> phi=16.275347714068506"}
>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:15 -0400 [warn]: detached forwarding server
>> 'lago-basic-suite-master-engine' host="lago-basic-suite-master-engine"
>> port=24224 phi=16.70444149784817
>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd:
>> ["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
>> "lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
>> Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:15 -0400 fluent.warn:
>> {"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached
>> forwarding server 'lago-basic-suite-master-engine'
>> host=\"lago-basic-suite-master-engine\" port=24224
>> phi=16.70444149784817"}
>> Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command
>> Invoked with warn=False executable=None _uses_shell=False
>> _raw_params=systemctl is-active 'collectd' removes=None argv=None
>> creates=None chdir=None stdin=None
>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New
>> session 29 of user root.
>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session
>> 29 of user root.
>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session
>> 29 of user root.
>> Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed
>> session 29.
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]: failed to flush the buffer.
>> error_class="RuntimeError" error="no nodes are available"
>> plugin_id="object:151a620"
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]: retry count exceededs limit.
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in
>> `write_objects'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in
>> `write_chunk'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [warn]:
>> /usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 [error]: throwing away old logs.
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes
>> are available","plugin_id":"object:151a620","message":"failed to flush the
>> buffer. error_class=\"RuntimeError\" error=\"no nodes are available\"
>> plugin_id=\"object:151a620\""}
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 fluent.warn: {"message":"retry count exceededs limit."}
>> Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23
>> 05:09:27 -0400 fluent.error: {"message":"throwing away old logs."}
>>
>>
>>
>> Thanks.
>> Dafna
>>
>
> Hi,
>
> I can see in vdsm.log that it received a kill signal:
>
> 2018-07-23 05:24:26,735-0400 INFO (MainThread) [vds] Received signal 15,
> shutting down (vdsmd:68)
>
> And in /var/log/messages I found that mom was killed:
>
> Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM
> instance configured for VDSM purposes...
>
> ...
>
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
> stop-sigterm timed out. Killing.
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service:
> main process exited, code=killed, status=9/KILL
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM
> instance configured for VDSM purposes.
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit
> mom-vdsm.service entered failed state.
> Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
> failed.
>
> So Didi/Shirly/Martin, can the fluentd error be related to the mom
> shutdown? And could this be the cause of the VDSM shutdown?
>
Hi,
Mom is not related to fluentd, and a mom shutdown should not cause a vdsm
shutdown.
The service dependency between vdsmd and mom-vdsm is weak (using
Wants=mom-vdsm.service).
Looking at /var/log/messages, both the mom-vdsm and vdsmd services were
restarted:
Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM
instance configured for VDSM purposes...
...
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM
instance configured for VDSM purposes.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Unit
mom-vdsm.service entered failed state.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: mom-vdsm.service
failed.
Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopping Virtual
Desktop Server Manager...
...
Jul 23 05:24:29 lago-basic-suite-master-host-1 systemd: Stopped Virtual
Desktop Server Manager.
...
Jul 23 05:25:26 lago-basic-suite-master-host-1 systemd: Starting Virtual
Desktop Server Manager...
...
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started Virtual
Desktop Server Manager.
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started MOM
instance configured for VDSM purposes.
Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Starting MOM
instance configured for VDSM purposes...
...
Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Started MOM
instance configured for VDSM purposes.
Jul 23 05:25:34 lago-basic-suite-master-host-1 systemd: Starting MOM
instance configured for VDSM purposes...
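
To make the timeline above easier to check, here is a minimal Python sketch (not part of the original analysis) that parses a few of the systemd lines quoted above and measures how long each stop took. The names `parse_events` and `stop_durations` are hypothetical, the sample lines are copied from the excerpt with the wrapped halves rejoined, and the year is assumed to be 2018 because syslog timestamps omit it.

```python
import re
from datetime import datetime

# Sample lines copied from the /var/log/messages excerpt above (rejoined).
LOG_LINES = [
    "Jul 23 05:24:16 lago-basic-suite-master-host-1 systemd: Stopping MOM instance configured for VDSM purposes...",
    "Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopped MOM instance configured for VDSM purposes.",
    "Jul 23 05:24:26 lago-basic-suite-master-host-1 systemd: Stopping Virtual Desktop Server Manager...",
    "Jul 23 05:24:29 lago-basic-suite-master-host-1 systemd: Stopped Virtual Desktop Server Manager.",
    "Jul 23 05:25:26 lago-basic-suite-master-host-1 systemd: Starting Virtual Desktop Server Manager...",
    "Jul 23 05:25:29 lago-basic-suite-master-host-1 systemd: Started Virtual Desktop Server Manager.",
]

# Timestamp, host, then "systemd: <action> <unit description>" with trailing dots.
PATTERN = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ systemd: "
    r"(?P<action>Stopping|Stopped|Starting|Started) (?P<unit>.+?)\.*$"
)


def parse_events(lines, year=2018):
    """Return (timestamp, action, unit description) for each matching line."""
    events = []
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            continue  # not a systemd start/stop message
        stamp = "%d %s" % (year, match.group("ts"))
        ts = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
        events.append((ts, match.group("action"), match.group("unit")))
    return events


def stop_durations(events):
    """Seconds between 'Stopping' and 'Stopped' for each unit description."""
    pending = {}
    durations = {}
    for ts, action, unit in events:
        if action == "Stopping":
            pending[unit] = ts
        elif action == "Stopped" and unit in pending:
            durations[unit] = (ts - pending.pop(unit)).total_seconds()
    return durations


if __name__ == "__main__":
    for unit, seconds in sorted(stop_durations(parse_events(LOG_LINES)).items()):
        print("%s: stopped after %.0f s" % (unit, seconds))
```

On these sample lines the MOM stop takes 10 seconds, which lines up with the "stop-sigterm timed out. Killing." message quoted earlier, while vdsmd stops in 3 seconds.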
The error in 008_basic_ui_sanity.py.junit.xml probably means that the
docker executable was not found on the machine running the test. Could
that be the cause of the failure?
<error type="exceptions.OSError"
message="[Errno 2] No such file or directory
-------------------- >> begin captured stdout << ---------------------
executing shell: docker ps
--------------------- >> end captured stdout << ---------------
  File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
    testMethod()
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 129, in wrapped_test
    test()
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 169, in start_grid
    _docker_cleanup()
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 136, in _docker_cleanup
    _shell(["docker", "ps"])
  File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/008_basic_ui_sanity.py", line 119, in _shell
    stderr=subprocess.PIPE)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
[Errno 2] No such file or directory
Yep, looks like docker isn't installed. And yes that would fail it. Any
recent changes? I know Gal is working on some containerization of this [1],
but I don't know what's been merged.
[1] Change I5af15dce: Adjust UI test to run inside STDCI container
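
For reference, the `[Errno 2]` in the traceback is what `subprocess.Popen` raises when the executable itself cannot be found. Below is a minimal sketch of how a helper could tell "docker is missing" apart from other failures. The function name `run_if_available` is hypothetical, and this is not the actual `_shell` helper from 008_basic_ui_sanity.py, whose body is not shown in this thread.

```python
import errno
import subprocess


def run_if_available(argv):
    """Run argv and return (returncode, stdout, stderr).

    Returns None when the executable cannot be found, instead of letting
    the OSError propagate as in the traceback above.
    """
    try:
        proc = subprocess.Popen(
            argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
    except OSError as exc:
        # ENOENT here means the program named in argv[0] was not found.
        if exc.errno == errno.ENOENT:
            return None
        raise  # any other OS-level failure is still an error
    out, err = proc.communicate()
    return proc.returncode, out, err


if __name__ == "__main__":
    result = run_if_available(["docker", "ps"])
    if result is None:
        print("docker executable not found; skipping docker cleanup")
    else:
        print("docker ps exited with code %d" % result[0])
```

Whether the test should skip the cleanup or fail with a clearer message when docker is absent is a separate decision; the sketch only shows how the two cases can be distinguished.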
Andrej
>>
>>
>> On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <jenkins(a)ovirt.org>
>> wrote:
>>
>>> Change 92882,9 (ovirt-engine) is probably the reason behind recent
>>> system test failures in the "ovirt-master" change queue and needs to
>>> be fixed.
>>>
>>> This change has been removed from the testing queue. Artifacts built
>>> from this change will not be released until it is fixed.
>>>
>>> For further details about the change see:
>>> https://gerrit.ovirt.org/#/c/92882/9
>>>
>>> For failed test results see:
>>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
>>> _______________________________________________
>>> Infra mailing list -- infra(a)ovirt.org
>>> To unsubscribe send an email to infra-leave(a)ovirt.org
>>> Privacy Statement:
https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct:
>>>
https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>>
https://lists.ovirt.org/archives/list/infra@ovirt.org/message/6LYYXSGM4LQ...
>>>
>>
>>
>
>
> --
> Martin Perina
> Associate Manager, Software Engineering
> Red Hat Czech s.r.o.
--
GREG SHEREMETA
SENIOR SOFTWARE ENGINEER - TEAM LEAD - RHV UX
Red Hat NA