[ovirt-devel] [ OST Failure Report ] [ oVirt master ] [ 2017-08-30 ] [add_hosts]

Martin Perina mperina at redhat.com
Wed Aug 30 19:20:25 UTC 2017


On Wed, Aug 30, 2017 at 8:43 PM, Dan Kenigsberg <danken at redhat.com> wrote:

> On Wed, Aug 30, 2017 at 6:47 PM, Dominik Holler <dholler at redhat.com>
> wrote:
> > On Wed, 30 Aug 2017 13:31:38 +0200
> > Dominik Holler <dholler at redhat.com> wrote:
> >
> >> On Wed, 30 Aug 2017 14:18:49 +0300
> >> Dan Kenigsberg <danken at redhat.com> wrote:
> >>
> >> > On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren <bkorren at redhat.com>
> >> > wrote:
> >> > > Test failed: [ 002_bootstrap.add_hosts ]
> >> > >
> >> > > Link to suspected patches:
> >> > >
> >> > > We suspect this is due to change to LLDPAD in upstream CentOS
> >> > > repos. We can't tell the exact point it was introduced because
> >> > > other ovirt-engine regressions introduced too much noise into the
> >> > > system.
> >> > >
> >> > > It also seems that the failure is not 100% reproducible since we
> >> > > have runs that do not encounter it.
> >> > >
> >> > > Link to Job:
> >> > > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151
> >> > >
> >> > > (This job checked a specific patch to ovirt-log-collector but
> >> > > failure seems unrelated, and also happens on vanilla 'tested' repo
> >> > > right now).
> >> > >
> >> > > Link to all logs:
> >> > > VDSM logs seem to be most relevant here:
> >> > >
> >> > > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/
> >> > >
> >> > > Error snippet from log:
> >> > >
> >> > > - From VDSM logs (note it exits very quickly):
> >> > >
> >> > > <error>
> >> > >
> >> > > 2017-08-30 05:55:52,208-0400 INFO  (MainThread) [vds] (PID: 4594)
> >> > > I am the actual vdsm 4.20.2-133.git0ce4485.el7.centos
> >> > > lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64)
> >> > > (vdsmd:148)
> >> > > 2017-08-30 05:55:52,209-0400 INFO  (MainThread) [vds] VDSM will run
> >> > > with cpu affinity: frozenset([1]) (vdsmd:254)
> >> > > 2017-08-30 05:55:52,253-0400 INFO  (MainThread) [storage.check]
> >> > > Starting check service (check:92)
> >> > > 2017-08-30 05:55:52,257-0400 INFO  (MainThread)
> >> > > [storage.Dispatcher] Starting StorageDispatcher... (dispatcher:48)
> >> > > 2017-08-30 05:55:52,257-0400 INFO  (check/loop)
> >> > > [storage.asyncevent] Starting <EventLoop running=True
> >> > > closed=False at 0x45317520> (asyncevent:125)
> >> > > 2017-08-30 05:55:52,288-0400 INFO  (MainThread) [vdsm.api] START
> >> > > registerDomainStateChangeCallback(callbackFunc=<functools.partial
> >> > > object at 0x2b47b50>) from=internal,
> >> > > task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:46)
> >> > > 2017-08-30 05:55:52,288-0400 INFO  (MainThread) [vdsm.api] FINISH
> >> > > registerDomainStateChangeCallback return=None from=internal,
> >> > > task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:52)
> >> > > 2017-08-30 05:55:52,288-0400 INFO  (MainThread) [MOM] Preparing
> >> > > MOM interface (momIF:53)
> >> > > 2017-08-30 05:55:52,289-0400 INFO  (MainThread) [MOM] Using named
> >> > > unix socket /var/run/vdsm/mom-vdsm.sock (momIF:62)
> >> > > 2017-08-30 05:55:52,289-0400 INFO  (MainThread) [root]
> >> > > Unregistering all secrets (secret:91)
> >> > > 2017-08-30 05:55:52,307-0400 INFO  (MainThread) [vds] Setting
> >> > > channels' timeout to 30 seconds. (vmchannels:221)
> >> > > 2017-08-30 05:55:52,309-0400 INFO  (vmrecovery) [vds] recovery:
> >> > > completed in 0s (clientIF:516)
> >> > > 2017-08-30 05:55:52,310-0400 INFO  (MainThread)
> >> > > [vds.MultiProtocolAcceptor] Listening at :::54321
> >> > > (protocoldetector:196)
> >> > > 2017-08-30 05:55:52,496-0400 INFO  (http) [vds] Server running
> >> > > (http:58)
> >> > > 2017-08-30 05:55:52,742-0400 INFO  (periodic/1)
> >> > > [vdsm.api] START repoStats(domains=()) from=internal,
> >> > > task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:46)
> >> > > 2017-08-30 05:55:52,743-0400 INFO  (periodic/1) [vdsm.api] FINISH
> >> > > repoStats return={} from=internal,
> >> > > task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:52)
> >> > > 2017-08-30 05:55:52,743-0400 WARN  (periodic/1) [throttled] MOM
> >> > > not available. (throttledlog:103)
> >> > > 2017-08-30 05:55:52,744-0400 WARN  (periodic/1) [throttled] MOM
> >> > > not available, KSM stats will be missing. (throttledlog:103)
> >> > > 2017-08-30 05:55:55,043-0400 INFO  (MainThread) [vds] Received
> >> > > signal 15, shutting down (vdsmd:67)
> >> > > 2017-08-30 05:55:55,045-0400 INFO  (MainThread)
> >> > > [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759)
> >> > > 2017-08-30 05:55:55,049-0400 INFO  (MainThread) [vds] Stopping
> >> > > http server (http:79)
> >> > > 2017-08-30 05:55:55,049-0400 INFO  (http) [vds] Server stopped
> >> > > (http:69)
> >> > > 2017-08-30 05:55:55,050-0400 INFO  (MainThread) [root]
> >> > > Unregistering all secrets (secret:91)
> >> > > 2017-08-30 05:55:55,052-0400 INFO  (MainThread) [vdsm.api] START
> >> > > prepareForShutdown(options=None) from=internal,
> >> > > task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:46)
> >> > > 2017-08-30 05:55:55,089-0400 INFO  (MainThread) [storage.Monitor]
> >> > > Shutting down domain monitors (monitor:222)
> >> > > 2017-08-30 05:55:55,090-0400 INFO  (MainThread) [storage.check]
> >> > > Stopping check service (check:105)
> >> > > 2017-08-30 05:55:55,090-0400 INFO  (check/loop)
> >> > > [storage.asyncevent] Stopping <EventLoop running=True
> >> > > closed=False at 0x45317520> (asyncevent:220)
> >> > > 2017-08-30 05:55:55,090-0400 INFO  (MainThread) [vdsm.api] FINISH
> >> > > prepareForShutdown return=None from=internal,
> >> > > task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:52)
> >> > > 2017-08-30 05:55:55,091-0400 INFO  (MainThread) [vds] Stopping
> >> > > threads (vdsmd:159)
> >> > > 2017-08-30 05:55:55,091-0400 INFO  (MainThread) [vds] Exiting
> >> > > (vdsmd:170)
> >> > >
> >> > > </error>
> >> > >
> >> > >
> >> > > - From SuperVDSM logs:
> >> > >
> >> > > <error>
> >> > >
> >> > > MainThread::ERROR::2017-08-30
> >> > > 05:55:59,476::initializer::53::root::(_lldp_init) Failed to enable
> >> > > LLDP on eth1
> >> > > Traceback (most recent call last):
> >> > >   File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init
> >> > >     Lldp.enable_lldp_on_iface(device)
> >> > >   File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface
> >> > >     lldptool.enable_lldp_on_iface(iface, rx_only)
> >> > >   File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface
> >> > >     raise EnableLldpError(rc, out, err, iface)
> >> > > EnableLldpError: (1, "timeout\n'M00000001C3040000000c04eth1000badminStatus0002rx' command timed out.\n", '', 'eth1')
> >> > > MainThread::DEBUG::2017-08-30
> >> > > 05:55:59,477::cmdutils::133::root::(exec_cmd) /usr/sbin/lldptool
> >> > > get-lldp -i eth0 adminStatus (cwd None)
> >> > >
> >> > >
> >> > > </error>
> >> >
> >> > The LLDP errors are due to SELinux, to be fixed in CentOS-7.4.1.
> >> > They do not cause supervdsmd to stop, so they are not the reason
> >> > for vdsm's failure to start. Something ELSE is doing this.
> >> >
> >> > in
> >> > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/messages/*view*/
> >> > we have
> >> >
> >> > Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm
> >> > throttled WARN MOM not available.
> >> > Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm
> >> > throttled WARN MOM not available, KSM stats will be missing.
> >> > Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started
> >> > Session 10 of user root.
> >> > Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New
> >> > session 10 of user root.
> >> > Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting
> >> > Session 10 of user root.
> >> > Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup
> >> > Invoked with filter=* gather_subset=['all']
> >> > fact_path=/etc/ansible/facts.d gather_timeout=10
> >> > Aug 30 05:55:54 lago-basic-suite-master-host-0 python:
> >> > ansible-command Invoked with warn=True executable=None
> >> > _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE
> >> > 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None
> >> > creates=None chdir=None
> >> > Aug 30 05:55:54 lago-basic-suite-master-host-0 python:
> >> > ansible-systemd Invoked with no_block=False name=libvirt-guests
> >> > enabled=True daemon_reload=False state=started user=False
> >> > masked=None
> >> > Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading.
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add
> >> > dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM
> >> > instance configured for VDSM purposes.
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping
> >> > Virtual Desktop Server Manager...
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting
> >> > Suspend Active Libvirt Guests...
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started
> >> > Suspend Active Libvirt Guests.
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt
> >> > version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem
> >> > <http://bugs.centos.org>, 2017-05-25-20:52:28, c1bm.rdu2.centos.org)
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname:
> >> > lago-basic-suite-master-host-0.lago.local
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh:
> >> > vdsm: Running run_final_hooks
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file
> >> > while reading data: Input/output error
> >> > Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped
> >> > Virtual Desktop Server Manager.
> >> >
> >> > suggesting that systemd stopped vdsmd since mom did not start on
> >> > time. We have an extremely ugly two-way dependency between the two
> >> > services. Each depends on the other one. I am guessing that
> >> > systemd's grace period for such a cycle has expired, and the couple
> >> > was taken down.
> >> >
> >> > Does anybody have a guess what could cause such slowness?
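The failure mode described above, systemd breaking a mutual dependency between two units by taking both down, boils down to cycle detection in the unit dependency graph. As a toy illustration only (this is not systemd's actual algorithm, and the unit names below are illustrative stand-ins for the vdsm/mom pair):

```python
def find_cycle(deps):
    """Return one dependency cycle in a unit graph, or None.

    deps maps a unit name to the list of units it requires.
    Plain depth-first search with a recursion stack.
    """
    visiting, done = set(), set()
    stack = []

    def dfs(unit):
        visiting.add(unit)
        stack.append(unit)
        for dep in deps.get(unit, ()):
            if dep in visiting:
                # Found a back edge: slice out the cycle.
                return stack[stack.index(dep):] + [dep]
            if dep not in done:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        visiting.discard(unit)
        done.add(unit)
        stack.pop()
        return None

    for unit in deps:
        if unit not in done:
            cycle = dfs(unit)
            if cycle:
                return cycle
    return None
```

With a two-way vdsm/mom dependency, `find_cycle` would return the two units; systemd, when it hits such a cycle, has to drop a job (or, as suspected here, the whole couple) to make the transaction solvable.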
> >>
> >> Even if the LLDP errors do not have a negative impact on the
> >> functional behavior of supervdsmd, they do have an impact on the
> >> timing behavior of the initialization of supervdsmd.
> >>
> >> I will post a patch today which moves the initialization of LLDP into
> >> an extra thread, to make the initialization of LLDP asynchronous.
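The idea of that patch, moving LLDP setup off the main initialization path so a slow or timing-out lldptool call cannot delay supervdsmd startup, can be sketched like this (a sketch only, with illustrative function names, not the actual gerrit change):

```python
import logging
import threading


def _lldp_init(devices, enable_fn):
    # Try to enable LLDP on every device; log failures rather than
    # letting them abort or stall initialization.
    for device in devices:
        try:
            enable_fn(device)
        except Exception:
            logging.exception('Failed to enable LLDP on %s', device)


def init_lldp_async(devices, enable_fn):
    # Run LLDP setup in a daemon thread so that a timing-out lldptool
    # call cannot delay the rest of the daemon's startup sequence.
    thread = threading.Thread(target=_lldp_init,
                              args=(devices, enable_fn),
                              name='lldp-init')
    thread.daemon = True
    thread.start()
    return thread
```

A caller would pass the real enable function (e.g. the lldptool wrapper); anything that consumes LLDP state then has to tolerate it becoming available only after startup.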
> >>
> >>
> >
> > I analyzed this issue further by running OST [1] while disabling [2] the
> > suspected initialization of LLDP in supervdsmd.
> > As expected, there are no log entries about enabling LLDP in [3],
> > but OST is still failing with the same symptoms.
> >
> > So the reason for this failure is still unknown.
> >
> >
> > [1]
> >   http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/1045/
> >
> > [2]
> >   https://gerrit.ovirt.org/#/c/81232/1/lib/vdsm/network/initializer.py
> >
> > [3]
> >   http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/1045/artifact/exported-artifacts/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/supervdsm.log/*view*/
> >
>
> So we're back to square one.
> Another possible culprit may be Ansible: VDSM is stopped two seconds
> after Ansible logs in to the host.
>
> Aug 30 11:26:24 lago-basic-suite-master-host-0 systemd: Starting
> Session 10 of user root.
> Aug 30 11:26:25 lago-basic-suite-master-host-0 python: ansible-setup
> Invoked with filter=* gather_subset=['all']
> fact_path=/etc/ansible/facts.d gather_timeout=10
> Aug 30 11:26:25 lago-basic-suite-master-host-0 python: ansible-command
> Invoked with warn=True executable=None _uses_shell=False
> _raw_params=bash -c "rpm -qi vdsm | grep -oE
> 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None
> creates=None chdir=None
> Aug 30 11:26:26 lago-basic-suite-master-host-0 python: ansible-systemd
> Invoked with no_block=False name=libvirt-guests enabled=True
> daemon_reload=False state=started user=False masked=None
> Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Reloading.
> Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Cannot add
> dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
> Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Stopped MOM
> instance configured for VDSM purposes.
> Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Stopping
> Virtual Desktop Server Manager...
>
>
> could it be that it triggers a systemd-reload that makes systemd croak
> on the vdsm-mom cycle?
>

We are not restarting VDSM within the ovirt-host-deploy Ansible role; the
VDSM restart is performed in the host-deploy part, the same as in previous
versions.

Within ovirt-host-deploy-firewalld we only enable and restart the firewalld
service.