On Wed, Aug 30, 2017 at 2:31 PM, Dominik Holler <dholler(a)redhat.com> wrote:
On Wed, 30 Aug 2017 14:18:49 +0300
Dan Kenigsberg <danken(a)redhat.com> wrote:
> On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren <bkorren(a)redhat.com>
> wrote:
> > Test failed: [ 002_bootstrap.add_hosts ]
> >
> > Link to suspected patches:
> >
> > We suspect this is due to change to LLDPAD in upstream CentOS repos.
> > We can't tell the exact point it was introduced because other
> > ovirt-engine regressions introduced too much noise into the system.
> >
> > It also seems that the failure is not 100% reproducible sine we have
> > runs that do not encounter it.
> >
> > Link to Job:
> >
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151
> >
> > (This job checked a specific patch to ovirt-log-collector but
> > failure seems unrelated, and also happens on vanille 'tested' repo
> > right now).
> >
> > Link to all logs:
> > VDSM logs seem to be most relevant here:
> >
> >
http://jenkins.ovirt.org/job/ovirt-master_change-queue-
tester/2151/artifact/exported-artifacts/basic-suit-master-
el7/test_logs/basic-suite-master/post-002_bootstrap.py/
lago-basic-suite-master-host-0/_var_log/vdsm/
> >
> > Error snippet from log:
> >
> > - From VDSM logs (note it exists very quickly):
> >
> > <error>
> >
> > 2017-08-30 05:55:52,208-0400 INFO (MainThread) [vds] (PID: 4594) I
> > am the actual vdsm 4.20.2-133.git0ce4485.el7.centos
> > lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64)
> > (vdsmd:148) 2017-08-30 05:55:52,209-0400 INFO (MainThread) [vds]
> > VDSM will run with cpu affinity: frozenset([1]) (vdsmd:254)
> > 2017-08-30 05:55:52,253-0400 INFO (MainThread) [storage.check]
> > Starting check service (check:92)
> > 2017-08-30 05:55:52,257-0400 INFO (MainThread) [storage.Dispatcher]
> > Starting StorageDispatcher... (dispatcher:48)
> > 2017-08-30 05:55:52,257-0400 INFO (check/loop) [storage.asyncevent]
> > Starting <EventLoop running=True closed=False at 0x45317520>
> > (asyncevent:125)
> > 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] START
> > registerDomainStateChangeCallback(callbackFunc=<functools.partial
> > object at 0x2b47b50>) from=internal,
> > task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:46)
> > 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] FINISH
> > registerDomainStateChangeCallback return=None from=internal,
> > task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:52)
> > 2017-08-30 05:55:52,288-0400 INFO (MainThread) [MOM] Preparing MOM
> > interface (momIF:53)
> > 2017-08-30 05:55:52,289-0400 INFO (MainThread) [MOM] Using named
> > unix socket /var/run/vdsm/mom-vdsm.sock (momIF:62)
> > 2017-08-30 05:55:52,289-0400 INFO (MainThread) [root] Unregistering
> > all secrets (secret:91)
> > 2017-08-30 05:55:52,307-0400 INFO (MainThread) [vds] Setting
> > channels' timeout to 30 seconds. (vmchannels:221)
> > 2017-08-30 05:55:52,309-0400 INFO (vmrecovery) [vds] recovery:
> > completed in 0s (clientIF:516)
> > 2017-08-30 05:55:52,310-0400 INFO (MainThread)
> > [vds.MultiProtocolAcceptor] Listening at :::54321
> > (protocoldetector:196)
> > 2017-08-30 05:55:52,496-0400 INFO (http) [vds] Server running
> > (http:58) 2017-08-30 05:55:52,742-0400 INFO (periodic/1)
> > [vdsm.api] START repoStats(domains=()) from=internal,
> > task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:46)
> > 2017-08-30 05:55:52,743-0400 INFO (periodic/1) [vdsm.api] FINISH
> > repoStats return={} from=internal,
> > task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:52)
> > 2017-08-30 05:55:52,743-0400 WARN (periodic/1) [throttled] MOM not
> > available. (throttledlog:103)
> > 2017-08-30 05:55:52,744-0400 WARN (periodic/1) [throttled] MOM not
> > available, KSM stats will be missing. (throttledlog:103)
> > 2017-08-30 05:55:55,043-0400 INFO (MainThread) [vds] Received
> > signal 15, shutting down (vdsmd:67)
> > 2017-08-30 05:55:55,045-0400 INFO (MainThread)
> > [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759)
> > 2017-08-30 05:55:55,049-0400 INFO (MainThread) [vds] Stopping http
> > server (http:79)
> > 2017-08-30 05:55:55,049-0400 INFO (http) [vds] Server stopped
> > (http:69) 2017-08-30 05:55:55,050-0400 INFO (MainThread) [root]
> > Unregistering all secrets (secret:91)
> > 2017-08-30 05:55:55,052-0400 INFO (MainThread) [vdsm.api] START
> > prepareForShutdown(options=None) from=internal,
> > task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:46)
> > 2017-08-30 05:55:55,089-0400 INFO (MainThread) [storage.Monitor]
> > Shutting down domain monitors (monitor:222)
> > 2017-08-30 05:55:55,090-0400 INFO (MainThread) [storage.check]
> > Stopping check service (check:105)
> > 2017-08-30 05:55:55,090-0400 INFO (check/loop) [storage.asyncevent]
> > Stopping <EventLoop running=True closed=False at 0x45317520>
> > (asyncevent:220)
> > 2017-08-30 05:55:55,090-0400 INFO (MainThread) [vdsm.api] FINISH
> > prepareForShutdown return=None from=internal,
> > task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:52)
> > 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Stopping
> > threads (vdsmd:159)
> > 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Exiting
> > (vdsmd:170)
> >
> > </error>
> >
> >
> > - From SuperVDSM logs:
> >
> > <error>
> >
> > MainThread::ERROR::2017-08-30
> > 05:55:59,476::initializer::53::root::(_lldp_init) Failed to enable
> > LLDP on eth1
> > Traceback (most recent call last):
> > File
> > "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py",
> > line 51, in _lldp_init Lldp.enable_lldp_on_iface(device)
> > File
> > "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py",
> > line 30, in enable_lldp_on_iface
> > lldptool.enable_lldp_on_iface(iface, rx_only) File
> > "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py",
> > line 46, in enable_lldp_on_iface raise EnableLldpError(rc, out,
> > err, iface) EnableLldpError: (1,
> > "timeout\n'M00000001C3040000000c04eth1000badminStatus0002rx'
command
> > timed out.\n", '', 'eth1')
> > MainThread::DEBUG::2017-08-30
> > 05:55:59,477::cmdutils::133::root::(exec_cmd) /usr/sbin/lldptool
> > get-lldp -i eth0 adminStatus (cwd None)
> >
> >
> > </error>
>
> The LLDP errors are due to SELinux, to be fixed in CentOS-7.4.1. They
> do not cause supervdsmd to stop, so they are not the reason for vdsm's
> failure to start. Something ELSE is doing this.
>
> in
>
http://jenkins.ovirt.org/job/ovirt-master_change-queue-
tester/2151/artifact/exported-artifacts/basic-suit-master-
el7/test_logs/basic-suite-master/post-002_bootstrap.py/
lago-basic-suite-master-host-0/_var_log/messages/*view*/
> we have
>
> Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled
> WARN MOM not available.
> Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled
> WARN MOM not available, KSM stats will be missing.
> Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started
> Session 10 of user root.
> Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New
> session 10 of user root.
> Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting
> Session 10 of user root.
> Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup
> Invoked with filter=* gather_subset=['all']
> fact_path=/etc/ansible/facts.d gather_timeout=10
> Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command
> Invoked with warn=True executable=None _uses_shell=False
> _raw_params=bash -c "rpm -qi vdsm | grep -oE
> 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None
> creates=None chdir=None
> Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd
> Invoked with no_block=False name=libvirt-guests enabled=True
> daemon_reload=False state=started user=False masked=None
> Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading.
> Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add
> dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
> Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM
> instance configured for VDSM purposes.
> Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping
> Virtual Desktop Server Manager...
> Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting
> Suspend Active Libvirt Guests...
> Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started
> Suspend Active Libvirt Guests.
> Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt
> version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem
> <
http://bugs.centos.org>, 2017-05-25-20:52:28,
c1bm.rdu2.centos.org)
> Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname:
> lago-basic-suite-master-host-0.lago.local
> Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh:
> vdsm: Running run_final_hooks
> Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file
> while reading data: Input/output error
> Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped
> Virtual Desktop Server Manager.
>
> suggesting that systemd stopped vdsmd since mom did not start on time.
> We have an extremely ugly two-way dependency between the two services.
> Each depends on the other one. I am guessing that systemd's grace
> period for such a cycle has expired, and the couple was taken down.
>
> Does anybody have a guess what could cause such slowlness?
Even if the LLDP errors does not have a negative impact of the
functional behavior of supervdsmd, they have a impact timing behavior
of the initialization of supervdsmd.
I will post a patch today which moves the initialization of LLDP in an
extra thread, to make the initialization of LLDP asynchronly.
I'm not sure if this is possible. The LLDP init is processed during the
initialization
step of supervdsm service and we depend on it to execute prior to enter the
running
stage.
If we will use a thread in this case, the running stage will start but one
of the inits
may not end.
Perhaps just do some specific checks on init to make it faster. (like check
if we
have the selinux issue and break early)