[ovirt-devel] [ OST Failure Report ] [ oVirt master ] [ 2017-08-30 ] [add_hosts]

Dan Kenigsberg danken at redhat.com
Wed Aug 30 11:18:49 UTC 2017


On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren <bkorren at redhat.com> wrote:
> Test failed: [ 002_bootstrap.add_hosts ]
>
> Link to suspected patches:
>
> We suspect this is due to change to LLDPAD in upstream CentOS repos.
> We can't tell the exact point it was introduced because other
> ovirt-engine regressions introduced too much noise into the system.
>
> It also seems that the failure is not 100% reproducible, since we have
> runs that do not encounter it.
>
> Link to Job:
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151
>
> (This job checked a specific patch to ovirt-log-collector, but the failure
> seems unrelated and also happens on the vanilla 'tested' repo right now.)
>
> Link to all logs:
> VDSM logs seem to be most relevant here:
>
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/
>
> Error snippet from log:
>
> - From VDSM logs (note it exits very quickly):
>
> <error>
>
> 2017-08-30 05:55:52,208-0400 INFO  (MainThread) [vds] (PID: 4594) I am
> the actual vdsm 4.20.2-133.git0ce4485.el7.centos
> lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64) (vdsmd:148)
> 2017-08-30 05:55:52,209-0400 INFO  (MainThread) [vds] VDSM will run
> with cpu affinity: frozenset([1]) (vdsmd:254)
> 2017-08-30 05:55:52,253-0400 INFO  (MainThread) [storage.check]
> Starting check service (check:92)
> 2017-08-30 05:55:52,257-0400 INFO  (MainThread) [storage.Dispatcher]
> Starting StorageDispatcher... (dispatcher:48)
> 2017-08-30 05:55:52,257-0400 INFO  (check/loop) [storage.asyncevent]
> Starting <EventLoop running=True closed=False at 0x45317520>
> (asyncevent:125)
> 2017-08-30 05:55:52,288-0400 INFO  (MainThread) [vdsm.api] START
> registerDomainStateChangeCallback(callbackFunc=<functools.partial
> object at 0x2b47b50>) from=internal,
> task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:46)
> 2017-08-30 05:55:52,288-0400 INFO  (MainThread) [vdsm.api] FINISH
> registerDomainStateChangeCallback return=None from=internal,
> task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:52)
> 2017-08-30 05:55:52,288-0400 INFO  (MainThread) [MOM] Preparing MOM
> interface (momIF:53)
> 2017-08-30 05:55:52,289-0400 INFO  (MainThread) [MOM] Using named unix
> socket /var/run/vdsm/mom-vdsm.sock (momIF:62)
> 2017-08-30 05:55:52,289-0400 INFO  (MainThread) [root] Unregistering
> all secrets (secret:91)
> 2017-08-30 05:55:52,307-0400 INFO  (MainThread) [vds] Setting
> channels' timeout to 30 seconds. (vmchannels:221)
> 2017-08-30 05:55:52,309-0400 INFO  (vmrecovery) [vds] recovery:
> completed in 0s (clientIF:516)
> 2017-08-30 05:55:52,310-0400 INFO  (MainThread)
> [vds.MultiProtocolAcceptor] Listening at :::54321
> (protocoldetector:196)
> 2017-08-30 05:55:52,496-0400 INFO  (http) [vds] Server running (http:58)
> 2017-08-30 05:55:52,742-0400 INFO  (periodic/1) [vdsm.api] START
> repoStats(domains=()) from=internal,
> task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:46)
> 2017-08-30 05:55:52,743-0400 INFO  (periodic/1) [vdsm.api] FINISH
> repoStats return={} from=internal,
> task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:52)
> 2017-08-30 05:55:52,743-0400 WARN  (periodic/1) [throttled] MOM not
> available. (throttledlog:103)
> 2017-08-30 05:55:52,744-0400 WARN  (periodic/1) [throttled] MOM not
> available, KSM stats will be missing. (throttledlog:103)
> 2017-08-30 05:55:55,043-0400 INFO  (MainThread) [vds] Received signal
> 15, shutting down (vdsmd:67)
> 2017-08-30 05:55:55,045-0400 INFO  (MainThread)
> [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759)
> 2017-08-30 05:55:55,049-0400 INFO  (MainThread) [vds] Stopping http
> server (http:79)
> 2017-08-30 05:55:55,049-0400 INFO  (http) [vds] Server stopped (http:69)
> 2017-08-30 05:55:55,050-0400 INFO  (MainThread) [root] Unregistering
> all secrets (secret:91)
> 2017-08-30 05:55:55,052-0400 INFO  (MainThread) [vdsm.api] START
> prepareForShutdown(options=None) from=internal,
> task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:46)
> 2017-08-30 05:55:55,089-0400 INFO  (MainThread) [storage.Monitor]
> Shutting down domain monitors (monitor:222)
> 2017-08-30 05:55:55,090-0400 INFO  (MainThread) [storage.check]
> Stopping check service (check:105)
> 2017-08-30 05:55:55,090-0400 INFO  (check/loop) [storage.asyncevent]
> Stopping <EventLoop running=True closed=False at 0x45317520>
> (asyncevent:220)
> 2017-08-30 05:55:55,090-0400 INFO  (MainThread) [vdsm.api] FINISH
> prepareForShutdown return=None from=internal,
> task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:52)
> 2017-08-30 05:55:55,091-0400 INFO  (MainThread) [vds] Stopping threads
> (vdsmd:159)
> 2017-08-30 05:55:55,091-0400 INFO  (MainThread) [vds] Exiting (vdsmd:170)
>
> </error>
>
>
> - From SuperVDSM logs:
>
> <error>
>
> MainThread::ERROR::2017-08-30
> 05:55:59,476::initializer::53::root::(_lldp_init) Failed to enable
> LLDP on eth1
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py",
> line 51, in _lldp_init
>     Lldp.enable_lldp_on_iface(device)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py",
> line 30, in enable_lldp_on_iface
>     lldptool.enable_lldp_on_iface(iface, rx_only)
>   File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py",
> line 46, in enable_lldp_on_iface
>     raise EnableLldpError(rc, out, err, iface)
> EnableLldpError: (1,
> "timeout\n'M00000001C3040000000c04eth1000badminStatus0002rx' command
> timed out.\n", '', 'eth1')
> MainThread::DEBUG::2017-08-30
> 05:55:59,477::cmdutils::133::root::(exec_cmd) /usr/sbin/lldptool
> get-lldp -i eth0 adminStatus (cwd None)
>
>
> </error>

The LLDP errors are due to SELinux, to be fixed in CentOS-7.4.1. They
do not cause supervdsmd to stop, so they are not the reason for vdsm's
failure to start. Something ELSE is doing this.
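For anyone who wants to verify the SELinux theory on an affected host, a quick check could look like the following (a diagnostic sketch, assuming auditd is running and you have root; the lldptool query is the same one supervdsmd runs in the log above):

```shell
# Look for recent AVC denials involving lldpad (requires auditd):
ausearch -m avc -ts recent 2>/dev/null | grep -i lldp

# Confirm by retrying the same query with SELinux temporarily
# permissive (diagnostic only -- re-enable enforcing afterwards):
setenforce 0
lldptool get-lldp -i eth1 adminStatus
setenforce 1
```

If the query succeeds in permissive mode but times out in enforcing mode, that points at SELinux rather than lldpad itself.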

In
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/messages/*view*/
we have:

Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled
WARN MOM not available.
Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled
WARN MOM not available, KSM stats will be missing.
Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started
Session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New
session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting
Session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup
Invoked with filter=* gather_subset=['all']
fact_path=/etc/ansible/facts.d gather_timeout=10
Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command
Invoked with warn=True executable=None _uses_shell=False
_raw_params=bash -c "rpm -qi vdsm | grep -oE
'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None
creates=None chdir=None
Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd
Invoked with no_block=False name=libvirt-guests enabled=True
daemon_reload=False state=started user=False masked=None
Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add
dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM
instance configured for VDSM purposes.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping
Virtual Desktop Server Manager...
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting
Suspend Active Libvirt Guests...
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started
Suspend Active Libvirt Guests.
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt
version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem
<http://bugs.centos.org>, 2017-05-25-20:52:28, c1bm.rdu2.centos.org)
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname:
lago-basic-suite-master-host-0.lago.local
Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh:
vdsm: Running run_final_hooks
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file
while reading data: Input/output error
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped
Virtual Desktop Server Manager.

This suggests that systemd stopped vdsmd because mom did not start in
time. We have an extremely ugly two-way dependency between the two
services: each depends on the other. I am guessing that systemd's grace
period for such a cycle expired, and the pair was taken down.
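One way to inspect the suspected cycle on a host (a hedged sketch; the unit names are taken from the log above, and the exact dependency properties each unit declares may differ):

```shell
# Show the declared dependencies of both units:
systemctl show vdsmd.service -p Requires -p Wants -p After -p Before
systemctl show mom-vdsm.service -p Requires -p Wants -p After -p Before

# systemd reports any ordering cycle it detects (and which job it
# deleted to break it) in the journal at boot:
journalctl -b | grep -i 'ordering cycle'
```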

Does anybody have a guess what could cause such slowness?

