<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Aug 30, 2017 at 2:31 PM, Dominik Holler <span dir="ltr">&lt;<a href="mailto:dholler@redhat.com" target="_blank">dholler@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Wed, 30 Aug 2017 14:18:49 +0300<br>
Dan Kenigsberg &lt;<a href="mailto:danken@redhat.com">danken@redhat.com</a>&gt; wrote:<br>
<br>
&gt; On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren &lt;<a href="mailto:bkorren@redhat.com">bkorren@redhat.com</a>&gt;<br>
&gt; wrote:<br>
&gt; &gt; Test failed: [ 002_bootstrap.add_hosts ]<br>
&gt; &gt;<br>
&gt; &gt; Link to suspected patches:<br>
&gt; &gt;<br>
&gt; &gt; We suspect this is due to change to LLDPAD in upstream CentOS repos.<br>
&gt; &gt; We can&#39;t tell the exact point it was introduced because other<br>
&gt; &gt; ovirt-engine regressions introduced too much noise into the system.<br>
&gt; &gt;<br>
&gt; &gt; It also seems that the failure is not 100% reproducible sine we have<br>
&gt; &gt; runs that do not encounter it.<br>
&gt; &gt;<br>
&gt; &gt; Link to Job:<br>
&gt; &gt; <a href="http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151" rel="noreferrer" target="_blank">http://jenkins.ovirt.org/job/<wbr>ovirt-master_change-queue-<wbr>tester/2151</a><br>
&gt; &gt;<br>
&gt; &gt; (This job checked a specific patch to ovirt-log-collector but<br>
&gt; &gt; failure seems unrelated, and also happens on vanille &#39;tested&#39; repo<br>
&gt; &gt; right now).<br>
&gt; &gt;<br>
&gt; &gt; Link to all logs:<br>
&gt; &gt; VDSM logs seem to be most relevant here:<br>
&gt; &gt;<br>
&gt; &gt; <a href="http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/" rel="noreferrer" target="_blank">http://jenkins.ovirt.org/job/<wbr>ovirt-master_change-queue-<wbr>tester/2151/artifact/exported-<wbr>artifacts/basic-suit-master-<wbr>el7/test_logs/basic-suite-<wbr>master/post-002_bootstrap.py/<wbr>lago-basic-suite-master-host-<wbr>0/_var_log/vdsm/</a><br>
&gt; &gt;<br>
&gt; &gt; Error snippet from log:<br>
&gt; &gt;<br>
&gt; &gt; - From VDSM logs (note it exists very quickly):<br>
&gt; &gt;<br>
&gt; &gt; &lt;error&gt;<br>
&gt; &gt;<br>
&gt; &gt; 2017-08-30 05:55:52,208-0400 INFO  (MainThread) [vds] (PID: 4594) I<br>
&gt; &gt; am the actual vdsm 4.20.2-133.git0ce4485.el7.<wbr>centos<br>
&gt; &gt; lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64)<br>
&gt; &gt; (vdsmd:148) 2017-08-30 05:55:52,209-0400 INFO  (MainThread) [vds]<br>
&gt; &gt; VDSM will run with cpu affinity: frozenset([1]) (vdsmd:254)<br>
&gt; &gt; 2017-08-30 05:55:52,253-0400 INFO  (MainThread) [storage.check]<br>
&gt; &gt; Starting check service (check:92)<br>
&gt; &gt; 2017-08-30 05:55:52,257-0400 INFO  (MainThread) [storage.Dispatcher]<br>
&gt; &gt; Starting StorageDispatcher... (dispatcher:48)<br>
&gt; &gt; 2017-08-30 05:55:52,257-0400 INFO  (check/loop) [storage.asyncevent]<br>
&gt; &gt; Starting &lt;EventLoop running=True closed=False at 0x45317520&gt;<br>
&gt; &gt; (asyncevent:125)<br>
&gt; &gt; 2017-08-30 05:55:52,288-0400 INFO  (MainThread) [vdsm.api] START<br>
&gt; &gt; registerDomainStateChangeCallb<wbr>ack(callbackFunc=&lt;functools.<wbr>partial<br>
&gt; &gt; object at 0x2b47b50&gt;) from=internal,<br>
&gt; &gt; task_id=2cebe9ef-358e-47d7-<wbr>81a6-8c4b54b9cd6d (api:46)<br>
&gt; &gt; 2017-08-30 05:55:52,288-0400 INFO  (MainThread) [vdsm.api] FINISH<br>
&gt; &gt; registerDomainStateChangeCallb<wbr>ack return=None from=internal,<br>
&gt; &gt; task_id=2cebe9ef-358e-47d7-<wbr>81a6-8c4b54b9cd6d (api:52)<br>
&gt; &gt; 2017-08-30 05:55:52,288-0400 INFO  (MainThread) [MOM] Preparing MOM<br>
&gt; &gt; interface (momIF:53)<br>
&gt; &gt; 2017-08-30 05:55:52,289-0400 INFO  (MainThread) [MOM] Using named<br>
&gt; &gt; unix socket /var/run/vdsm/mom-vdsm.sock (momIF:62)<br>
&gt; &gt; 2017-08-30 05:55:52,289-0400 INFO  (MainThread) [root] Unregistering<br>
&gt; &gt; all secrets (secret:91)<br>
&gt; &gt; 2017-08-30 05:55:52,307-0400 INFO  (MainThread) [vds] Setting<br>
&gt; &gt; channels&#39; timeout to 30 seconds. (vmchannels:221)<br>
&gt; &gt; 2017-08-30 05:55:52,309-0400 INFO  (vmrecovery) [vds] recovery:<br>
&gt; &gt; completed in 0s (clientIF:516)<br>
&gt; &gt; 2017-08-30 05:55:52,310-0400 INFO  (MainThread)<br>
&gt; &gt; [vds.MultiProtocolAcceptor] Listening at :::54321<br>
&gt; &gt; (protocoldetector:196)<br>
&gt; &gt; 2017-08-30 05:55:52,496-0400 INFO  (http) [vds] Server running<br>
&gt; &gt; (http:58) 2017-08-30 05:55:52,742-0400 INFO  (periodic/1)<br>
&gt; &gt; [vdsm.api] START repoStats(domains=()) from=internal,<br>
&gt; &gt; task_id=85768015-8ecb-48e3-<wbr>9307-f671bfc33c65 (api:46)<br>
&gt; &gt; 2017-08-30 05:55:52,743-0400 INFO  (periodic/1) [vdsm.api] FINISH<br>
&gt; &gt; repoStats return={} from=internal,<br>
&gt; &gt; task_id=85768015-8ecb-48e3-<wbr>9307-f671bfc33c65 (api:52)<br>
&gt; &gt; 2017-08-30 05:55:52,743-0400 WARN  (periodic/1) [throttled] MOM not<br>
&gt; &gt; available. (throttledlog:103)<br>
&gt; &gt; 2017-08-30 05:55:52,744-0400 WARN  (periodic/1) [throttled] MOM not<br>
&gt; &gt; available, KSM stats will be missing. (throttledlog:103)<br>
&gt; &gt; 2017-08-30 05:55:55,043-0400 INFO  (MainThread) [vds] Received<br>
&gt; &gt; signal 15, shutting down (vdsmd:67)<br>
&gt; &gt; 2017-08-30 05:55:55,045-0400 INFO  (MainThread)<br>
&gt; &gt; [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759)<br>
&gt; &gt; 2017-08-30 05:55:55,049-0400 INFO  (MainThread) [vds] Stopping http<br>
&gt; &gt; server (http:79)<br>
&gt; &gt; 2017-08-30 05:55:55,049-0400 INFO  (http) [vds] Server stopped<br>
&gt; &gt; (http:69) 2017-08-30 05:55:55,050-0400 INFO  (MainThread) [root]<br>
&gt; &gt; Unregistering all secrets (secret:91)<br>
&gt; &gt; 2017-08-30 05:55:55,052-0400 INFO  (MainThread) [vdsm.api] START<br>
&gt; &gt; prepareForShutdown(options=<wbr>None) from=internal,<br>
&gt; &gt; task_id=ffde5caa-fa44-49ab-<wbr>bdd1-df81519680a3 (api:46)<br>
&gt; &gt; 2017-08-30 05:55:55,089-0400 INFO  (MainThread) [storage.Monitor]<br>
&gt; &gt; Shutting down domain monitors (monitor:222)<br>
&gt; &gt; 2017-08-30 05:55:55,090-0400 INFO  (MainThread) [storage.check]<br>
&gt; &gt; Stopping check service (check:105)<br>
&gt; &gt; 2017-08-30 05:55:55,090-0400 INFO  (check/loop) [storage.asyncevent]<br>
&gt; &gt; Stopping &lt;EventLoop running=True closed=False at 0x45317520&gt;<br>
&gt; &gt; (asyncevent:220)<br>
&gt; &gt; 2017-08-30 05:55:55,090-0400 INFO  (MainThread) [vdsm.api] FINISH<br>
&gt; &gt; prepareForShutdown return=None from=internal,<br>
&gt; &gt; task_id=ffde5caa-fa44-49ab-<wbr>bdd1-df81519680a3 (api:52)<br>
&gt; &gt; 2017-08-30 05:55:55,091-0400 INFO  (MainThread) [vds] Stopping<br>
&gt; &gt; threads (vdsmd:159)<br>
&gt; &gt; 2017-08-30 05:55:55,091-0400 INFO  (MainThread) [vds] Exiting<br>
&gt; &gt; (vdsmd:170)<br>
&gt; &gt;<br>
&gt; &gt; &lt;/error&gt;<br>
&gt; &gt;<br>
&gt; &gt;<br>
&gt; &gt; - From SuperVDSM logs:<br>
&gt; &gt;<br>
&gt; &gt; &lt;error&gt;<br>
&gt; &gt;<br>
&gt; &gt; MainThread::ERROR::2017-08-30<br>
&gt; &gt; 05:55:59,476::initializer::53:<wbr>:root::(_lldp_init) Failed to enable<br>
&gt; &gt; LLDP on eth1<br>
&gt; &gt; Traceback (most recent call last):<br>
&gt; &gt;   File<br>
&gt; &gt; &quot;/usr/lib/python2.7/site-<wbr>packages/vdsm/network/<wbr>initializer.py&quot;,<br>
&gt; &gt; line 51, in _lldp_init Lldp.enable_lldp_on_iface(<wbr>device)<br>
&gt; &gt;   File<br>
&gt; &gt; &quot;/usr/lib/python2.7/site-<wbr>packages/vdsm/network/lldp/<wbr>lldpad.py&quot;,<br>
&gt; &gt; line 30, in enable_lldp_on_iface<br>
&gt; &gt; lldptool.enable_lldp_on_iface(<wbr>iface, rx_only) File<br>
&gt; &gt; &quot;/usr/lib/python2.7/site-<wbr>packages/vdsm/network/lldpad/<wbr>lldptool.py&quot;,<br>
&gt; &gt; line 46, in enable_lldp_on_iface raise EnableLldpError(rc, out,<br>
&gt; &gt; err, iface) EnableLldpError: (1,<br>
&gt; &gt; &quot;timeout\n&#39;<wbr>M00000001C3040000000c04eth1000<wbr>badminStatus0002rx&#39; command<br>
&gt; &gt; timed out.\n&quot;, &#39;&#39;, &#39;eth1&#39;)<br>
&gt; &gt; MainThread::DEBUG::2017-08-30<br>
&gt; &gt; 05:55:59,477::cmdutils::133::<wbr>root::(exec_cmd) /usr/sbin/lldptool<br>
&gt; &gt; get-lldp -i eth0 adminStatus (cwd None)<br>
&gt; &gt;<br>
&gt; &gt;<br>
&gt; &gt; &lt;/error&gt;<br>
&gt;<br>
&gt; The LLDP errors are due to SELinux, to be fixed in CentOS-7.4.1. They<br>
&gt; do not cause supervdsmd to stop, so they are not the reason for vdsm&#39;s<br>
&gt; failure to start. Something ELSE is doing this.<br>
&gt;<br>
&gt; in<br>
&gt; <a href="http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/messages/*view*/" rel="noreferrer" target="_blank">http://jenkins.ovirt.org/job/<wbr>ovirt-master_change-queue-<wbr>tester/2151/artifact/exported-<wbr>artifacts/basic-suit-master-<wbr>el7/test_logs/basic-suite-<wbr>master/post-002_bootstrap.py/<wbr>lago-basic-suite-master-host-<wbr>0/_var_log/messages/*view*/</a><br>
&gt; we have<br>
&gt;<br>
&gt; Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled<br>
&gt; WARN MOM not available.<br>
&gt; Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled<br>
&gt; WARN MOM not available, KSM stats will be missing.<br>
&gt; Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started<br>
&gt; Session 10 of user root.<br>
&gt; Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New<br>
&gt; session 10 of user root.<br>
&gt; Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting<br>
&gt; Session 10 of user root.<br>
&gt; Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup<br>
&gt; Invoked with filter=* gather_subset=[&#39;all&#39;]<br>
&gt; fact_path=/etc/ansible/facts.d gather_timeout=10<br>
&gt; Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command<br>
&gt; Invoked with warn=True executable=None _uses_shell=False<br>
&gt; _raw_params=bash -c &quot;rpm -qi vdsm | grep -oE<br>
&gt; &#39;Version\\s+:\\s+[0-9\\.]+&#39; | awk &#39;{print $3}&#39;&quot; removes=None<br>
&gt; creates=None chdir=None<br>
&gt; Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd<br>
&gt; Invoked with no_block=False name=libvirt-guests enabled=True<br>
&gt; daemon_reload=False state=started user=False masked=None<br>
&gt; Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading.<br>
&gt; Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add<br>
&gt; dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.<br>
&gt; Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM<br>
&gt; instance configured for VDSM purposes.<br>
&gt; Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping<br>
&gt; Virtual Desktop Server Manager...<br>
&gt; Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting<br>
&gt; Suspend Active Libvirt Guests...<br>
&gt; Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started<br>
&gt; Suspend Active Libvirt Guests.<br>
&gt; Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt<br>
&gt; version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem<br>
&gt; &lt;<a href="http://bugs.centos.org" rel="noreferrer" target="_blank">http://bugs.centos.org</a>&gt;, <a href="tel:2017-05-25-20" value="+12017052520">2017-05-25-20</a>:52:28, <a href="http://c1bm.rdu2.centos.org" rel="noreferrer" target="_blank">c1bm.rdu2.centos.org</a>)<br>
&gt; Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname:<br>
&gt; lago-basic-suite-master-host-<wbr>0.lago.local<br>
&gt; Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh:<br>
&gt; vdsm: Running run_final_hooks<br>
&gt; Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file<br>
&gt; while reading data: Input/output error<br>
&gt; Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped<br>
&gt; Virtual Desktop Server Manager.<br>
&gt;<br>
&gt; suggesting that systemd stopped vdsmd since mom did not start on time.<br>
&gt; We have an extremely ugly two-way dependency between the two services.<br>
&gt; Each depends on the other one. I am guessing that  systemd&#39;s grace<br>
&gt; period for such a cycle has expired, and the couple was taken down.<br>
&gt;<br>
&gt; Does anybody have a guess what could cause such slowlness?<br>
<br>
</div></div>Even if the LLDP errors does not have a negative impact of the<br>
functional behavior of supervdsmd, they have a impact timing behavior<br>
of the initialization of supervdsmd.<br>
<br>
I will post a patch today which moves the initialization of LLDP in an<br>
extra thread, to make the initialization of LLDP asynchronly.<br>
<div class="HOEnZb"><div class="h5"><br></div></div></blockquote><div><br></div><div>I&#39;m not sure if this is possible. The LLDP init is processed during the initialization<br></div><div>step of supervdsm service and we depend on it to execute prior to enter the running<br></div><div>stage.<br></div><div>If we will use a thread in this case, the running stage will start but one of the inits<br></div><div>may not end.<br><br></div><div>Perhaps just do some specific checks on init to make it faster. (like check if we<br></div><div>have the selinux issue and break early)<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">
<br>
______________________________<wbr>_________________<br>
Devel mailing list<br>
<a href="mailto:Devel@ovirt.org">Devel@ovirt.org</a><br>
<a href="http://lists.ovirt.org/mailman/listinfo/devel" rel="noreferrer" target="_blank">http://lists.ovirt.org/<wbr>mailman/listinfo/devel</a><br>
</div></div></blockquote></div><br></div></div>