[ OST Failure Report ] [ oVirt master ] [ 2017-08-30 ] [add_hosts]


On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 002_bootstrap.add_hosts ]
Link to suspected patches:
We suspect this is due to a change to LLDPAD in the upstream CentOS repos. We can't tell the exact point at which it was introduced because other ovirt-engine regressions introduced too much noise into the system.
It also seems that the failure is not 100% reproducible, since we have runs that do not encounter it.
Link to Job: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151
(This job checked a specific patch to ovirt-log-collector, but the failure seems unrelated, and it also happens on the vanilla 'tested' repo right now).
Link to all logs: VDSM logs seem to be most relevant here:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/...
Error snippet from log:
- From VDSM logs (note it exits very quickly):
<error>
2017-08-30 05:55:52,208-0400 INFO (MainThread) [vds] (PID: 4594) I am the actual vdsm 4.20.2-133.git0ce4485.el7.centos lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64) (vdsmd:148)
2017-08-30 05:55:52,209-0400 INFO (MainThread) [vds] VDSM will run with cpu affinity: frozenset([1]) (vdsmd:254)
2017-08-30 05:55:52,253-0400 INFO (MainThread) [storage.check] Starting check service (check:92)
2017-08-30 05:55:52,257-0400 INFO (MainThread) [storage.Dispatcher] Starting StorageDispatcher... (dispatcher:48)
2017-08-30 05:55:52,257-0400 INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x45317520> (asyncevent:125)
2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] START registerDomainStateChangeCallback(callbackFunc=<functools.partial object at 0x2b47b50>) from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:46)
2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] FINISH registerDomainStateChangeCallback return=None from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:52)
2017-08-30 05:55:52,288-0400 INFO (MainThread) [MOM] Preparing MOM interface (momIF:53)
2017-08-30 05:55:52,289-0400 INFO (MainThread) [MOM] Using named unix socket /var/run/vdsm/mom-vdsm.sock (momIF:62)
2017-08-30 05:55:52,289-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91)
2017-08-30 05:55:52,307-0400 INFO (MainThread) [vds] Setting channels' timeout to 30 seconds. (vmchannels:221)
2017-08-30 05:55:52,309-0400 INFO (vmrecovery) [vds] recovery: completed in 0s (clientIF:516)
2017-08-30 05:55:52,310-0400 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:196)
2017-08-30 05:55:52,496-0400 INFO (http) [vds] Server running (http:58)
2017-08-30 05:55:52,742-0400 INFO (periodic/1) [vdsm.api] START repoStats(domains=()) from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:46)
2017-08-30 05:55:52,743-0400 INFO (periodic/1) [vdsm.api] FINISH repoStats return={} from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:52)
2017-08-30 05:55:52,743-0400 WARN (periodic/1) [throttled] MOM not available. (throttledlog:103)
2017-08-30 05:55:52,744-0400 WARN (periodic/1) [throttled] MOM not available, KSM stats will be missing. (throttledlog:103)
2017-08-30 05:55:55,043-0400 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:67)
2017-08-30 05:55:55,045-0400 INFO (MainThread) [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759)
2017-08-30 05:55:55,049-0400 INFO (MainThread) [vds] Stopping http server (http:79)
2017-08-30 05:55:55,049-0400 INFO (http) [vds] Server stopped (http:69)
2017-08-30 05:55:55,050-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91)
2017-08-30 05:55:55,052-0400 INFO (MainThread) [vdsm.api] START prepareForShutdown(options=None) from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:46)
2017-08-30 05:55:55,089-0400 INFO (MainThread) [storage.Monitor] Shutting down domain monitors (monitor:222)
2017-08-30 05:55:55,090-0400 INFO (MainThread) [storage.check] Stopping check service (check:105)
2017-08-30 05:55:55,090-0400 INFO (check/loop) [storage.asyncevent] Stopping <EventLoop running=True closed=False at 0x45317520> (asyncevent:220)
2017-08-30 05:55:55,090-0400 INFO (MainThread) [vdsm.api] FINISH prepareForShutdown return=None from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:52)
2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Stopping threads (vdsmd:159)
2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Exiting (vdsmd:170)
</error>
- From SuperVDSM logs:
<error>
MainThread::ERROR::2017-08-30 05:55:59,476::initializer::53::root::(_lldp_init) Failed to enable LLDP on eth1
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init
    Lldp.enable_lldp_on_iface(device)
  File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface
    lldptool.enable_lldp_on_iface(iface, rx_only)
  File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface
    raise EnableLldpError(rc, out, err, iface)
EnableLldpError: (1, "timeout\n'M00000001C3040000000c04eth1000badminStatus0002rx' command timed out.\n", '', 'eth1')
MainThread::DEBUG::2017-08-30 05:55:59,477::cmdutils::133::root::(exec_cmd) /usr/sbin/lldptool get-lldp -i eth0 adminStatus (cwd None)
</error>
The LLDP errors are due to SELinux, to be fixed in CentOS-7.4.1. They do not cause supervdsmd to stop, so they are not the reason for vdsm's failure to start. Something ELSE is doing this.
In http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/... we have:

Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available.
Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available, KSM stats will be missing.
Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started Session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting Session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10
Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None creates=None chdir=None
Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=libvirt-guests enabled=True daemon_reload=False state=started user=False masked=None
Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM instance configured for VDSM purposes.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping Virtual Desktop Server Manager...
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting Suspend Active Libvirt Guests...
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started Suspend Active Libvirt Guests.
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem <http://bugs.centos.org>, 2017-05-25-20:52:28, c1bm.rdu2.centos.org)
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname: lago-basic-suite-master-host-0.lago.local
Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file while reading data: Input/output error
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped Virtual Desktop Server Manager.

suggesting that systemd stopped vdsmd since mom did not start on time. We have an extremely ugly two-way dependency between the two services. Each depends on the other one. I am guessing that systemd's grace period for such a cycle has expired, and the couple was taken down.
Does anybody have a guess what could cause such slowness?
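To confirm the suspected cycle, one could dump the dependency properties of the two units straight from systemd. A minimal sketch (untested; it assumes the unit names vdsmd.service and mom-vdsm.service, which is what the journal messages above suggest):

    import subprocess

    UNITS = ('vdsmd.service', 'mom-vdsm.service')
    PROPS = 'Requires,Requisite,Wants,BindsTo,PartOf,Before,After'

    for unit in UNITS:
        # 'systemctl show -p <props>' prints only the selected properties,
        # which should make a vdsmd <-> mom-vdsm dependency cycle visible.
        out = subprocess.check_output(['systemctl', 'show', '-p', PROPS, unit]).decode()
        print('== %s ==' % unit)
        print(out)

If each unit lists the other under Requires/Before, that would back the cycle theory.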

On Wed, 30 Aug 2017 14:18:49 +0300 Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 002_bootstrap.add_hosts ]
Link to suspected patches:
We suspect this is due to a change to LLDPAD in the upstream CentOS repos. We can't tell the exact point at which it was introduced because other ovirt-engine regressions introduced too much noise into the system.
It also seems that the failure is not 100% reproducible, since we have runs that do not encounter it.
Link to Job: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151
(This job checked a specific patch to ovirt-log-collector, but the failure seems unrelated, and it also happens on the vanilla 'tested' repo right now).
Link to all logs: VDSM logs seem to be most relevant here:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/...
Error snippet from log:
- From VDSM logs (note it exits very quickly):
<error>
2017-08-30 05:55:52,208-0400 INFO (MainThread) [vds] (PID: 4594) I am the actual vdsm 4.20.2-133.git0ce4485.el7.centos lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64) (vdsmd:148) 2017-08-30 05:55:52,209-0400 INFO (MainThread) [vds] VDSM will run with cpu affinity: frozenset([1]) (vdsmd:254) 2017-08-30 05:55:52,253-0400 INFO (MainThread) [storage.check] Starting check service (check:92) 2017-08-30 05:55:52,257-0400 INFO (MainThread) [storage.Dispatcher] Starting StorageDispatcher... (dispatcher:48) 2017-08-30 05:55:52,257-0400 INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x45317520> (asyncevent:125) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] START registerDomainStateChangeCallback(callbackFunc=<functools.partial object at 0x2b47b50>) from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:46) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] FINISH registerDomainStateChangeCallback return=None from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:52) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [MOM] Preparing MOM interface (momIF:53) 2017-08-30 05:55:52,289-0400 INFO (MainThread) [MOM] Using named unix socket /var/run/vdsm/mom-vdsm.sock (momIF:62) 2017-08-30 05:55:52,289-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91) 2017-08-30 05:55:52,307-0400 INFO (MainThread) [vds] Setting channels' timeout to 30 seconds. (vmchannels:221) 2017-08-30 05:55:52,309-0400 INFO (vmrecovery) [vds] recovery: completed in 0s (clientIF:516) 2017-08-30 05:55:52,310-0400 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:196) 2017-08-30 05:55:52,496-0400 INFO (http) [vds] Server running (http:58) 2017-08-30 05:55:52,742-0400 INFO (periodic/1) [vdsm.api] START repoStats(domains=()) from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:46) 2017-08-30 05:55:52,743-0400 INFO (periodic/1) [vdsm.api] FINISH repoStats return={} from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:52) 2017-08-30 05:55:52,743-0400 WARN (periodic/1) [throttled] MOM not available. (throttledlog:103) 2017-08-30 05:55:52,744-0400 WARN (periodic/1) [throttled] MOM not available, KSM stats will be missing. 
(throttledlog:103) 2017-08-30 05:55:55,043-0400 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:67) 2017-08-30 05:55:55,045-0400 INFO (MainThread) [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759) 2017-08-30 05:55:55,049-0400 INFO (MainThread) [vds] Stopping http server (http:79) 2017-08-30 05:55:55,049-0400 INFO (http) [vds] Server stopped (http:69) 2017-08-30 05:55:55,050-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91) 2017-08-30 05:55:55,052-0400 INFO (MainThread) [vdsm.api] START prepareForShutdown(options=None) from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:46) 2017-08-30 05:55:55,089-0400 INFO (MainThread) [storage.Monitor] Shutting down domain monitors (monitor:222) 2017-08-30 05:55:55,090-0400 INFO (MainThread) [storage.check] Stopping check service (check:105) 2017-08-30 05:55:55,090-0400 INFO (check/loop) [storage.asyncevent] Stopping <EventLoop running=True closed=False at 0x45317520> (asyncevent:220) 2017-08-30 05:55:55,090-0400 INFO (MainThread) [vdsm.api] FINISH prepareForShutdown return=None from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:52) 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Stopping threads (vdsmd:159) 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Exiting (vdsmd:170)
</error>
- From SuperVDSM logs:
<error>
MainThread::ERROR::2017-08-30 05:55:59,476::initializer::53::root::(_lldp_init) Failed to enable LLDP on eth1 Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init Lldp.enable_lldp_on_iface(device) File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface lldptool.enable_lldp_on_iface(iface, rx_only) File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface raise EnableLldpError(rc, out, err, iface) EnableLldpError: (1, "timeout\n'M00000001C3040000000c04eth1000badminStatus0002rx' command timed out.\n", '', 'eth1') MainThread::DEBUG::2017-08-30 05:55:59,477::cmdutils::133::root::(exec_cmd) /usr/sbin/lldptool get-lldp -i eth0 adminStatus (cwd None)
</error>
The LLDP errors are due to SELinux, to be fixed in CentOS-7.4.1. They do not cause supervdsmd to stop, so they are not the reason for vdsm's failure to start. Something ELSE is doing this.
in http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/... we have
Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available. Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available, KSM stats will be missing. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started Session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting Session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10 Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None creates=None chdir=None Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=libvirt-guests enabled=True daemon_reload=False state=started user=False masked=None Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM instance configured for VDSM purposes. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping Virtual Desktop Server Manager... Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting Suspend Active Libvirt Guests... Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started Suspend Active Libvirt Guests. Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem <http://bugs.centos.org>, 2017-05-25-20:52:28, c1bm.rdu2.centos.org) Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname: lago-basic-suite-master-host-0.lago.local Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh: vdsm: Running run_final_hooks Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file while reading data: Input/output error Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped Virtual Desktop Server Manager.
suggesting that systemd stopped vdsmd since mom did not start on time. We have an extremely ugly two-way dependency between the two services. Each depends on the other one. I am guessing that systemd's grace period for such a cycle has expired, and the couple was taken down.
Does anybody have a guess what could cause such slowness?
Even if the LLDP errors do not have a negative impact on the functional behavior of supervdsmd, they do have an impact on the timing behavior of supervdsmd's initialization. I will post a patch today which moves the LLDP initialization into an extra thread, to make the LLDP initialization asynchronous.
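A minimal sketch of that idea (illustrative only, not the actual patch; the devices list and the enable_lldp_on_iface callable are passed in here to keep it self-contained, while the real code uses the _lldp_init()/Lldp.enable_lldp_on_iface() names from the traceback above):

    import logging
    import threading

    def _lldp_init(devices, enable_lldp_on_iface):
        # Same per-device loop as today; failures only affect LLDP.
        for device in devices:
            try:
                enable_lldp_on_iface(device)
            except Exception:
                logging.exception('Failed to enable LLDP on %s', device)

    def init_lldp_async(devices, enable_lldp_on_iface):
        # Run the LLDP enabling in a daemon thread so supervdsmd's init
        # no longer blocks on a hanging lldptool call.
        t = threading.Thread(target=_lldp_init,
                             args=(devices, enable_lldp_on_iface),
                             name='lldp-init')
        t.daemon = True
        t.start()
        return t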

On Wed, Aug 30, 2017 at 2:31 PM, Dominik Holler <dholler@redhat.com> wrote:
On Wed, 30 Aug 2017 14:18:49 +0300 Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 002_bootstrap.add_hosts ]
Link to suspected patches:
We suspect this is due to a change to LLDPAD in the upstream CentOS repos. We can't tell the exact point at which it was introduced because other ovirt-engine regressions introduced too much noise into the system.
It also seems that the failure is not 100% reproducible, since we have runs that do not encounter it.
Link to Job: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151
(This job checked a specific patch to ovirt-log-collector, but the failure seems unrelated, and it also happens on the vanilla 'tested' repo right now).
Link to all logs: VDSM logs seem to be most relevant here:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/
Error snippet from log:
- From VDSM logs (note it exits very quickly):
<error>
2017-08-30 05:55:52,208-0400 INFO (MainThread) [vds] (PID: 4594) I am the actual vdsm 4.20.2-133.git0ce4485.el7.centos lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64) (vdsmd:148) 2017-08-30 05:55:52,209-0400 INFO (MainThread) [vds] VDSM will run with cpu affinity: frozenset([1]) (vdsmd:254) 2017-08-30 05:55:52,253-0400 INFO (MainThread) [storage.check] Starting check service (check:92) 2017-08-30 05:55:52,257-0400 INFO (MainThread) [storage.Dispatcher] Starting StorageDispatcher... (dispatcher:48) 2017-08-30 05:55:52,257-0400 INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x45317520> (asyncevent:125) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] START registerDomainStateChangeCallback(callbackFunc=<functools.partial object at 0x2b47b50>) from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:46) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] FINISH registerDomainStateChangeCallback return=None from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:52) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [MOM] Preparing MOM interface (momIF:53) 2017-08-30 05:55:52,289-0400 INFO (MainThread) [MOM] Using named unix socket /var/run/vdsm/mom-vdsm.sock (momIF:62) 2017-08-30 05:55:52,289-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91) 2017-08-30 05:55:52,307-0400 INFO (MainThread) [vds] Setting channels' timeout to 30 seconds. (vmchannels:221) 2017-08-30 05:55:52,309-0400 INFO (vmrecovery) [vds] recovery: completed in 0s (clientIF:516) 2017-08-30 05:55:52,310-0400 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:196) 2017-08-30 05:55:52,496-0400 INFO (http) [vds] Server running (http:58) 2017-08-30 05:55:52,742-0400 INFO (periodic/1) [vdsm.api] START repoStats(domains=()) from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:46) 2017-08-30 05:55:52,743-0400 INFO (periodic/1) [vdsm.api] FINISH repoStats return={} from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:52) 2017-08-30 05:55:52,743-0400 WARN (periodic/1) [throttled] MOM not available. (throttledlog:103) 2017-08-30 05:55:52,744-0400 WARN (periodic/1) [throttled] MOM not available, KSM stats will be missing. 
(throttledlog:103) 2017-08-30 05:55:55,043-0400 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:67) 2017-08-30 05:55:55,045-0400 INFO (MainThread) [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759) 2017-08-30 05:55:55,049-0400 INFO (MainThread) [vds] Stopping http server (http:79) 2017-08-30 05:55:55,049-0400 INFO (http) [vds] Server stopped (http:69) 2017-08-30 05:55:55,050-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91) 2017-08-30 05:55:55,052-0400 INFO (MainThread) [vdsm.api] START prepareForShutdown(options=None) from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:46) 2017-08-30 05:55:55,089-0400 INFO (MainThread) [storage.Monitor] Shutting down domain monitors (monitor:222) 2017-08-30 05:55:55,090-0400 INFO (MainThread) [storage.check] Stopping check service (check:105) 2017-08-30 05:55:55,090-0400 INFO (check/loop) [storage.asyncevent] Stopping <EventLoop running=True closed=False at 0x45317520> (asyncevent:220) 2017-08-30 05:55:55,090-0400 INFO (MainThread) [vdsm.api] FINISH prepareForShutdown return=None from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:52) 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Stopping threads (vdsmd:159) 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Exiting (vdsmd:170)
</error>
- From SuperVDSM logs:
<error>
MainThread::ERROR::2017-08-30 05:55:59,476::initializer::53::root::(_lldp_init) Failed to enable LLDP on eth1 Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init Lldp.enable_lldp_on_iface(device) File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface lldptool.enable_lldp_on_iface(iface, rx_only) File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface raise EnableLldpError(rc, out, err, iface) EnableLldpError: (1, "timeout\n'M00000001C3040000000c04eth1000badminStatus0002rx' command timed out.\n", '', 'eth1') MainThread::DEBUG::2017-08-30 05:55:59,477::cmdutils::133::root::(exec_cmd) /usr/sbin/lldptool get-lldp -i eth0 adminStatus (cwd None)
</error>
The LLDP errors are due to SELinux, to be fixed in CentOS-7.4.1. They do not cause supervdsmd to stop, so they are not the reason for vdsm's failure to start. Something ELSE is doing this.
in http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/messages/*view*/ we have
Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available. Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available, KSM stats will be missing. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started Session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting Session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10 Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None creates=None chdir=None Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=libvirt-guests enabled=True daemon_reload=False state=started user=False masked=None Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM instance configured for VDSM purposes. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping Virtual Desktop Server Manager... Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting Suspend Active Libvirt Guests... Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started Suspend Active Libvirt Guests. Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem <http://bugs.centos.org>, 2017-05-25-20:52:28, c1bm.rdu2.centos.org) Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname: lago-basic-suite-master-host-0.lago.local Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh: vdsm: Running run_final_hooks Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file while reading data: Input/output error Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped Virtual Desktop Server Manager.
suggesting that systemd stopped vdsmd since mom did not start on time. We have an extremely ugly two-way dependency between the two services. Each depends on the other one. I am guessing that systemd's grace period for such a cycle has expired, and the couple was taken down.
Does anybody have a guess what could cause such slowness?
Even if the LLDP errors do not have a negative impact on the functional behavior of supervdsmd, they do have an impact on the timing behavior of supervdsmd's initialization.
I will post a patch today which moves the LLDP initialization into an extra thread, to make the LLDP initialization asynchronous.
I'm not sure this is possible. The LLDP init is processed during the initialization step of the supervdsm service, and we depend on it executing before we enter the running stage. If we use a thread in this case, the running stage will start while one of the init steps may not have finished. Perhaps just do some specific checks on init to make it faster (for example, check if we have the SELinux issue and break early).
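A rough, untested sketch of such a fail-fast check (the function name and the two-second timeout are made up; it just reuses the lldptool query that already shows up in the supervdsm log above):

    import subprocess
    import threading

    LLDPTOOL = '/usr/sbin/lldptool'

    def lldpad_is_responsive(iface='eth0', timeout=2.0):
        # Probe lldpad with the same query supervdsm already issues. If it
        # does not answer within `timeout` seconds (e.g. because of the
        # SELinux denial), report failure so the caller can skip LLDP init
        # early instead of blocking the whole initialization.
        proc = subprocess.Popen(
            [LLDPTOOL, 'get-lldp', '-i', iface, 'adminStatus'],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        timer = threading.Timer(timeout, proc.kill)
        timer.start()
        try:
            proc.communicate()
        finally:
            timer.cancel()
        return proc.returncode == 0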

On Wed, 30 Aug 2017 13:31:38 +0200 Dominik Holler <dholler@redhat.com> wrote:
On Wed, 30 Aug 2017 14:18:49 +0300 Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 002_bootstrap.add_hosts ]
Link to suspected patches:
We suspect this is due to a change to LLDPAD in the upstream CentOS repos. We can't tell the exact point at which it was introduced because other ovirt-engine regressions introduced too much noise into the system.
It also seems that the failure is not 100% reproducible, since we have runs that do not encounter it.
Link to Job: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151
(This job checked a specific patch to ovirt-log-collector, but the failure seems unrelated, and it also happens on the vanilla 'tested' repo right now).
Link to all logs: VDSM logs seem to be most relevant here:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/...
Error snippet from log:
- From VDSM logs (note it exits very quickly):
<error>
2017-08-30 05:55:52,208-0400 INFO (MainThread) [vds] (PID: 4594) I am the actual vdsm 4.20.2-133.git0ce4485.el7.centos lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64) (vdsmd:148) 2017-08-30 05:55:52,209-0400 INFO (MainThread) [vds] VDSM will run with cpu affinity: frozenset([1]) (vdsmd:254) 2017-08-30 05:55:52,253-0400 INFO (MainThread) [storage.check] Starting check service (check:92) 2017-08-30 05:55:52,257-0400 INFO (MainThread) [storage.Dispatcher] Starting StorageDispatcher... (dispatcher:48) 2017-08-30 05:55:52,257-0400 INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x45317520> (asyncevent:125) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] START registerDomainStateChangeCallback(callbackFunc=<functools.partial object at 0x2b47b50>) from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:46) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] FINISH registerDomainStateChangeCallback return=None from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:52) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [MOM] Preparing MOM interface (momIF:53) 2017-08-30 05:55:52,289-0400 INFO (MainThread) [MOM] Using named unix socket /var/run/vdsm/mom-vdsm.sock (momIF:62) 2017-08-30 05:55:52,289-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91) 2017-08-30 05:55:52,307-0400 INFO (MainThread) [vds] Setting channels' timeout to 30 seconds. (vmchannels:221) 2017-08-30 05:55:52,309-0400 INFO (vmrecovery) [vds] recovery: completed in 0s (clientIF:516) 2017-08-30 05:55:52,310-0400 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:196) 2017-08-30 05:55:52,496-0400 INFO (http) [vds] Server running (http:58) 2017-08-30 05:55:52,742-0400 INFO (periodic/1) [vdsm.api] START repoStats(domains=()) from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:46) 2017-08-30 05:55:52,743-0400 INFO (periodic/1) [vdsm.api] FINISH repoStats return={} from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:52) 2017-08-30 05:55:52,743-0400 WARN (periodic/1) [throttled] MOM not available. (throttledlog:103) 2017-08-30 05:55:52,744-0400 WARN (periodic/1) [throttled] MOM not available, KSM stats will be missing. 
(throttledlog:103) 2017-08-30 05:55:55,043-0400 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:67) 2017-08-30 05:55:55,045-0400 INFO (MainThread) [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759) 2017-08-30 05:55:55,049-0400 INFO (MainThread) [vds] Stopping http server (http:79) 2017-08-30 05:55:55,049-0400 INFO (http) [vds] Server stopped (http:69) 2017-08-30 05:55:55,050-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91) 2017-08-30 05:55:55,052-0400 INFO (MainThread) [vdsm.api] START prepareForShutdown(options=None) from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:46) 2017-08-30 05:55:55,089-0400 INFO (MainThread) [storage.Monitor] Shutting down domain monitors (monitor:222) 2017-08-30 05:55:55,090-0400 INFO (MainThread) [storage.check] Stopping check service (check:105) 2017-08-30 05:55:55,090-0400 INFO (check/loop) [storage.asyncevent] Stopping <EventLoop running=True closed=False at 0x45317520> (asyncevent:220) 2017-08-30 05:55:55,090-0400 INFO (MainThread) [vdsm.api] FINISH prepareForShutdown return=None from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:52) 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Stopping threads (vdsmd:159) 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Exiting (vdsmd:170)
</error>
- From SuperVDSM logs:
<error>
MainThread::ERROR::2017-08-30 05:55:59,476::initializer::53::root::(_lldp_init) Failed to enable LLDP on eth1 Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init Lldp.enable_lldp_on_iface(device) File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface lldptool.enable_lldp_on_iface(iface, rx_only) File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface raise EnableLldpError(rc, out, err, iface) EnableLldpError: (1, "timeout\n'M00000001C3040000000c04eth1000badminStatus0002rx' command timed out.\n", '', 'eth1') MainThread::DEBUG::2017-08-30 05:55:59,477::cmdutils::133::root::(exec_cmd) /usr/sbin/lldptool get-lldp -i eth0 adminStatus (cwd None)
</error>
The LLDP errors are due to SELinux, to be fixed in CentOS-7.4.1. They do not cause supervdsmd to stop, so they are not the reason for vdsm's failure to start. Something ELSE is doing this.
in http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/... we have
Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available. Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available, KSM stats will be missing. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started Session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting Session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10 Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None creates=None chdir=None Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=libvirt-guests enabled=True daemon_reload=False state=started user=False masked=None Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM instance configured for VDSM purposes. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping Virtual Desktop Server Manager... Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting Suspend Active Libvirt Guests... Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started Suspend Active Libvirt Guests. Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem <http://bugs.centos.org>, 2017-05-25-20:52:28, c1bm.rdu2.centos.org) Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname: lago-basic-suite-master-host-0.lago.local Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh: vdsm: Running run_final_hooks Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file while reading data: Input/output error Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped Virtual Desktop Server Manager.
suggesting that systemd stopped vdsmd since mom did not start on time. We have an extremely ugly two-way dependency between the two services. Each depends on the other one. I am guessing that systemd's grace period for such a cycle has expired, and the couple was taken down.
Does anybody have a guess what could cause such slowness?
Even if the LLDP errors do not have a negative impact on the functional behavior of supervdsmd, they do have an impact on the timing behavior of supervdsmd's initialization.
I will post a patch today which moves the LLDP initialization into an extra thread, to make the LLDP initialization asynchronous.
I analyzed this issue further by running OST [1] with the suspected LLDP initialization in supervdsmd disabled [2]. As expected, there are no log entries about enabling LLDP in [3], but OST is still failing with the same symptoms. So the reason for this failure is still unknown.
[1] http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_...
[2] https://gerrit.ovirt.org/#/c/81232/1/lib/vdsm/network/initializer.py
[3] http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_...
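(For the record, the check for [3] can be done with something as simple as the sketch below, run against a supervdsm.log downloaded from the job; the file name is just an example.)

    import re

    def count_lldp_lines(path):
        # Count LLDP-related lines in the downloaded supervdsm.log to confirm
        # that the experiment really removed the LLDP initialization.
        pattern = re.compile(r'lldp', re.IGNORECASE)
        with open(path) as f:
            return sum(1 for line in f if pattern.search(line))

    print(count_lldp_lines('supervdsm.log'))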

On Wed, Aug 30, 2017 at 6:47 PM, Dominik Holler <dholler@redhat.com> wrote:
On Wed, 30 Aug 2017 13:31:38 +0200 Dominik Holler <dholler@redhat.com> wrote:
On Wed, 30 Aug 2017 14:18:49 +0300 Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 002_bootstrap.add_hosts ]
Link to suspected patches:
We suspect this is due to a change to LLDPAD in the upstream CentOS repos. We can't tell the exact point at which it was introduced because other ovirt-engine regressions introduced too much noise into the system.
It also seems that the failure is not 100% reproducible, since we have runs that do not encounter it.
Link to Job: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151
(This job checked a specific patch to ovirt-log-collector, but the failure seems unrelated, and it also happens on the vanilla 'tested' repo right now).
Link to all logs: VDSM logs seem to be most relevant here:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/...
Error snippet from log:
- From VDSM logs (note it exits very quickly):
<error>
2017-08-30 05:55:52,208-0400 INFO (MainThread) [vds] (PID: 4594) I am the actual vdsm 4.20.2-133.git0ce4485.el7.centos lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64) (vdsmd:148) 2017-08-30 05:55:52,209-0400 INFO (MainThread) [vds] VDSM will run with cpu affinity: frozenset([1]) (vdsmd:254) 2017-08-30 05:55:52,253-0400 INFO (MainThread) [storage.check] Starting check service (check:92) 2017-08-30 05:55:52,257-0400 INFO (MainThread) [storage.Dispatcher] Starting StorageDispatcher... (dispatcher:48) 2017-08-30 05:55:52,257-0400 INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x45317520> (asyncevent:125) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] START registerDomainStateChangeCallback(callbackFunc=<functools.partial object at 0x2b47b50>) from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:46) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] FINISH registerDomainStateChangeCallback return=None from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:52) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [MOM] Preparing MOM interface (momIF:53) 2017-08-30 05:55:52,289-0400 INFO (MainThread) [MOM] Using named unix socket /var/run/vdsm/mom-vdsm.sock (momIF:62) 2017-08-30 05:55:52,289-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91) 2017-08-30 05:55:52,307-0400 INFO (MainThread) [vds] Setting channels' timeout to 30 seconds. (vmchannels:221) 2017-08-30 05:55:52,309-0400 INFO (vmrecovery) [vds] recovery: completed in 0s (clientIF:516) 2017-08-30 05:55:52,310-0400 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:196) 2017-08-30 05:55:52,496-0400 INFO (http) [vds] Server running (http:58) 2017-08-30 05:55:52,742-0400 INFO (periodic/1) [vdsm.api] START repoStats(domains=()) from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:46) 2017-08-30 05:55:52,743-0400 INFO (periodic/1) [vdsm.api] FINISH repoStats return={} from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:52) 2017-08-30 05:55:52,743-0400 WARN (periodic/1) [throttled] MOM not available. (throttledlog:103) 2017-08-30 05:55:52,744-0400 WARN (periodic/1) [throttled] MOM not available, KSM stats will be missing. 
(throttledlog:103) 2017-08-30 05:55:55,043-0400 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:67) 2017-08-30 05:55:55,045-0400 INFO (MainThread) [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759) 2017-08-30 05:55:55,049-0400 INFO (MainThread) [vds] Stopping http server (http:79) 2017-08-30 05:55:55,049-0400 INFO (http) [vds] Server stopped (http:69) 2017-08-30 05:55:55,050-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91) 2017-08-30 05:55:55,052-0400 INFO (MainThread) [vdsm.api] START prepareForShutdown(options=None) from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:46) 2017-08-30 05:55:55,089-0400 INFO (MainThread) [storage.Monitor] Shutting down domain monitors (monitor:222) 2017-08-30 05:55:55,090-0400 INFO (MainThread) [storage.check] Stopping check service (check:105) 2017-08-30 05:55:55,090-0400 INFO (check/loop) [storage.asyncevent] Stopping <EventLoop running=True closed=False at 0x45317520> (asyncevent:220) 2017-08-30 05:55:55,090-0400 INFO (MainThread) [vdsm.api] FINISH prepareForShutdown return=None from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:52) 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Stopping threads (vdsmd:159) 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Exiting (vdsmd:170)
</error>
- From SuperVDSM logs:
<error>
MainThread::ERROR::2017-08-30 05:55:59,476::initializer::53::root::(_lldp_init) Failed to enable LLDP on eth1 Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init Lldp.enable_lldp_on_iface(device) File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface lldptool.enable_lldp_on_iface(iface, rx_only) File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface raise EnableLldpError(rc, out, err, iface) EnableLldpError: (1, "timeout\n'M00000001C3040000000c04eth1000badminStatus0002rx' command timed out.\n", '', 'eth1') MainThread::DEBUG::2017-08-30 05:55:59,477::cmdutils::133::root::(exec_cmd) /usr/sbin/lldptool get-lldp -i eth0 adminStatus (cwd None)
</error>
The LLDP errors are due to SELinux, to be fixed in CentOS-7.4.1. They do not cause supervdsmd to stop, so they are not the reason for vdsm's failure to start. Something ELSE is doing this.
in http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/... we have
Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available. Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available, KSM stats will be missing. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started Session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting Session 10 of user root. Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10 Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None creates=None chdir=None Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=libvirt-guests enabled=True daemon_reload=False state=started user=False masked=None Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM instance configured for VDSM purposes. Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping Virtual Desktop Server Manager... Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting Suspend Active Libvirt Guests... Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started Suspend Active Libvirt Guests. Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem <http://bugs.centos.org>, 2017-05-25-20:52:28, c1bm.rdu2.centos.org) Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname: lago-basic-suite-master-host-0.lago.local Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh: vdsm: Running run_final_hooks Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file while reading data: Input/output error Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped Virtual Desktop Server Manager.
suggesting that systemd stopped vdsmd since mom did not start on time. We have an extremely ugly two-way dependency between the two services. Each depends on the other one. I am guessing that systemd's grace period for such a cycle has expired, and the couple was taken down.
Does anybody have a guess what could cause such slowness?
Even if the LLDP errors do not have a negative impact on the functional behavior of supervdsmd, they do have an impact on the timing behavior of supervdsmd's initialization.
I will post a patch today which moves the LLDP initialization into an extra thread, to make the LLDP initialization asynchronous.
I analyzed this issue further by running OST [1] with the suspected LLDP initialization in supervdsmd disabled [2]. As expected, there are no log entries about enabling LLDP in [3], but OST is still failing with the same symptoms.
So the reason for this failure is still unknown.
[1] http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_...
[2] https://gerrit.ovirt.org/#/c/81232/1/lib/vdsm/network/initializer.py
[3] http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_...
So we're back at square one. Another possible culprit may be Ansible: vdsm is stopped two seconds after Ansible logs in to the host.

Aug 30 11:26:24 lago-basic-suite-master-host-0 systemd: Starting Session 10 of user root.
Aug 30 11:26:25 lago-basic-suite-master-host-0 python: ansible-setup Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10
Aug 30 11:26:25 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None creates=None chdir=None
Aug 30 11:26:26 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=libvirt-guests enabled=True daemon_reload=False state=started user=False masked=None
Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Reloading.
Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Stopped MOM instance configured for VDSM purposes.
Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Stopping Virtual Desktop Server Manager...

Could it be that it triggers a systemd reload that makes systemd croak on the vdsm-mom cycle?
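One way to check that hypothesis outside of OST would be to replay the same sequence by hand on a host where vdsmd is already up and see whether vdsmd goes down (a rough, untested sketch; the unit names follow the journal above):

    import subprocess
    import time

    def run(*cmd):
        # Echo and run a command, ignoring its exit status.
        print('$ ' + ' '.join(cmd))
        subprocess.call(list(cmd))

    # Replay roughly what the Ansible tasks above did (enabling the unit is
    # presumably what triggers the "systemd: Reloading." line), then check
    # whether systemd stops vdsmd as a side effect.
    run('systemctl', 'status', 'vdsmd.service', '--no-pager')
    run('systemctl', 'enable', 'libvirt-guests.service')
    run('systemctl', 'start', 'libvirt-guests.service')
    time.sleep(5)
    run('systemctl', 'status', 'vdsmd.service', '--no-pager')
    run('systemctl', 'status', 'mom-vdsm.service', '--no-pager')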

On Wed, Aug 30, 2017 at 8:43 PM, Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Aug 30, 2017 at 6:47 PM, Dominik Holler <dholler@redhat.com> wrote:
On Wed, 30 Aug 2017 13:31:38 +0200 Dominik Holler <dholler@redhat.com> wrote:
On Wed, 30 Aug 2017 14:18:49 +0300 Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Aug 30, 2017 at 1:40 PM, Barak Korren <bkorren@redhat.com> wrote:
Test failed: [ 002_bootstrap.add_hosts ]
Link to suspected patches:
We suspect this is due to a change to LLDPAD in the upstream CentOS repos. We can't tell the exact point at which it was introduced because other ovirt-engine regressions introduced too much noise into the system.
It also seems that the failure is not 100% reproducible, since we have runs that do not encounter it.
Link to Job: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151
(This job checked a specific patch to ovirt-log-collector, but the failure seems unrelated, and it also happens on the vanilla 'tested' repo right now).
Link to all logs: VDSM logs seem to be most relevant here:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/
Error snippet from log:
- From VDSM logs (note it exits very quickly):
<error>
2017-08-30 05:55:52,208-0400 INFO (MainThread) [vds] (PID: 4594) I am the actual vdsm 4.20.2-133.git0ce4485.el7.centos lago-basic-suite-master-host-0 (3.10.0-514.2.2.el7.x86_64) (vdsmd:148) 2017-08-30 05:55:52,209-0400 INFO (MainThread) [vds] VDSM will run with cpu affinity: frozenset([1]) (vdsmd:254) 2017-08-30 05:55:52,253-0400 INFO (MainThread) [storage.check] Starting check service (check:92) 2017-08-30 05:55:52,257-0400 INFO (MainThread) [storage.Dispatcher] Starting StorageDispatcher... (dispatcher:48) 2017-08-30 05:55:52,257-0400 INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x45317520> (asyncevent:125) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] START registerDomainStateChangeCallback(callbackFunc=<functools.partial object at 0x2b47b50>) from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:46) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [vdsm.api] FINISH registerDomainStateChangeCallback return=None from=internal, task_id=2cebe9ef-358e-47d7-81a6-8c4b54b9cd6d (api:52) 2017-08-30 05:55:52,288-0400 INFO (MainThread) [MOM] Preparing MOM interface (momIF:53) 2017-08-30 05:55:52,289-0400 INFO (MainThread) [MOM] Using named unix socket /var/run/vdsm/mom-vdsm.sock (momIF:62) 2017-08-30 05:55:52,289-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91) 2017-08-30 05:55:52,307-0400 INFO (MainThread) [vds] Setting channels' timeout to 30 seconds. (vmchannels:221) 2017-08-30 05:55:52,309-0400 INFO (vmrecovery) [vds] recovery: completed in 0s (clientIF:516) 2017-08-30 05:55:52,310-0400 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:196) 2017-08-30 05:55:52,496-0400 INFO (http) [vds] Server running (http:58) 2017-08-30 05:55:52,742-0400 INFO (periodic/1) [vdsm.api] START repoStats(domains=()) from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:46) 2017-08-30 05:55:52,743-0400 INFO (periodic/1) [vdsm.api] FINISH repoStats return={} from=internal, task_id=85768015-8ecb-48e3-9307-f671bfc33c65 (api:52) 2017-08-30 05:55:52,743-0400 WARN (periodic/1) [throttled] MOM not available. (throttledlog:103) 2017-08-30 05:55:52,744-0400 WARN (periodic/1) [throttled] MOM not available, KSM stats will be missing. 
(throttledlog:103) 2017-08-30 05:55:55,043-0400 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:67) 2017-08-30 05:55:55,045-0400 INFO (MainThread) [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:759) 2017-08-30 05:55:55,049-0400 INFO (MainThread) [vds] Stopping http server (http:79) 2017-08-30 05:55:55,049-0400 INFO (http) [vds] Server stopped (http:69) 2017-08-30 05:55:55,050-0400 INFO (MainThread) [root] Unregistering all secrets (secret:91) 2017-08-30 05:55:55,052-0400 INFO (MainThread) [vdsm.api] START prepareForShutdown(options=None) from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:46) 2017-08-30 05:55:55,089-0400 INFO (MainThread) [storage.Monitor] Shutting down domain monitors (monitor:222) 2017-08-30 05:55:55,090-0400 INFO (MainThread) [storage.check] Stopping check service (check:105) 2017-08-30 05:55:55,090-0400 INFO (check/loop) [storage.asyncevent] Stopping <EventLoop running=True closed=False at 0x45317520> (asyncevent:220) 2017-08-30 05:55:55,090-0400 INFO (MainThread) [vdsm.api] FINISH prepareForShutdown return=None from=internal, task_id=ffde5caa-fa44-49ab-bdd1-df81519680a3 (api:52) 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Stopping threads (vdsmd:159) 2017-08-30 05:55:55,091-0400 INFO (MainThread) [vds] Exiting (vdsmd:170)
</error>
- From SuperVDSM logs:
<error>
MainThread::ERROR::2017-08-30 05:55:59,476::initializer::53::root::(_lldp_init) Failed to enable LLDP on eth1 Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/network/initializer.py", line 51, in _lldp_init Lldp.enable_lldp_on_iface(device) File "/usr/lib/python2.7/site-packages/vdsm/network/lldp/lldpad.py", line 30, in enable_lldp_on_iface lldptool.enable_lldp_on_iface(iface, rx_only) File "/usr/lib/python2.7/site-packages/vdsm/network/lldpad/lldptool.py", line 46, in enable_lldp_on_iface raise EnableLldpError(rc, out, err, iface) EnableLldpError: (1, "timeout\n'M00000001C3040000000c04eth1000badminStatus0002rx' command timed out.\n", '', 'eth1') MainThread::DEBUG::2017-08-30 05:55:59,477::cmdutils::133::root::(exec_cmd) /usr/sbin/lldptool get-lldp -i eth0 adminStatus (cwd None)
</error>
The LLDP errors are due to SELinux and are to be fixed in CentOS 7.4.1. They do not cause supervdsmd to stop, so they are not the reason for vdsm's failure to start. Something ELSE is doing that.
In http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/messages/*view*/ we have:
Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available.
Aug 30 05:55:52 lago-basic-suite-master-host-0 journal: vdsm throttled WARN MOM not available, KSM stats will be missing.
Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Started Session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting Session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10
Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None creates=None chdir=None
Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=libvirt-guests enabled=True daemon_reload=False state=started user=False masked=None
Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM instance configured for VDSM purposes.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping Virtual Desktop Server Manager...
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting Suspend Active Libvirt Guests...
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started Suspend Active Libvirt Guests.
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem <http://bugs.centos.org>, 2017-05-25-20:52:28, c1bm.rdu2.centos.org)
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname: lago-basic-suite-master-host-0.lago.local
Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file while reading data: Input/output error
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped Virtual Desktop Server Manager.
suggesting that systemd stopped vdsmd because mom did not start on time. We have an extremely ugly two-way dependency between the two services: each depends on the other. I am guessing that systemd's grace period for such a cycle expired and the pair was taken down.
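For anyone who wants to double-check the cycle on a host, here is a rough diagnostic sketch (not vdsm code; the unit names are my assumption based on the journal above) that dumps the dependency-related properties of both units with systemctl:

# Diagnostic sketch only, not part of vdsm. Unit names are assumed from the
# journal snippet above; adjust them if the services are named differently.
import subprocess

UNITS = ['vdsmd.service', 'mom-vdsm.service']
# Dependency-related properties that systemd tracks per unit.
PROPS = 'Requires,Wants,BindsTo,PartOf,After,Before'

for unit in UNITS:
    output = subprocess.check_output(['systemctl', 'show', '-p', PROPS, unit])
    print('--- %s ---' % unit)
    print(output.decode('utf-8', 'replace'))

If each unit lists the other in its Requires/After output, the cycle described above is confirmed.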
Does anybody have a guess what could cause such slowness?
Even if the LLDP errors do not have a negative impact on the functional behavior of supervdsmd, they do affect the timing of supervdsmd's initialization.
I will post a patch today that moves the LLDP initialization into an extra thread, so that it runs asynchronously.
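For illustration, a minimal sketch of what such a change could look like; this is not the actual patch, and only the _lldp_init name comes from the traceback earlier in the thread:

# Illustrative sketch only, not the actual vdsm patch.
import logging
import threading


def _lldp_init(net_info):
    # Placeholder for the existing, potentially slow LLDP initialization
    # (in vdsm this lives in lib/vdsm/network/initializer.py).
    pass


def init_lldp_async(net_info):
    """Run LLDP initialization in a background thread so it cannot delay
    the rest of supervdsmd's startup."""
    def _worker():
        try:
            _lldp_init(net_info)
        except Exception:
            logging.exception('LLDP initialization failed')

    thread = threading.Thread(target=_worker, name='lldp-init')
    thread.daemon = True  # never keep supervdsmd alive just for LLDP
    thread.start()
    return thread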
I analyzed this issue further by running OST [1] with the suspected LLDP initialization in supervdsmd disabled [2] (a minimal sketch of this kind of guard follows the links below). As expected, there are no log entries about enabling LLDP in [3], but OST still fails with the same symptoms.
So the reason for this failure is still unknown.
[1] http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/1045/
[2] https://gerrit.ovirt.org/#/c/81232/1/lib/vdsm/network/initializer.py
[3] http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/1045/artifact/exported-artifacts/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/supervdsm.log/*view*/
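The sketch of the guard mentioned above; the flag, the device list and the helper are hypothetical stand-ins, not vdsm code, and only the idea of skipping enable_lldp_on_iface comes from the patch and traceback:

# Self-contained sketch of the experiment only; everything here is a stand-in.
import logging

INIT_LLDP = False  # set to False to mimic the "LLDP disabled" OST run


def enable_lldp_on_iface(iface):
    # Stand-in for vdsm.network.lldp.Lldp.enable_lldp_on_iface().
    logging.info('enabling LLDP rx on %s', iface)


def lldp_init(devices):
    if not INIT_LLDP:
        logging.info('LLDP initialization disabled for debugging')
        return
    for device in devices:
        try:
            enable_lldp_on_iface(device)
        except Exception:
            logging.exception('Failed to enable LLDP on %s', device)


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    lldp_init(['eth0', 'eth1'])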
So we're back to square one. Another possible culprit may be ansible: vdsm is stopped two seconds after ansible logs in to the host.
Aug 30 11:26:24 lago-basic-suite-master-host-0 systemd: Starting Session 10 of user root.
Aug 30 11:26:25 lago-basic-suite-master-host-0 python: ansible-setup Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10
Aug 30 11:26:25 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None creates=None chdir=None
Aug 30 11:26:26 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=libvirt-guests enabled=True daemon_reload=False state=started user=False masked=None
Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Reloading.
Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Stopped MOM instance configured for VDSM purposes.
Aug 30 11:26:26 lago-basic-suite-master-host-0 systemd: Stopping Virtual Desktop Server Manager...
Could it be that it triggers a systemd reload that makes systemd croak on the vdsm-mom cycle?
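One crude way to test that hypothesis (an untried sketch, not something OST runs; unit names are assumed from the journal above) would be to fire a daemon-reload while vdsmd is still starting and see whether systemd takes the pair down:

# Crude reproduction sketch: trigger a daemon-reload during vdsmd startup
# and then check whether systemd stopped the vdsmd/mom-vdsm pair.
import subprocess
import time

subprocess.check_call(['systemctl', 'start', '--no-block', 'vdsmd.service'])
time.sleep(1)  # let startup get under way
subprocess.check_call(['systemctl', 'daemon-reload'])
time.sleep(10)
for unit in ('vdsmd.service', 'mom-vdsm.service'):
    # is-active exits non-zero when the unit is not active
    rc = subprocess.call(['systemctl', 'is-active', unit])
    print('%s active: %s' % (unit, rc == 0))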
We are not restarting VDSM within the ovirt-host-deploy Ansible role; the VDSM restart is performed in the host-deploy part, same as in previous versions. Within ovirt-host-deploy-firewalld we only enable and restart the firewalld service.

Comparing a successful add-host flow [1] to a failed one [2], we notice that in the failed add-host run ansible logs in twice (session 10 and session 11). Could it be somehow related? Notice that session 11 uses the OLD way (awk+grep based) to find vdsm's version.

Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd-logind: New session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 systemd: Starting Session 10 of user root.
Aug 30 05:55:53 lago-basic-suite-master-host-0 python: ansible-setup Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10
Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -qi vdsm | grep -oE 'Version\\s+:\\s+[0-9\\.]+' | awk '{print $3}'" removes=None creates=None chdir=None
Aug 30 05:55:54 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=libvirt-guests enabled=True daemon_reload=False state=started user=False masked=None
Aug 30 05:55:54 lago-basic-suite-master-host-0 systemd: Reloading.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped MOM instance configured for VDSM purposes.
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopping Virtual Desktop Server Manager...
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Starting Suspend Active Libvirt Guests...
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Started Suspend Active Libvirt Guests.
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: libvirt version: 2.0.0, package: 10.el7_3.9 (CentOS BuildSystem <http://bugs.centos.org>, 2017-05-25-20:52:28, c1bm.rdu2.centos.org)
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: hostname: lago-basic-suite-master-host-0.lago.local
Aug 30 05:55:55 lago-basic-suite-master-host-0 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Aug 30 05:55:55 lago-basic-suite-master-host-0 journal: End of file while reading data: Input/output error
Aug 30 05:55:55 lago-basic-suite-master-host-0 systemd: Stopped Virtual Desktop Server Manager.
Aug 30 05:55:55 lago-basic-suite-master-host-0 python: ansible-command Invoked with warn=True executable=None _uses_shell=False _raw_params=bash -c "rpm -q vdsm --qf '%{VERSION}'" removes=None creates=None chdir=None
Aug 30 05:55:55 lago-basic-suite-master-host-0 python: ansible-systemd Invoked with no_block=False name=iptables enabled=False daemon_reload=False state=stopped user=False masked=None

[1]: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2197/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/messages/*view*/
[2]: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/2151/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/messages/*view*/

The difference between the two is this patch: https://github.com/oVirt/ovirt-ansible/pull/62/files#diff-f7199e24fce211caec...
Could the libvirt-guests restart cause this issue?
Anyway, can you try to remove ovirt-ansible-roles-1.1.1-0.1.master.20170830145950.el7.centos.noarch.rpm from tested and see if it reproduces?

So with ovirt-ansible-roles-1.1.0 (which is the last officially released version) everything runs fine and the host is added properly (tested several times on CentOS 7.3).

But we need to fix executing builds from GitHub, otherwise we cannot continue working on ovirt-ansible-roles in GitHub:

1. If you add the comment 'ci build please' to a GitHub PR, a build will be executed and the result will be a repo with an RPM to be used in OST (but this build will not be passed to the queue to be added to the tested repo).
2. When a PR is merged, then either automatically or by some other comment/action (available to maintainers only) a build will be executed, and if the build is OK it can be queued to be added to the tested repo.

Without the above we just cannot continue working on ovirt-ansible-roles on GitHub.

Thanks
Martin

Barak - can you prioritize fixing this one? It blocks us from adding additional functionality to the ovirt-ansible-roles repo.

It is currently my #1 task, but dousing fires comes first as usual...

We are working on the GitHub build triggering, but given that GitHub triggering works quite differently than Gerrit's, we need to rewrite some significant chunks of code. The tracker ticket is [1], or more specifically [2], [3] and [4]. We have some pending patches for this, but we have not finished implementing all the required functionality yet.

[1]: https://ovirt-jira.atlassian.net/browse/OVIRT-1548
[2]: https://ovirt-jira.atlassian.net/browse/OVIRT-1547
[3]: https://ovirt-jira.atlassian.net/browse/OVIRT-1522
[4]: https://ovirt-jira.atlassian.net/browse/OVIRT-1486

As a temporary measure, we can disable the automated deployment from jobs that run on PR messages, but this means that we will need to do the deployment manually when needed.

Seems like a good idea, as we really need to add only merged PRs to the tested queue. How exactly will we be able to perform that manually?

For the first few times, please ask me to do it. After that I'll either have automation in place or be certain enough about how to do it to be able to explain it to someone else.

Actually, I think that if we go for this, I might be able to replace your existing jobs with the new code (it supports 99% of what the existing code can do with PRs), and at least be rid of the nagging bot messages for new PR submitters.

OK, so please let's try this approach.
Thanks
Martin

Ok. This patch makes the existing jobs not deploy: https://gerrit.ovirt.org/#/c/81268/

I don't want to risk making the existing jobs not work over the weekend, so I will not replace them with the new code today. Instead I will set up the new jobs to run side by side with the existing ones.

I'm only seeing jobs for ovirt-ansible. Are we supposed to also have jobs for 'ovirt-ansible-modules'?
participants (7)
- Barak Korren
- Dan Kenigsberg
- Dominik Holler
- Edward Haas
- Martin Perina
- Ondra Machacek
- Oved Ourfali