On Sun, Aug 20, 2017 at 11:08 AM Dan Kenigsberg <danken@redhat.com> wrote:
On Sun, Aug 20, 2017 at 10:39 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
>
>
> On Sun, Aug 20, 2017 at 8:48 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
>>
>> Failed test: basic_suite_master/002_bootstrap
>> Version: oVirt Master
>> Link to failed job: ovirt-master_change-queue-tester/1860/
>> Link to logs (Jenkins): test logs
>> Suspected patch: https://gerrit.ovirt.org/#/c/80749/3
>>
>> From what I was able to find, it seems that for some reason VDSM failed to
>> start on host 1. The VDSM log is empty, and the only error I could find in
>> supervdsm.log is that LLDP failed to start (not sure if it's related).
>
>
> Can you check the networking on the hosts? Something's very strange there.
> For example:
> Aug 19 16:38:42 lago-basic-suite-master-host0 NetworkManager[685]: <info>
> [1503175122.2682] manager: (e7NZWeNDXwIjQia): new Bond device
> (/org/freedesktop/NetworkManager/Devices/17)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> Setting xmit hash policy to layer2+3 (2)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> Setting xmit hash policy to encap2+3 (3)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> Setting xmit hash policy to encap3+4 (4)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> option xmit_hash_policy: invalid value (5)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> Setting primary_reselect to always (0)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> Setting primary_reselect to better (1)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> Setting primary_reselect to failure (2)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> option primary_reselect: invalid value (3)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> Setting arp_all_targets to any (0)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> Setting arp_all_targets to all (1)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: e7NZWeNDXwIjQia:
> option arp_all_targets: invalid value (2)
> Aug 19 16:38:42 lago-basic-suite-master-host0 kernel: bonding:
> e7NZWeNDXwIjQia is being deleted...
> Aug 19 16:38:42 lago-basic-suite-master-host0 lldpad: recvfrom(Event
> interface): No buffer space available
>
> Y.



The post-boot noise with funny-looking bonds comes from our calling
`vdsm-tool dump-bonding-options` on every boot, in order to find the
bonding defaults for the current kernel.
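
To make that noise easier to read, here is a minimal sketch that groups
the kernel messages per bonding option. It assumes the probe walks
numeric values for each option until the kernel rejects one, which is
only what the quoted log suggests; I have not checked this against the
vdsm-tool code. The function name `summarize_probe` and both regexes are
made up for this illustration.

```python
import re
from collections import defaultdict

# Matches e.g. "Setting xmit hash policy to layer2+3 (2)"
ACCEPTED = re.compile(r"Setting (?P<opt>.+?) to (?P<name>\S+) \((?P<num>\d+)\)")
# Matches e.g. "option xmit_hash_policy: invalid value (5)"
REJECTED = re.compile(r"option (?P<opt>\w+): invalid value \((?P<num>\d+)\)")


def summarize_probe(lines):
    """Return (accepted, rejected) for bonding options seen in kernel log lines.

    accepted maps option name -> list of (numeric value, symbolic name).
    rejected maps option name -> first numeric value the kernel refused.
    """
    accepted = defaultdict(list)
    rejected = {}
    for line in lines:
        m = ACCEPTED.search(line)
        if m:
            # The kernel prints "xmit hash policy" with spaces in the
            # "Setting" message but with underscores in the "invalid" one.
            key = m.group("opt").replace(" ", "_")
            accepted[key].append((int(m.group("num")), m.group("name")))
            continue
        m = REJECTED.search(line)
        if m:
            rejected.setdefault(m.group("opt"), int(m.group("num")))
    return dict(accepted), rejected


if __name__ == "__main__":
    # Unwrapped copies of a few lines from the log quoted above.
    sample = [
        "kernel: e7NZWeNDXwIjQia: Setting xmit hash policy to layer2+3 (2)",
        "kernel: e7NZWeNDXwIjQia: Setting xmit hash policy to encap2+3 (3)",
        "kernel: e7NZWeNDXwIjQia: Setting xmit hash policy to encap3+4 (4)",
        "kernel: e7NZWeNDXwIjQia: option xmit_hash_policy: invalid value (5)",
    ]
    accepted, rejected = summarize_probe(sample)
    print(accepted)
    print(rejected)
```

Read this way, each "invalid value" line just marks the end of the
valid range for one option, and the temporary bond is deleted right
after.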

>
>>
>> From host-deploy log:
>>
>> 2017-08-19 16:38:41,476-0400 DEBUG otopi.plugins.otopi.services.systemd
>> systemd.state:130 starting service vdsmd
>> 2017-08-19 16:38:41,476-0400 DEBUG otopi.plugins.otopi.services.systemd
>> plugin.executeRaw:813 execute: ('/bin/systemctl', 'start', 'vdsmd.service'),
>> executable='None', cwd='None', env=None
>> 2017-08-19 16:38:44,628-0400 DEBUG otopi.plugins.otopi.services.systemd
>> plugin.executeRaw:863 execute-result: ('/bin/systemctl', 'start',
>> 'vdsmd.service'), rc=1
>> 2017-08-19 16:38:44,630-0400 DEBUG otopi.plugins.otopi.services.systemd
>> plugin.execute:921 execute-output: ('/bin/systemctl', 'start',
>> 'vdsmd.service') stdout:
>>
>>
>> 2017-08-19 16:38:44,630-0400 DEBUG otopi.plugins.otopi.services.systemd
>> plugin.execute:926 execute-output: ('/bin/systemctl', 'start',
>> 'vdsmd.service') stderr:
>> Job for vdsmd.service failed because the control process exited with error
>> code. See "systemctl status vdsmd.service" and "journalctl -xe" for details.
>>
>> 2017-08-19 16:38:44,631-0400 DEBUG otopi.context
>> context._executeMethod:142 method exception
>> Traceback (most recent call last):
>>   File "/tmp/ovirt-dunwHj8Njn/pythonlib/otopi/context.py", line 132, in
>> _executeMethod
>>     method['method']()
>>   File
>> "/tmp/ovirt-dunwHj8Njn/otopi-plugins/ovirt-host-deploy/vdsm/packages.py",
>> line 224, in _start
>>     self.services.state('vdsmd', True)
>>   File "/tmp/ovirt-dunwHj8Njn/otopi-plugins/otopi/services/systemd.py",
>> line 141, in state
>>     service=name,
>> RuntimeError: Failed to start service 'vdsmd'
>>
>>
>> From /var/log/messages:
>>
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh: Error:
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh: One of
>> the modules is not configured to work with VDSM.
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh: To
>> configure the module use the following:
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh:
>> 'vdsm-tool configure [--module module-name]'.
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh: If all
>> modules are not configured try to use:
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh:
>> 'vdsm-tool configure --force'
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh: (The
>> force flag will stop the module's service and start it
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh:
>> afterwards automatically to load the new configuration.)
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh: abrt
>> is already configured for vdsm
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh: lvm is
>> configured for vdsm
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh:
>> libvirt is already configured for vdsm
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh:
>> multipath requires configuration
>> Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh:
>> Modules sanlock, multipath are not configured

This means the host was not deployed correctly. When deploying vdsm,
host-deploy must run "vdsm-tool configure --force", which configures
multipath and sanlock.
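
For anyone triaging similar failures, a rough sketch of pulling the
unconfigured modules out of /var/log/messages follows. It only relies
on the "Modules ... are not configured" line quoted above and on the
`vdsm-tool configure [--module module-name]` form that the init script
itself prints. The helper names `unconfigured_modules` and
`suggest_commands` are invented for this example, and the sketch does
not run anything on the host.

```python
import re

# Matches e.g. "Modules sanlock, multipath are not configured"
NOT_CONFIGURED = re.compile(r"Modules (?P<mods>.+?) are not configured")


def unconfigured_modules(log_text):
    """Return module names reported as not configured, in first-seen order."""
    mods = []
    for m in NOT_CONFIGURED.finditer(log_text):
        for name in m.group("mods").split(","):
            name = name.strip()
            if name and name not in mods:
                mods.append(name)
    return mods


def suggest_commands(mods):
    """Build the per-module command lines hinted at by vdsmd_init_common.sh."""
    return ["vdsm-tool configure --module {}".format(m) for m in mods]


if __name__ == "__main__":
    # Unwrapped copy of the relevant line from the log quoted above.
    sample = (
        "Aug 19 16:38:44 lago-basic-suite-master-host0 vdsmd_init_common.sh: "
        "Modules sanlock, multipath are not configured\n"
    )
    for cmd in suggest_commands(unconfigured_modules(sample)):
        print(cmd)
```

This only tells us which configurators were skipped or failed; why
host-deploy did not configure them still has to come from the
host-deploy log.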

We have not changed anything in the multipath and sanlock configurators lately.

Didi, can you check this?