Let's start a new thread more focused on the subject
I'm testing a single-host HCI deployment using the oVirt Node NG CentOS 7 ISO.
I was able to complete the Gluster setup via Cockpit with the following
modifications:
1) I wanted to log in via SSH and found that the *key files under /etc/ssh/
had permissions that were too open, so the SSH daemon didn't start after
installing the node from the ISO.
Changing them to 600 and restarting the service fixed it.
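For reference, the effect of that fix can be sketched like this (on a throwaway file rather than the real keys; on the node the files in question are the /etc/ssh/ssh_host_*_key ones):

```python
import os
import stat
import tempfile

# sshd refuses to start when host private keys are group/world-readable;
# on the node the fix was chmod 600 on the /etc/ssh/ssh_host_*_key files.
# Demonstrated here on a stand-in temporary file.
fd, key_path = tempfile.mkstemp()
os.close(fd)
os.chmod(key_path, 0o644)  # too-open mode, as found after the ISO install
os.chmod(key_path, 0o600)  # the fix
mode_after = stat.S_IMODE(os.stat(key_path).st_mode)
print(oct(mode_after))
```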
2) I used a single disk configured as JBOD, so I chose that option instead
of the default proposed RAID6.
But the playbook failed with:
. . .
PLAY [gluster_servers] *********************************************************

TASK [Create LVs with specified size for the VGs] ******************************
changed: [192.168.124.211] => (item={u'lv': u'gluster_thinpool_sdb', u'size': u'50GB', u'extent': u'100%FREE', u'vg': u'gluster_vg_sdb'})

PLAY RECAP *********************************************************************
192.168.124.211 : ok=1 changed=1 unreachable=0 failed=0

Ignoring errors...
Error: Section diskcount not found in the configuration file
Reading the playbooks involved here:
/usr/share/gdeploy/playbooks/auto_lvcreate_for_gluster.yml
/usr/share/gdeploy/playbooks/vgcreate.yml
and this snippet:

- name: Convert the logical volume
  lv: action=convert thinpool={{ item.vg }}/{{ item.pool }}
      poolmetadata={{ item.vg }}/'metadata' poolmetadataspare=n
      vgname={{ item.vg }} disktype="{{ disktype }}"
      diskcount="{{ diskcount }}" stripesize="{{ stripesize }}"
      chunksize="{{ chunksize | default('') }}"
      snapshot_reserve="{{ snapshot_reserve }}"
  with_items: "{{ lvpools }}"
  ignore_errors: yes
I simply edited gdeploy.conf via the GUI button, adding this section under
the [disktype] one:

[diskcount]
1
then cleaned up the LVs/VGs/PVs, and the gdeploy step completed successfully.
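For clarity, the relevant part of gdeploy.conf after the edit looked roughly like this (the jbod value reflects my disktype choice; exact surrounding content may differ in your generated file):

```
[disktype]
jbod

[diskcount]
1
```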
3) At the first stage of the Ansible deploy I got this failed command, which
seems not to prevent completion, but which I have not understood:
PLAY [gluster_servers] *********************************************************

TASK [Run a command in the shell] **********************************************
failed: [192.168.124.211] (item=vdsm-tool configure --force) => {"changed": true,
"cmd": "vdsm-tool configure --force", "delta": "0:00:01.475528",
"end": "2019-01-11 10:59:55.147601", "item": "vdsm-tool configure --force",
"msg": "non-zero return code", "rc": 1, "start": "2019-01-11 10:59:53.672073",
"stderr": "Traceback (most recent call last):\n File \"/usr/bin/vdsm-tool\", line 220, in main\n return tool_command[cmd][\"command\"](*args)\n File \"/usr/lib/python2.7/site-packages/vdsm/tool/__init__.py\", line 40, in wrapper\n func(*args, **kwargs)\n File \"/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py\", line 143, in configure\n _configure(c)\n File \"/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py\", line 90, in _configure\n getattr(module, 'configure', lambda: None)()\n File \"/usr/lib/python2.7/site-packages/vdsm/tool/configurators/bond_defaults.py\", line 39, in configure\n sysfs_options_mapper.dump_bonding_options()\n File \"/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py\", line 48, in dump_bonding_options\n with open(sysfs_options.BONDING_DEFAULTS, 'w') as f:\nIOError: [Errno 2] No such file or directory: '/var/run/vdsm/bonding-defaults.json'",
"stderr_lines": ["Traceback (most recent call last):", " File \"/usr/bin/vdsm-tool\", line 220, in main", " return tool_command[cmd][\"command\"](*args)", " File \"/usr/lib/python2.7/site-packages/vdsm/tool/__init__.py\", line 40, in wrapper", " func(*args, **kwargs)", " File \"/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py\", line 143, in configure", " _configure(c)", " File \"/usr/lib/python2.7/site-packages/vdsm/tool/configurator.py\", line 90, in _configure", " getattr(module, 'configure', lambda: None)()", " File \"/usr/lib/python2.7/site-packages/vdsm/tool/configurators/bond_defaults.py\", line 39, in configure", " sysfs_options_mapper.dump_bonding_options()", " File \"/usr/lib/python2.7/site-packages/vdsm/network/link/bond/sysfs_options_mapper.py\", line 48, in dump_bonding_options", " with open(sysfs_options.BONDING_DEFAULTS, 'w') as f:", "IOError: [Errno 2] No such file or directory: '/var/run/vdsm/bonding-defaults.json'"],
"stdout": "\nChecking configuration status...\n\nabrt is already configured for vdsm\nlvm is configured for vdsm\nlibvirt is already configured for vdsm\nSUCCESS: ssl configured to true. No conflicts\nManual override for multipath.conf detected - preserving current configuration\nThis manual override for multipath.conf was based on downrevved template. You are strongly advised to contact your support representatives\n\nRunning configure...\nReconfiguration of abrt is done.\nReconfiguration of passwd is done.\nReconfiguration of libvirt is done.",
"stdout_lines": ["", "Checking configuration status...", "", "abrt is already configured for vdsm", "lvm is configured for vdsm", "libvirt is already configured for vdsm", "SUCCESS: ssl configured to true. No conflicts", "Manual override for multipath.conf detected - preserving current configuration", "This manual override for multipath.conf was based on downrevved template. You are strongly advised to contact your support representatives", "", "Running configure...", "Reconfiguration of abrt is done.", "Reconfiguration of passwd is done.", "Reconfiguration of libvirt is done."]}
        to retry, use: --limit @/tmp/tmpQXe2el/shell_cmd.retry

PLAY RECAP *********************************************************************
192.168.124.211 : ok=0 changed=0 unreachable=0 failed=1
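If I read the traceback right, the IOError is on opening /var/run/vdsm/bonding-defaults.json for writing; since open(..., 'w') creates the file but not its directory, my guess is that /var/run/vdsm itself did not exist yet. A minimal reproduction of that failure mode, on stand-in paths:

```python
import errno
import os
import tempfile

# open(..., 'w') raises [Errno 2] when the parent directory is missing,
# which would explain the vdsm-tool traceback above. Stand-in paths only.
root = tempfile.mkdtemp()
target = os.path.join(root, "vdsm", "bonding-defaults.json")

try:
    open(target, "w")
    failed = False
except IOError as e:  # IOError, matching the Python 2 traceback in the log
    failed = (e.errno == errno.ENOENT)

os.makedirs(os.path.dirname(target))  # what creating the directory first would fix
open(target, "w").close()
print(failed, os.path.exists(target))
```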
Would it be possible to save the Ansible playbook log in some way, even when
it completes OK, without going directly to the "successful" page?
Or is it stored anyway somewhere on the host's disk?
I then proceeded with the Hosted Engine install/setup, and
4) it fails at the final stages of the local engine VM setup, during host
activation:
[ INFO ] TASK [oVirt.hosted-engine-setup : Set Engine public key as authorized key without validating the TLS/SSL certificates]
[ INFO ] changed: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Obtain SSO token using username/password credentials]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Ensure that the target datacenter is present]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Ensure that the target cluster is present in the target datacenter]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Enable GlusterFS at cluster level]
[ INFO ] changed: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Set VLAN ID at datacenter level]
[ INFO ] skipping: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Force host-deploy in offline mode]
[ INFO ] changed: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Add host]
[ INFO ] changed: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Wait for the host to be up]
then after several minutes:
[ ERROR ] fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": [{"address": "ov4301.localdomain.local",
"affinity_labels": [], "auto_numa_status": "unknown",
"certificate": {"organization": "localdomain.local", "subject": "O=localdomain.local,CN=ov4301.localdomain.local"},
"cluster": {"href": "/ovirt-engine/api/clusters/5e8fea14-158b-11e9-b2f0-00163e29b9f2", "id": "5e8fea14-158b-11e9-b2f0-00163e29b9f2"},
"comment": "", "cpu": {"speed": 0.0, "topology": {}},
"device_passthrough": {"enabled": false}, "devices": [],
"external_network_provider_configurations": [], "external_status": "ok",
"hardware_information": {"supported_rng_sources": []}, "hooks": [],
"href": "/ovirt-engine/api/hosts/4202de75-75d3-4dcb-b128-2c4a1d257a15", "id": "4202de75-75d3-4dcb-b128-2c4a1d257a15",
"katello_errata": [], "kdump_status": "unknown", "ksm": {"enabled": false},
"max_scheduling_memory": 0, "memory": 0, "name": "ov4301.localdomain.local",
"network_attachments": [], "nics": [], "numa_nodes": [], "numa_supported": false,
"os": {"custom_kernel_cmdline": ""}, "permissions": [], "port": 54321,
"power_management": {"automatic_pm_enabled": true, "enabled": false, "kdump_detection": true, "pm_proxies": []},
"protocol": "stomp", "se_linux": {}, "spm": {"priority": 5, "status": "none"},
"ssh": {"fingerprint": "SHA256:iqeQjdWCm15+xe74xEnswrgRJF7JBAWrvsjO/RaW8q8", "port": 22},
"statistics": [], "status": "install_failed", "storage_connection_extensions": [],
"summary": {"total": 0}, "tags": [], "transparent_huge_pages": {"enabled": false},
"type": "ovirt_node", "unmanaged_networks": [], "update_available": false}]},
"attempts": 120, "changed": false}
[ INFO ] TASK [oVirt.hosted-engine-setup : Fetch logs from the engine VM]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Set destination directory path]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Create destination directory]
[ INFO ] changed: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Find the local appliance image]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : debug]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Set local_vm_disk_path]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Give the vm time to flush dirty buffers]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Copy engine logs]
[ INFO ] TASK [oVirt.hosted-engine-setup : include_tasks]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Remove local vm dir]
[ INFO ] changed: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : debug]
[ INFO ] ok: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Remove temporary entry in /etc/hosts for the local VM]
[ INFO ] changed: [localhost]
[ INFO ] TASK [oVirt.hosted-engine-setup : Notify the user about a failure]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
Looking at the log
/var/log/ovirt-engine/host-deploy/ovirt-host-deploy-20190111113227-ov4301.localdomain.local-5d387e0d.log
it seems the error is about ovirt-imageio-daemon:
2019-01-11 11:32:26,893+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.executeRaw:863 execute-result: ('/usr/bin/systemctl', 'start', 'ovirt-imageio-daemon.service'), rc=1
2019-01-11 11:32:26,894+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:921 execute-output: ('/usr/bin/systemctl', 'start', 'ovirt-imageio-daemon.service') stdout:
2019-01-11 11:32:26,895+0100 DEBUG otopi.plugins.otopi.services.systemd plugin.execute:926 execute-output: ('/usr/bin/systemctl', 'start', 'ovirt-imageio-daemon.service') stderr:
Job for ovirt-imageio-daemon.service failed because the control process exited with error code. See "systemctl status ovirt-imageio-daemon.service" and "journalctl -xe" for details.
2019-01-11 11:32:26,896+0100 DEBUG otopi.context context._executeMethod:143 method exception
Traceback (most recent call last):
  File "/tmp/ovirt-PBFI2dyoDO/pythonlib/otopi/context.py", line 133, in _executeMethod
    method['method']()
  File "/tmp/ovirt-PBFI2dyoDO/otopi-plugins/ovirt-host-deploy/vdsm/packages.py", line 175, in _start
    self.services.state('ovirt-imageio-daemon', True)
  File "/tmp/ovirt-PBFI2dyoDO/otopi-plugins/otopi/services/systemd.py", line 141, in state
    service=name,
RuntimeError: Failed to start service 'ovirt-imageio-daemon'
2019-01-11 11:32:26,898+0100 ERROR otopi.context context._executeMethod:152 Failed to execute stage 'Closing up': Failed to start service 'ovirt-imageio-daemon'
2019-01-11 11:32:26,899+0100 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND **%EventEnd STAGE closeup METHOD otopi.plugins.ovirt_host_deploy.vdsm.packages.Plugin._start (odeploycons.packages.vdsm.started)
The reason:
[root@ov4301 ~]# systemctl status ovirt-imageio-daemon -l
● ovirt-imageio-daemon.service - oVirt ImageIO Daemon
   Loaded: loaded (/usr/lib/systemd/system/ovirt-imageio-daemon.service; disabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2019-01-11 11:32:29 CET; 27min ago
  Process: 11625 ExecStart=/usr/bin/ovirt-imageio-daemon (code=exited, status=1/FAILURE)
 Main PID: 11625 (code=exited, status=1/FAILURE)

Jan 11 11:32:29 ov4301.localdomain.local systemd[1]: ovirt-imageio-daemon.service: main process exited, code=exited, status=1/FAILURE
Jan 11 11:32:29 ov4301.localdomain.local systemd[1]: Failed to start oVirt ImageIO Daemon.
Jan 11 11:32:29 ov4301.localdomain.local systemd[1]: Unit ovirt-imageio-daemon.service entered failed state.
Jan 11 11:32:29 ov4301.localdomain.local systemd[1]: ovirt-imageio-daemon.service failed.
Jan 11 11:32:29 ov4301.localdomain.local systemd[1]: ovirt-imageio-daemon.service holdoff time over, scheduling restart.
Jan 11 11:32:29 ov4301.localdomain.local systemd[1]: Stopped oVirt ImageIO Daemon.
Jan 11 11:32:29 ov4301.localdomain.local systemd[1]: start request repeated too quickly for ovirt-imageio-daemon.service
Jan 11 11:32:29 ov4301.localdomain.local systemd[1]: Failed to start oVirt ImageIO Daemon.
Jan 11 11:32:29 ov4301.localdomain.local systemd[1]: Unit ovirt-imageio-daemon.service entered failed state.
Jan 11 11:32:29 ov4301.localdomain.local systemd[1]: ovirt-imageio-daemon.service failed.
[root@ov4301 ~]#
The file /var/log/ovirt-imageio-daemon/daemon.log contains
2019-01-11 10:28:30,191 INFO    (MainThread) [server] Starting (pid=3702, version=1.4.6)
2019-01-11 10:28:30,229 ERROR   (MainThread) [server] Service failed (remote_service=<ovirt_imageio_daemon.server.RemoteService object at 0x7fea9dc88050>, local_service=<ovirt_imageio_daemon.server.LocalService object at 0x7fea9ca24850>, control_service=None, running=True)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/server.py", line 58, in main
    start(config)
  File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/server.py", line 99, in start
    control_service = ControlService(config)
  File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/server.py", line 206, in __init__
    config.tickets.socket, uhttp.UnixWSGIRequestHandler)
  File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
    self.server_bind()
  File "/usr/lib/python2.7/site-packages/ovirt_imageio_daemon/uhttp.py", line 79, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 2] No such file or directory
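The final error comes from self.socket.bind(self.server_address), and it looks like the same class of problem as the vdsm-tool one: binding a unix socket whose parent directory does not exist fails with [Errno 2]. A quick reproduction on a stand-in path (the daemon's real socket path comes from its config, config.tickets.socket, which I don't know here):

```python
import errno
import os
import socket
import tempfile

# bind() on an AF_UNIX socket requires the parent directory of the socket
# path to exist; otherwise it raises [Errno 2], as seen in daemon.log.
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
missing_path = os.path.join(tempfile.mkdtemp(), "no-such-dir", "daemon.sock")

try:
    sock.bind(missing_path)
    bind_errno = None
except OSError as e:
    bind_errno = e.errno
finally:
    sock.close()

print(bind_errno == errno.ENOENT)
```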
One potential problem I noticed: on this host I set up eth0 with
192.168.122.x (for ovirtmgmt) and eth1 with 192.168.124.y (for Gluster, even
if there is only one host for now, aiming at adding another 2 hosts in a
second step), and the libvirt network temporarily created for the local
engine VM is also on the 192.168.124.0 network...
4: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:b8:6b:3c brd ff:ff:ff:ff:ff:ff
    inet 192.168.124.1/24 brd 192.168.124.255 scope global virbr0
       valid_lft forever preferred_lft forever
5: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN group default qlen 1000
    link/ether 52:54:00:b8:6b:3c brd ff:ff:ff:ff:ff:ff
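Just to make the suspected clash explicit, the Gluster address and virbr0 really do sit on the same subnet (the .211 address is the one used by gdeploy above; I'm assuming a /24 on eth1 to match virbr0):

```python
import ipaddress

# eth1 carries the gluster address used in the playbooks (192.168.124.211,
# assumed /24), while libvirt's temporary network puts 192.168.124.1/24 on
# virbr0 -- same subnet, hence the suspected conflict.
gluster_if = ipaddress.ip_interface("192.168.124.211/24")
virbr0_if = ipaddress.ip_interface("192.168.124.1/24")

same_subnet = gluster_if.network == virbr0_if.network
print(same_subnet)
```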
I can change the Gluster network of this environment and re-test, but would
it be possible to make the libvirt network configurable? It seems risky to
have a fixed one...
Can I go ahead from this failed hosted engine, once I understand the reason
for the ovirt-imageio-daemon failure, or am I forced to start from scratch?
Supposing I power this host down and then on again, how can I retry without
starting from scratch?
Gianluca