On Sun, Dec 26, 2021 at 12:24 PM <martin@fulmo.org> wrote:
Hi list!
On a hyperconverged cluster with three hosts I am unable to start the ovirt-ha-agent.
The history:
As all three hosts were running CentOS 8, I tried to upgrade host3 to CentOS 8 Stream
first and left all VMs and host1 and host2 untouched, basically as a test. All
migrations of VMs to host3 then failed with:
```
qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.0/pcie-root-port'
2021-12-24T00:56:49.428234Z qemu-kvm: load of migration failed: Invalid argument
```
IIRC something similar was reported on the lists - that you can't
(always? easily?) migrate VMs between CentOS Linux 8 (.3? .4? not
sure) and current Stream. Is this mandatory for you? If not, you might
test stopping/starting your VMs in a test environment instead and decide
whether that is good enough.
Since I haven't had the time to dig into that, I decided to roll back the upgrade,
rebooted host3 into CentOS 8 again, and re-installed host3 through the engine
appliance. During that process (and the restart of host3) the engine appliance became
unresponsive and crashed.
Perhaps provide more details, if you have them. Did you put host3 to
maintenance? Remove it? etc.
The problem:
Currently all ovirt-ha-agent services on all hosts fail with the following message in
/var/log/ovirt-hosted-engine-ha/agent.log
```
MainThread::INFO::2021-12-24
03:56:03,500::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
ovirt-hosted-engine-ha agent 2.4.9 started
MainThread::INFO::2021-12-24
03:56:03,516::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname)
Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-12-24
03:56:03,575::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
Initializing ha-broker connection
MainThread::INFO::2021-12-24
03:56:03,576::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor)
Starting monitor network, options {'addr': 'GATEWAY_IP',
'network_test': 'dns', 'tcp_t_address': '',
'tcp_t_port': ''}
Not sure where 'GATEWAY_IP' comes from, but it should be the actual IP
address, e.g.:
MainThread::INFO::2021-12-20
07:51:05,151::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor)
Starting monitor network, options {'addr': '192.168.201.1',
'network_test': 'dns', 'tcp_t_address': '',
'tcp_t_port': ''}
Check (and perhaps fix manually, if you can't/do not want to first
diagnose/fix your reinstallation) the line 'gateway=' in
/etc/ovirt-hosted-engine/hosted-engine.conf. Perhaps compare this
file to your other hosts - only the line 'host_id' should differ
between them.
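A rough sketch of how one might check and compare that file and restart the
HA services afterwards - hostnames below are placeholders, and restarting only
helps once the value is actually correct:
```
# Current gateway/host_id values on this host:
grep -E '^(gateway|host_id)=' /etc/ovirt-hosted-engine/hosted-engine.conf

# The default gateway the host is really using:
ip route show default

# Compare against another host (host1 is a placeholder);
# only the host_id line is expected to differ:
diff <(ssh host1 cat /etc/ovirt-hosted-engine/hosted-engine.conf) \
     /etc/ovirt-hosted-engine/hosted-engine.conf

# After fixing gateway=, restart the HA services:
systemctl restart ovirt-ha-broker ovirt-ha-agent
```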
MainThread::ERROR::2021-12-24
03:56:03,577::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
Failed to start necessary monitors
```
Now I've stumbled upon [bug 1984262](https://bugzilla.redhat.com/show_bug.cgi?id=1984262),
but it doesn't seem to apply. All hosts resolve properly, all hosts also have proper
hostnames set, unique /etc/hosts entries and proper A records set (in the form of
hostname.subdomain.domain.tld).
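(Resolution can be double-checked with the usual tools along these lines - the
FQDN and IP below are placeholders:)
```
hostname -f                                # should print hostX.subdomain.domain.tld
getent hosts host1.subdomain.domain.tld    # what libc sees (/etc/hosts + DNS)
dig +short host1.subdomain.domain.tld      # forward A record
dig +short -x 192.0.2.11                   # reverse lookup of the host's IP
```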
The versions involved are:
```
[root@host2 ~]# rpm -qa ovirt*
ovirt-hosted-engine-setup-2.5.4-2.el8.noarch
ovirt-imageio-daemon-2.3.0-1.el8.x86_64
ovirt-host-dependencies-4.4.9-2.el8.x86_64
ovirt-vmconsole-1.0.9-1.el8.noarch
ovirt-imageio-client-2.3.0-1.el8.x86_64
ovirt-host-4.4.9-2.el8.x86_64
ovirt-python-openvswitch-2.11-1.el8.noarch
ovirt-openvswitch-ovn-host-2.11-1.el8.noarch
ovirt-provider-ovn-driver-1.2.34-1.el8.noarch
ovirt-openvswitch-ovn-2.11-1.el8.noarch
ovirt-release44-4.4.9.2-1.el8.noarch
ovirt-openvswitch-2.11-1.el8.noarch
ovirt-ansible-collection-1.6.5-1.el8.noarch
ovirt-openvswitch-ovn-common-2.11-1.el8.noarch
ovirt-hosted-engine-ha-2.4.9-1.el8.noarch
ovirt-vmconsole-host-1.0.9-1.el8.noarch
ovirt-imageio-common-2.3.0-1.el8.x86_64
```
Any hint on how to fix this is really appreciated. I'd like to get the engine appliance
back, remove host3, and re-initialize it, since this is a production cluster (with host1
and host2 replicating the gluster storage and host3 acting as an arbiter).
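Once ovirt-ha-broker/ovirt-ha-agent start cleanly again, getting the engine VM
back is usually driven from the hosted-engine CLI; a minimal sketch (check the
status output first and adapt - this is not a substitute for diagnosing the
failed reinstallation):
```
# How the HA agents currently see the engine VM:
hosted-engine --vm-status

# Make sure the cluster is not in global maintenance, so the agents may
# start the engine VM on their own:
hosted-engine --set-maintenance --mode=none

# Or start it explicitly on one healthy host:
hosted-engine --vm-start
```
Removing and re-adding host3 is then usually done from the engine UI
(maintenance, remove, re-install/redeploy) once the engine is reachable again.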
OK. Good luck and best regards,
--
Didi