Hi list!
On a hyperconverged cluster with three hosts, I am unable to start the ovirt-ha-agent service.
The history:
All three hosts were running CentOS 8, so as a test I upgraded host3 to CentOS 8 Stream
first and left all VMs and host1 and host2 untouched. All migrations of VMs to host3
then failed with:
```
qemu-kvm: error while loading state for instance 0x0 of device
'0000:00:01.0/pcie-root-port'
2021-12-24T00:56:49.428234Z qemu-kvm: load of migration failed: Invalid argument
```
Since I haven't had the time to dig into that, I decided to roll back the upgrade,
rebooted host3 into CentOS 8 again, and re-installed it through the engine
appliance. During that process (and the restart of host3) the engine appliance became
unresponsive and crashed.
The problem:
Currently the ovirt-ha-agent service fails on all hosts with the following messages in
/var/log/ovirt-hosted-engine-ha/agent.log:
```
MainThread::INFO::2021-12-24 03:56:03,500::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.9 started
MainThread::INFO::2021-12-24 03:56:03,516::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-12-24 03:56:03,575::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2021-12-24 03:56:03,576::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': 'GATEWAY_IP', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
MainThread::ERROR::2021-12-24 03:56:03,577::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
```
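Since the agent gives up right after "Initializing ha-broker connection", the ha-broker itself
looks like the next suspect. For what it's worth, the checks I know to run on each host are
(standard service and log locations, as far as I know):
```
# is the broker the agent is trying to reach actually running?
systemctl status ovirt-ha-broker

# the broker's side of the conversation
tail -n 50 /var/log/ovirt-hosted-engine-ha/broker.log

# combined journal for both services
journalctl -u ovirt-ha-broker -u ovirt-ha-agent --since today
```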
Now I've stumbled upon [bug 1984262](https://bugzilla.redhat.com/show_bug.cgi?id=1984262),
but it doesn't seem to apply: all hosts resolve properly, have proper hostnames set,
unique /etc/hosts entries, and matching A records (of the form
hostname.subdomain.domain.tld).
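(For the record, this is how I verified resolution on each host: the FQDN, the local
lookup, and the DNS A record all agree.)
```
# FQDN as the host sees itself
hostname -f

# local lookup via NSS (/etc/hosts included)
getent hosts "$(hostname -f)"

# A record straight from DNS
dig +short A "$(hostname -f)"
```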
The versions involved are:
```
[root@host2 ~]# rpm -qa ovirt*
ovirt-hosted-engine-setup-2.5.4-2.el8.noarch
ovirt-imageio-daemon-2.3.0-1.el8.x86_64
ovirt-host-dependencies-4.4.9-2.el8.x86_64
ovirt-vmconsole-1.0.9-1.el8.noarch
ovirt-imageio-client-2.3.0-1.el8.x86_64
ovirt-host-4.4.9-2.el8.x86_64
ovirt-python-openvswitch-2.11-1.el8.noarch
ovirt-openvswitch-ovn-host-2.11-1.el8.noarch
ovirt-provider-ovn-driver-1.2.34-1.el8.noarch
ovirt-openvswitch-ovn-2.11-1.el8.noarch
ovirt-release44-4.4.9.2-1.el8.noarch
ovirt-openvswitch-2.11-1.el8.noarch
ovirt-ansible-collection-1.6.5-1.el8.noarch
ovirt-openvswitch-ovn-common-2.11-1.el8.noarch
ovirt-hosted-engine-ha-2.4.9-1.el8.noarch
ovirt-vmconsole-host-1.0.9-1.el8.noarch
ovirt-imageio-common-2.3.0-1.el8.x86_64
```
Any hint on how to fix this would be really appreciated. I'd like to get the engine appliance
back, then remove host3 and re-initialize it, since this is a production cluster (with host1
and host2 replicating the Gluster storage and host3 acting as the arbiter).
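My rough plan, once the agents are healthy again, would be along these lines (standard
hosted-engine CLI as far as I understand it; please correct me if this is the wrong approach):
```
# keep the agents from interfering while recovering
hosted-engine --set-maintenance --mode=global

# check the hosted-engine state on each host
hosted-engine --vm-status

# try to start the engine VM manually on a known-good host
hosted-engine --vm-start
```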
Thanks in advance, Martin