
Hi list!

On a hyperconverged cluster with three hosts I am unable to start ovirt-ha-agent.

The history: All three hosts were running CentOS 8, so as a test I upgraded host3 to CentOS 8 Stream first, leaving host1, host2 and all VMs untouched. After the upgrade, every migration of a VM to host3 failed with:

```
qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.0/pcie-root-port'
2021-12-24T00:56:49.428234Z qemu-kvm: load of migration failed: Invalid argument
```

Since I haven't had time to dig into that, I decided to roll back the upgrade: I rebooted host3 into CentOS 8 again and re-installed it through the engine appliance. During that process (and the restart of host3) the engine appliance became unresponsive and crashed.

The problem: Currently the ovirt-ha-agent service fails on all hosts, with the following messages in /var/log/ovirt-hosted-engine-ha/agent.log (the broker-side checks I ran are in the P.P.S. below):

```
MainThread::INFO::2021-12-24 03:56:03,500::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.9 started
MainThread::INFO::2021-12-24 03:56:03,516::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-12-24 03:56:03,575::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2021-12-24 03:56:03,576::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': 'GATEWAY_IP', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
MainThread::ERROR::2021-12-24 03:56:03,577::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
```

I stumbled upon bug [1984262](https://bugzilla.redhat.com/show_bug.cgi?id=1984262), but it doesn't seem to apply: all hosts resolve properly, all hosts have proper hostnames set, unique /etc/hosts entries and proper A records (in the form hostname.subdomain.domain.tld). See the P.S. below for how I verified this.

The versions involved:

```
[root@host2 ~]# rpm -qa ovirt*
ovirt-hosted-engine-setup-2.5.4-2.el8.noarch
ovirt-imageio-daemon-2.3.0-1.el8.x86_64
ovirt-host-dependencies-4.4.9-2.el8.x86_64
ovirt-vmconsole-1.0.9-1.el8.noarch
ovirt-imageio-client-2.3.0-1.el8.x86_64
ovirt-host-4.4.9-2.el8.x86_64
ovirt-python-openvswitch-2.11-1.el8.noarch
ovirt-openvswitch-ovn-host-2.11-1.el8.noarch
ovirt-provider-ovn-driver-1.2.34-1.el8.noarch
ovirt-openvswitch-ovn-2.11-1.el8.noarch
ovirt-release44-4.4.9.2-1.el8.noarch
ovirt-openvswitch-2.11-1.el8.noarch
ovirt-ansible-collection-1.6.5-1.el8.noarch
ovirt-openvswitch-ovn-common-2.11-1.el8.noarch
ovirt-hosted-engine-ha-2.4.9-1.el8.noarch
ovirt-vmconsole-host-1.0.9-1.el8.noarch
ovirt-imageio-common-2.3.0-1.el8.x86_64
```

Any hint on how to fix this would be much appreciated. I'd like to get the engine appliance back, then remove host3 and re-initialize it, since this is a production cluster (host1 and host2 replicate the gluster storage, host3 acts as the arbiter).

Thanks in advance,
Martin
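
P.S. For reference, a minimal sketch of how I checked name resolution on each host (the FQDNs below are placeholders for our actual host names):

```bash
# On each host: the FQDN the host reports for itself
hostname -f

# Resolution through the system resolver (NSS: /etc/hosts first, then DNS)
getent hosts host1.subdomain.domain.tld
getent hosts host2.subdomain.domain.tld
getent hosts host3.subdomain.domain.tld

# The A record straight from DNS, bypassing /etc/hosts
dig +short A host1.subdomain.domain.tld
```

All of these return the expected, unique addresses on all three hosts.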
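
P.P.S. Since the agent dies while initializing the ha-broker connection, I also looked at the broker side. A sketch of those checks, in case the output helps (I'm assuming the broker log sits next to agent.log in the standard location):

```bash
# Is the broker itself up? The agent can't start its monitors without it.
systemctl status ovirt-ha-broker

# Broker-side view of the same time window as the agent errors above
tail -n 50 /var/log/ovirt-hosted-engine-ha/broker.log
journalctl -u ovirt-ha-broker --since "2021-12-24 03:50"

# Overall HA state as seen from this host
hosted-engine --vm-status
```

Happy to post the output of any of these if that helps narrow things down.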