
Hi list!

On a hyperconverged cluster with three hosts I am unable to start ovirt-ha-agent.

The history: All three hosts were running CentOS 8, so as a test I upgraded host3 to CentOS 8 Stream first, leaving host1, host2 and all VMs untouched. After the upgrade, every migration of a VM to host3 failed with:

```
qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.0/pcie-root-port'
2021-12-24T00:56:49.428234Z qemu-kvm: load of migration failed: Invalid argument
```

Since I haven't had time to dig into that, I decided to roll back the upgrade: I rebooted host3 into CentOS 8 again and re-installed it through the engine appliance. During that process (and the restart of host3) the engine appliance became unresponsive and crashed.

The problem: Currently the ovirt-ha-agent service fails on all hosts, with the following messages in /var/log/ovirt-hosted-engine-ha/agent.log (the broker-side checks I ran are in the P.P.S. below):

```
MainThread::INFO::2021-12-24 03:56:03,500::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.9 started
MainThread::INFO::2021-12-24 03:56:03,516::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-12-24 03:56:03,575::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2021-12-24 03:56:03,576::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': 'GATEWAY_IP', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
MainThread::ERROR::2021-12-24 03:56:03,577::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
```

I stumbled upon bug [1984262](https://bugzilla.redhat.com/show_bug.cgi?id=1984262), but it doesn't seem to apply: all hosts resolve properly, all hosts have proper hostnames set, unique /etc/hosts entries and proper A records (in the form hostname.subdomain.domain.tld). See the P.S. below for how I verified this.

The versions involved:

```
[root@host2 ~]# rpm -qa ovirt*
ovirt-hosted-engine-setup-2.5.4-2.el8.noarch
ovirt-imageio-daemon-2.3.0-1.el8.x86_64
ovirt-host-dependencies-4.4.9-2.el8.x86_64
ovirt-vmconsole-1.0.9-1.el8.noarch
ovirt-imageio-client-2.3.0-1.el8.x86_64
ovirt-host-4.4.9-2.el8.x86_64
ovirt-python-openvswitch-2.11-1.el8.noarch
ovirt-openvswitch-ovn-host-2.11-1.el8.noarch
ovirt-provider-ovn-driver-1.2.34-1.el8.noarch
ovirt-openvswitch-ovn-2.11-1.el8.noarch
ovirt-release44-4.4.9.2-1.el8.noarch
ovirt-openvswitch-2.11-1.el8.noarch
ovirt-ansible-collection-1.6.5-1.el8.noarch
ovirt-openvswitch-ovn-common-2.11-1.el8.noarch
ovirt-hosted-engine-ha-2.4.9-1.el8.noarch
ovirt-vmconsole-host-1.0.9-1.el8.noarch
ovirt-imageio-common-2.3.0-1.el8.x86_64
```

Any hint on how to fix this would be much appreciated. I'd like to get the engine appliance back, then remove host3 and re-initialize it, since this is a production cluster (host1 and host2 replicate the gluster storage, host3 acts as the arbiter).

Thanks in advance,
Martin
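
P.S. For reference, a minimal sketch of how I checked name resolution on each host (the FQDNs below are placeholders for our actual host names):

```bash
# On each host: the FQDN the host reports for itself
hostname -f

# Resolution through the system resolver (NSS: /etc/hosts first, then DNS)
getent hosts host1.subdomain.domain.tld
getent hosts host2.subdomain.domain.tld
getent hosts host3.subdomain.domain.tld

# The A record straight from DNS, bypassing /etc/hosts
dig +short A host1.subdomain.domain.tld
```

All of these return the expected, unique addresses on all three hosts.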
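
P.P.S. Since the agent dies while initializing the ha-broker connection, I also looked at the broker side. A sketch of those checks, in case the output helps (I'm assuming the broker log sits next to agent.log in the standard location):

```bash
# Is the broker itself up? The agent can't start its monitors without it.
systemctl status ovirt-ha-broker

# Broker-side view of the same time window as the agent errors above
tail -n 50 /var/log/ovirt-hosted-engine-ha/broker.log
journalctl -u ovirt-ha-broker --since "2021-12-24 03:50"

# Overall HA state as seen from this host
hosted-engine --vm-status
```

Happy to post the output of any of these if that helps narrow things down.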