
On Sun, Dec 26, 2021 at 12:24 PM <martin@fulmo.org> wrote:
Hi list!
On a hyperconverged cluster with three hosts, I am unable to start the ovirt-ha-agent service.
The history:
As all three hosts were running CentOS 8, I upgraded host3 to CentOS 8 Stream first, leaving host1 and host2 (and all VMs) untouched, basically as a test. All migrations of VMs to host3 then failed with:
```
qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.0/pcie-root-port'
2021-12-24T00:56:49.428234Z qemu-kvm: load of migration failed: Invalid argument
```
IIRC something similar was reported on the lists - that you can't (always? easily?) migrate VMs between CentOS Linux 8 (.3? .4? not sure) and current Stream. Is migration mandatory for you? If not, you might try, in a test environment, stopping and starting your VMs instead, and decide whether that is good enough.
Since I haven't had the time to dig into that, I decided to roll back the upgrade, rebooted host3 into CentOS 8, and re-installed host3 through the engine appliance. During that process (and the restart of host3), the engine appliance became unresponsive and crashed.
Perhaps provide more details, if you have them. Did you put host3 into maintenance? Remove it? etc.
The problem:
Currently the ovirt-ha-agent service fails on all hosts with the following message in /var/log/ovirt-hosted-engine-ha/agent.log:
```
MainThread::INFO::2021-12-24 03:56:03,500::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.9 started
MainThread::INFO::2021-12-24 03:56:03,516::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-12-24 03:56:03,575::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2021-12-24 03:56:03,576::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': 'GATEWAY_IP', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
```
Not sure where 'GATEWAY_IP' comes from, but it should be the actual IP address, e.g.:

MainThread::INFO::2021-12-20 07:51:05,151::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '192.168.201.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}

Check (and perhaps fix manually, if you can't or don't want to first diagnose/fix your reinstallation) the 'gateway=' line in /etc/ovirt-hosted-engine/hosted-engine.conf. Perhaps also compare this file with the one on your other hosts - only the 'host_id' line should differ between them.
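A quick way to check that (a minimal sketch; host1 is just an example peer here, and you may reach it differently than via root ssh):

```
# On the host whose agent logs 'GATEWAY_IP': what does the config actually say?
grep '^gateway=' /etc/ovirt-hosted-engine/hosted-engine.conf

# Compare the whole file with a working host; apart from 'host_id'
# (and a possibly broken 'gateway=') the files should match.
ssh root@host1 cat /etc/ovirt-hosted-engine/hosted-engine.conf \
  | diff - /etc/ovirt-hosted-engine/hosted-engine.conf
```

If you do edit the file manually, the ovirt-ha-broker and ovirt-ha-agent services on that host would presumably need a restart to pick up the change.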
```
MainThread::ERROR::2021-12-24 03:56:03,577::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
```
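Since the agent asks ovirt-ha-broker to start those monitors, the broker's own status and log on the same host may show the underlying error; a minimal check might look like:

```
# Is the broker (which the agent talks to) running and healthy?
systemctl status ovirt-ha-broker ovirt-ha-agent

# The broker log usually records why a monitor such as 'network' failed to start
tail -n 50 /var/log/ovirt-hosted-engine-ha/broker.log
journalctl -u ovirt-ha-broker --since "1 hour ago"
```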
Now I've stumbled upon bug [1984262](https://bugzilla.redhat.com/show_bug.cgi?id=1984262), but it doesn't seem to apply: all hosts resolve properly, all hosts have proper hostnames set, unique /etc/hosts entries, and proper A records (in the form hostname.subdomain.domain.tld).
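For what it's worth, the per-host checks meant here might look roughly like this (the FQDN below is a placeholder following the pattern above):

```
# Hostname as configured on the host itself
hostnamectl status
hostname -f

# Resolution as glibc sees it (this honours /etc/hosts)
getent hosts host3.subdomain.domain.tld

# The A record as DNS itself returns it, bypassing /etc/hosts
dig +short host3.subdomain.domain.tld A
```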
The versions involved are:
```
[root@host2 ~]# rpm -qa ovirt*
ovirt-hosted-engine-setup-2.5.4-2.el8.noarch
ovirt-imageio-daemon-2.3.0-1.el8.x86_64
ovirt-host-dependencies-4.4.9-2.el8.x86_64
ovirt-vmconsole-1.0.9-1.el8.noarch
ovirt-imageio-client-2.3.0-1.el8.x86_64
ovirt-host-4.4.9-2.el8.x86_64
ovirt-python-openvswitch-2.11-1.el8.noarch
ovirt-openvswitch-ovn-host-2.11-1.el8.noarch
ovirt-provider-ovn-driver-1.2.34-1.el8.noarch
ovirt-openvswitch-ovn-2.11-1.el8.noarch
ovirt-release44-4.4.9.2-1.el8.noarch
ovirt-openvswitch-2.11-1.el8.noarch
ovirt-ansible-collection-1.6.5-1.el8.noarch
ovirt-openvswitch-ovn-common-2.11-1.el8.noarch
ovirt-hosted-engine-ha-2.4.9-1.el8.noarch
ovirt-vmconsole-host-1.0.9-1.el8.noarch
ovirt-imageio-common-2.3.0-1.el8.x86_64
```
Any hint on how to fix this is really appreciated. I'd like to get the engine appliance back, then remove host3 and re-initialize it, since this is a production cluster (with host1 and host2 replicating the gluster storage and host3 acting as an arbiter).
OK. Good luck and best regards, -- Didi