Unable to start ovirt-ha-agent on all hosts

Hi list!

On a hyperconverged cluster with three hosts I am unable to start the ovirt-ha-agent.

The history: As all three hosts were running Centos 8, I tried to upgrade host3 to Centos 8 Stream first and left all VMs and host1 and host2 untouched, basically as a test. After all migrations of VMs to host3 failed with:

```
qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.0/pcie-root-port'
2021-12-24T00:56:49.428234Z qemu-kvm: load of migration failed: Invalid argument
```

and since I haven't had the time to dig into that, I decided to roll back the upgrade, rebooted host3 into Centos 8 again and re-installed host3 through the engine appliance. During that process (and the restart of host3) the engine appliance became unresponsive and crashed.

The problem: Currently all ovirt-ha-agent services on all hosts fail with the following message in /var/log/ovirt-hosted-engine-ha/agent.log:

```
MainThread::INFO::2021-12-24 03:56:03,500::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.9 started
MainThread::INFO::2021-12-24 03:56:03,516::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-12-24 03:56:03,575::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2021-12-24 03:56:03,576::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': 'GATEWAY_IP', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
MainThread::ERROR::2021-12-24 03:56:03,577::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
```

Now I've stumbled upon this one [1984262](https://bugzilla.redhat.com/show_bug.cgi?id=1984262) but it doesn't seem to apply. All hosts resolve properly, all hosts also have proper hostnames set, unique /etc/hosts entries and proper A records set (in the form of hostname.subdomain.domain.tld).

The versions involved are:

```
[root@host2 ~]# rpm -qa ovirt*
ovirt-hosted-engine-setup-2.5.4-2.el8.noarch
ovirt-imageio-daemon-2.3.0-1.el8.x86_64
ovirt-host-dependencies-4.4.9-2.el8.x86_64
ovirt-vmconsole-1.0.9-1.el8.noarch
ovirt-imageio-client-2.3.0-1.el8.x86_64
ovirt-host-4.4.9-2.el8.x86_64
ovirt-python-openvswitch-2.11-1.el8.noarch
ovirt-openvswitch-ovn-host-2.11-1.el8.noarch
ovirt-provider-ovn-driver-1.2.34-1.el8.noarch
ovirt-openvswitch-ovn-2.11-1.el8.noarch
ovirt-release44-4.4.9.2-1.el8.noarch
ovirt-openvswitch-2.11-1.el8.noarch
ovirt-ansible-collection-1.6.5-1.el8.noarch
ovirt-openvswitch-ovn-common-2.11-1.el8.noarch
ovirt-hosted-engine-ha-2.4.9-1.el8.noarch
ovirt-vmconsole-host-1.0.9-1.el8.noarch
ovirt-imageio-common-2.3.0-1.el8.x86_64
```

Any hint how to fix this is really appreciated. I'd like to get the engine appliance back, remove host3 and re-initialize it, since this is a production cluster (with hosts 1 and 2 replicating the gluster storage and host3 acting as an arbiter).

Thanks in advance,
Martin

On Sun, Dec 26, 2021 at 12:24 PM <martin@fulmo.org> wrote:
Hi list!
on a hyperconverged cluster with three hosts I am unable to start the ovirt-ha-agent.
The history:
As all three hosts were running Centos 8, I tried to upgrade host3 to Centos 8 Stream first and left all VMs and host1 and host2 untouched, basically as a test. After all migrations of VMs to host3 failed with:
```
qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.0/pcie-root-port'
2021-12-24T00:56:49.428234Z qemu-kvm: load of migration failed: Invalid argument
```
IIRC something similar was reported on the lists - that you can't (always? easily?) migrate VMs between CentOS Linux 8 (.3? .4? not sure) and current Stream. Is this mandatory for you? If not, you might test on a test env stopping/starting your VMs and decide this is good enough.
and since I haven't had the time to dig into that, I decided to roll back the upgrade and rebooted host3 into Centos 8 again and re-installed host3 through the engine appliance. During that process (and the restart of host3) the engine appliance became unresponsive and crashed.
Perhaps provide more details, if you have them. Did you put host3 to maintenance? Remove it? etc.
The problem:
Currently all ovirt-ha-agent services on all hosts fail with the following message in /var/log/ovirt-hosted-engine-ha/agent.log
```
MainThread::INFO::2021-12-24 03:56:03,500::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.9 started
MainThread::INFO::2021-12-24 03:56:03,516::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-12-24 03:56:03,575::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2021-12-24 03:56:03,576::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': 'GATEWAY_IP', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
MainThread::ERROR::2021-12-24 03:56:03,577::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
```
Not sure where 'GATEWAY_IP' comes from, but it should be the actual IP address, e.g.:
```
MainThread::INFO::2021-12-20 07:51:05,151::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '192.168.201.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
```
Check (and perhaps fix manually, if you can't/do not want to first diagnose/fix your reinstallation) the line 'gateway=' in /etc/ovirt-hosted-engine/hosted-engine.conf. Perhaps compare this file to your other hosts - only the line 'host_id' should be different between them.
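A minimal sketch of that check, assuming the config path mentioned above and placeholder hostnames for the peer hosts:
```
# Show the gateway the HA agent will monitor on this host.
grep '^gateway=' /etc/ovirt-hosted-engine/hosted-engine.conf

# Compare the file between hosts (host1/host2 are placeholders); apart from
# the 'host_id' line, the files are expected to be identical.
diff <(ssh host1 cat /etc/ovirt-hosted-engine/hosted-engine.conf) \
     <(ssh host2 cat /etc/ovirt-hosted-engine/hosted-engine.conf)
```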
Now I've stumbled upon this one [1984262](https://bugzilla.redhat.com/show_bug.cgi?id=1984262) but it doesn't seem to apply. All hosts resolve properly, all hosts also have proper hostnames set, unique /etc/hosts entries and proper A records set (in the form of hostname.subdomain.domain.tld).
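For reference, a quick way to double-check the name resolution that bug is about, run on each host (the FQDN below is a placeholder):
```
# The host's own FQDN, and whether it resolves consistently.
hostname -f
getent hosts "$(hostname -f)"

# Forward A record for a peer host (placeholder name).
dig +short A host1.subdomain.domain.tld
```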
The versions involved are:
```
[root@host2 ~]# rpm -qa ovirt*
ovirt-hosted-engine-setup-2.5.4-2.el8.noarch
ovirt-imageio-daemon-2.3.0-1.el8.x86_64
ovirt-host-dependencies-4.4.9-2.el8.x86_64
ovirt-vmconsole-1.0.9-1.el8.noarch
ovirt-imageio-client-2.3.0-1.el8.x86_64
ovirt-host-4.4.9-2.el8.x86_64
ovirt-python-openvswitch-2.11-1.el8.noarch
ovirt-openvswitch-ovn-host-2.11-1.el8.noarch
ovirt-provider-ovn-driver-1.2.34-1.el8.noarch
ovirt-openvswitch-ovn-2.11-1.el8.noarch
ovirt-release44-4.4.9.2-1.el8.noarch
ovirt-openvswitch-2.11-1.el8.noarch
ovirt-ansible-collection-1.6.5-1.el8.noarch
ovirt-openvswitch-ovn-common-2.11-1.el8.noarch
ovirt-hosted-engine-ha-2.4.9-1.el8.noarch
ovirt-vmconsole-host-1.0.9-1.el8.noarch
ovirt-imageio-common-2.3.0-1.el8.x86_64
```
Any hint how to fix this is really appreciated. I'd like to get the engine appliance back, remove host 3 and re-initialize it since this is a production cluster (with hosts 1 and 2 replicating the gluster storage and host 3 acting as an arbiter).
OK. Good luck and best regards, -- Didi

Hi Didi!
On Sun, Dec 26, 2021 at 12:24 PM <martin@fulmo.org> wrote:
IIRC something similar was reported on the lists - that you can't (always? easily?) migrate VMs between CentOS Linux 8 (.3? .4? not sure) and current Stream. Is this mandatory for you? If not, you might test on a test env stopping/starting your VMs and decide this is good enough.
Of course I tried that: instead of migrating the VMs, I simply stopped them on the Centos 8.5 hosts and started them on the Centos Stream 8 host. It did not work; the network was not reachable, but there was no error thrown in the engine appliance, no event, nothing.
Perhaps provide more details, if you have them. Did you put host3 to maintenance? Remove it? etc.
With everything I performed, host3 was put into maintenance first. It has not been removed. I planned to remove it, but then the error below ("Unable to start ovirt-ha-agent on all hosts") started appearing.
Not sure where 'GATEWAY_IP' comes from, but it should be the actual IP address, e.g.:
I just replaced the actual gateway IP with 'GATEWAY_IP' in the error message, since the message does expose the gateway IP rather than the respective host IP. This issue confirms that finding: https://lists.ovirt.org/archives/list/users@ovirt.org/message/DTEY6OTBIWNX2D...
MainThread::INFO::2021-12-20 07:51:05,151::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '192.168.201.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
Check (and perhaps fix manually, if you can't/do not want to first diagnose/fix your reinstallation) the line 'gateway=' in /etc/ovirt-hosted-engine/hosted-engine.conf . Perhaps compare this file to your other hosts - only the line 'host_id' should be different between them.
Everything in /etc/ovirt-hosted-engine/hosted-engine.conf is as you describe it. Only the host_id is different. I haven't touched that file either. Cheers, Martin

On Tue, Dec 28, 2021 at 7:39 AM <martin@fulmo.org> wrote:
Hi Didi!
On Sun, Dec 26, 2021 at 12:24 PM <martin@fulmo.org> wrote:
IIRC something similar was reported on the lists - that you can't (always? easily?) migrate VMs between CentOS Linux 8 (.3? .4? not sure) and current Stream. Is this mandatory for you? If not, you might test on a test env stopping/starting your VMs and decide this is good enough.
Of course I tried that: instead of migrating the VMs, I simply stopped them on the Centos 8.5 hosts and started them on the Centos Stream 8 host. It did not work; the network was not reachable, but there was no error thrown in the engine appliance, no event, nothing.
Perhaps provide more details, if you have them. Did you put host3 to maintenance? Remove it? etc.
With everything I performed, host3 was put into maintenance first. It has not been removed. I planned to remove it, but then the error below ("Unable to start ovirt-ha-agent on all hosts") started appearing.
Not sure where 'GATEWAY_IP' comes from, but it should be the actual IP address, e.g.:
I just replaced the actual gateway IP with 'GATEWAY_IP' in the error message, since the message does expose the gateway IP rather than the respective host IP. This issue confirms that finding: https://lists.ovirt.org/archives/list/users@ovirt.org/message/DTEY6OTBIWNX2D...
MainThread::INFO::2021-12-20 07:51:05,151::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '192.168.201.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
Check (and perhaps fix manually, if you can't/do not want to first diagnose/fix your reinstallation) the line 'gateway=' in /etc/ovirt-hosted-engine/hosted-engine.conf . Perhaps compare this file to your other hosts - only the line 'host_id' should be different between them.
Everything in /etc/ovirt-hosted-engine/hosted-engine.conf is as you describe it. Only the host_id is different. I haven't touched that file either.
Can you please check/share also broker.log? Thanks. Best regards, -- Didi

On Tue, Dec 28, 2021 at 9:06 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Dec 28, 2021 at 7:39 AM <martin@fulmo.org> wrote:
Hi Didi!
On Sun, Dec 26, 2021 at 12:24 PM <martin@fulmo.org> wrote:
IIRC something similar was reported on the lists - that you can't (always? easily?) migrate VMs between CentOS Linux 8 (.3? .4? not sure) and current Stream. Is this mandatory for you? If not, you might test on a test env stopping/starting your VMs and decide this is good enough.
Of course I tried that: instead of migrating the VMs, I simply stopped them on the Centos 8.5 hosts and started them on the Centos Stream 8 host. It did not work; the network was not reachable, but there was no error thrown in the engine appliance, no event, nothing.
Perhaps provide more details, if you have them. Did you put host3 to maintenance? Remove it? etc.
With everything I performed, host3 was put into maintenance first. It has not been removed. I planned to remove it, but then the error below ("Unable to start ovirt-ha-agent on all hosts") started appearing.
Not sure where 'GATEWAY_IP' comes from, but it should be the actual IP address, e.g.:
I just replaced the actual gateway IP with 'GATEWAY_IP' in the error message, since the message does expose the gateway IP rather than the respective host IP. This issue confirms that finding: https://lists.ovirt.org/archives/list/users@ovirt.org/message/DTEY6OTBIWNX2D...
MainThread::INFO::2021-12-20 07:51:05,151::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '192.168.201.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
Check (and perhaps fix manually, if you can't/do not want to first diagnose/fix your reinstallation) the line 'gateway=' in /etc/ovirt-hosted-engine/hosted-engine.conf . Perhaps compare this file to your other hosts - only the line 'host_id' should be different between them.
Everything in /etc/ovirt-hosted-engine/hosted-engine.conf is as you describe it. Only the host_id is different. I haven't touched that file either.
Can you please check/share also broker.log? Thanks.
Also, you can try enabling debug-level logging by editing /etc/ovirt-hosted-engine-ha/*log.conf, setting 'level=DEBUG' under '[logger_root]' and restarting the services. Best regards, -- Didi
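(A minimal sketch of that debug-logging change, assuming the standard file names under /etc/ovirt-hosted-engine-ha/ and that the current level is INFO:)
```
# Switch the root logger of the HA agent/broker log configs to DEBUG
# (assumes the files currently contain 'level=INFO' under '[logger_root]').
sed -i '/^\[logger_root\]/,/^\[/ s/^level=INFO/level=DEBUG/' \
    /etc/ovirt-hosted-engine-ha/*log.conf

# Restart the services so the new level takes effect.
systemctl restart ovirt-ha-broker ovirt-ha-agent
```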

Hi Didi,
Can you please check/share also broker.log? Thanks.
I did that. Turns out that:

```
ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException: path to storage domain e1f61a9f-0c93-4d01-8f6f-7f8a5470ee2f not found in /rhev/data-center/mnt/glusterSD
```

and I noticed that the glusterd service was not started on host3 (vendor setting was set to disabled). After starting the glusterd service the ovirt-ha-agent services recovered, the hosted-engine could be started, and then it blew my mind:

While I was switching host3 into maintenance, I did not notice that the hosted-engine had marked host1 "non-responsive" (although the host was fine) and scheduled the migration of host1's VMs to host3. Setting host3 to maintenance cancelled the scheduled migrations, but two VMs had already been migrated, so they were migrated (back) to host2.

Now this is the result:

```
VM xyz is down with error. Exit message: internal error: process exited while connecting to monitor: 2021-12-28T06:33:19.011352Z qemu-kvm: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":null}: qcow2: Image is corrupt; cannot be opened read/write. 12/28/21 7:33:21 AM
```

Trying to repair the image with "qemu-img check -r all" failed.

What an experience. Maybe I'm too stupid for this.
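For reference, a hedged sketch of the recovery steps described above, assuming the standard systemd unit names on host3:
```
# Start glusterd now and make sure it comes up on boot.
systemctl enable --now glusterd

# Restart the hosted-engine HA services so they re-detect the storage domain.
systemctl restart ovirt-ha-broker ovirt-ha-agent

# Verify the gluster storage domain is mounted again and check HA state.
ls /rhev/data-center/mnt/glusterSD/
hosted-engine --vm-status
```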

On Tue, Dec 28, 2021 at 9:37 AM <martin@fulmo.org> wrote:
Hi Didi,
Can you please check/share also broker.log? Thanks.
I did that. Turns out that ...
```
ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException: path to storage domain e1f61a9f-0c93-4d01-8f6f-7f8a5470ee2f not found in /rhev/data-center/mnt/glusterSD
```
... and I noticed that the glusterd service was not started on host3 (vendor setting was set to disabled). After starting the glusterd service the ovirt-ha-agent services recovered, the hosted-engine could be started, and then it blew my mind:
While I was switching host3 into maintenance, I did not notice that the hosted-engine had marked host1 "non-responsive" (although the host was fine) and scheduled the migration of host1's VMs to host3. Setting host3 to maintenance cancelled the scheduled migrations, but two VMs had already been migrated, so they were migrated (back) to host2.
Now this is the result:
```
VM xyz is down with error. Exit message: internal error: process exited while connecting to monitor: 2021-12-28T06:33:19.011352Z qemu-kvm: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":null}: qcow2: Image is corrupt; cannot be opened read/write. 12/28/21 7:33:21 AM
```
Trying to repair the image with "qemu-img check -r all" failed.
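For illustration only, the check/repair invocation would look roughly like this; the image path is a placeholder and has to be looked up (e.g. in the engine UI or under the gluster storage domain mount), with the VM shut down:
```
# Placeholder path -- substitute the real storage-domain/image/volume UUIDs.
IMG='/rhev/data-center/mnt/glusterSD/<server>:_<volume>/<sd-uuid>/images/<img-uuid>/<vol-uuid>'

qemu-img check "$IMG"          # read-only consistency check
qemu-img check -r all "$IMG"   # the repair attempt that failed here
```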
What an experience. Maybe I'm too stupid for this.
Sorry. You might want to ask specifically about the corruption, perhaps starting another thread on this list with a suitable subject, or on gluster or qemu mailing lists. Good luck and best regards, -- Didi