
Hello, I’ve a strange issue with oVirt 4.4.1. The hosted engine is stuck in the UEFI firmware and never boots. I think this happened when I changed the default VM mode for the cluster inside the datacenter. Is there a way to fix this without redeploying the engine?

On Sun, Aug 23, 2020 at 7:45 AM Vinícius Ferrão via Users <users@ovirt.org> wrote:
Hello, I’ve a strange issue with oVirt 4.4.1.
The hosted engine is stuck in the UEFI firmware and never boots.
I think this happened when I changed the default VM mode for the cluster inside the datacenter.
If you think this indeed is the root cause, then perhaps:
Is there a way to fix this without redeploying the engine?
If you happen to have backup copies of /var/run/ovirt-hosted-engine-ha/vm.conf, you can try:

hosted-engine --vm-start --vm-conf=somefile

If this works, update the VM/cluster/whatever back to a good state from the engine (after it's up), and wait to make sure it updated vm.conf under /var before you try to shut down/start the engine VM again.

Best regards,
--
Didi
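Didi's suggestion above can be sketched as a small shell snippet. Only the /var/run/ovirt-hosted-engine-ha/vm.conf path and the `hosted-engine --vm-start --vm-conf=` invocation come from the thread; the `BACKUP_DIR` override and the `vm.conf*` glob for saved copies are assumptions:

```shell
# Sketch of the recovery idea above. The vm.conf path comes from the
# thread; the BACKUP_DIR override and "vm.conf*" glob are assumptions.
BACKUP_DIR="${BACKUP_DIR:-/var/run/ovirt-hosted-engine-ha}"

# Pick the most recently modified saved copy, if any exists.
latest="$(ls -1t "$BACKUP_DIR"/vm.conf* 2>/dev/null | head -n 1)"

if [ -n "$latest" ]; then
    # Print the command that would start the engine VM from the saved config.
    echo "hosted-engine --vm-start --vm-conf=$latest"
else
    echo "no saved vm.conf found under $BACKUP_DIR"
fi
```

The `echo` only prints the command, so the sketch is safe to run anywhere; drop it to actually start the VM (which of course requires a hosted-engine host).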

Hi Didi.

On 25 Aug 2020, at 04:52, Yedidyah Bar David <didi@redhat.com> wrote:

On Sun, Aug 23, 2020 at 7:45 AM Vinícius Ferrão via Users <users@ovirt.org> wrote:

Hello, I’ve a strange issue with oVirt 4.4.1. The hosted engine is stuck in the UEFI firmware and never boots. I think this happened when I changed the default VM mode for the cluster inside the datacenter.

If you think this indeed is the root cause, then perhaps:

I double-checked. By default the Hosted Engine does not set the BIOS type; it is inherited from the cluster: [inline screenshot]

I tried to set it manually, but edits on the Hosted Engine are locked. So maybe the default HE template should set this, to avoid this mess?

Is there a way to fix this without redeploying the engine?

If you happen to have backup copies of /var/run/ovirt-hosted-engine-ha/vm.conf, you can try:

hosted-engine --vm-start --vm-conf=somefile

If this works, update the VM/cluster/whatever back to a good state from the engine (after it's up), and wait to make sure it updated vm.conf under /var before you try to shut down/start the engine VM again.

Best regards,
--
Didi

I ended up redeploying from the ground up and reimporting the VMs. I even tested again to confirm my first assumptions. So I think we may have a bug, right?

Thank you.

Thanks for diving into that mess first, because it allowed me to understand what I had done as well...

In my case the issue was that a VM moved from 4.3 to 4.4 seemed to be silently upgraded from "default" (whatever the default was on 4.3) to "Q35", which seems to be the new default in 4.4. But that made it lose the network, because udev was now renaming the NIC in yet another manner, when few VMs ever need anything beyond eth0 anyway.

So I went ahead and changed the cluster defaults to those of the 4.3 cluster (including Nehalem CPUs, because I also use J5005 Atom systems). BTW, that was initially impossible, as the edit button for the cluster was always greyed out. But on a browser refresh, it suddenly was enabled...

What I don't remember is whether the cluster had a BIOS default (it doesn't on 4.3), or if I changed that in the default template, which is mentioned somewhere here as being rather destructive.

I was about to re-import the machine from an export domain when I did a scheduled reboot of the single-node HCI cluster after OS updates. Those HCI reboots always require a bit of twiddling on 4.3 and 4.4 for the hosted engine to start, evidently because of some race conditions (requiring restarts of glusterd/ovirt-ha-broker/ovirt-ha-agent/vdsmd to fix), but this time the SHE simply didn't want to start at all; after some digging through log files, it turned out to be complaining about missing PCI devices at boot.

With my 4.4 instance currently dead, I don't remember if the BIOS or PCI vs. PCIe machine type is a cluster attribute or part of the template, but I do seem to remember that the hosted engine is a bit special here, especially when it comes to picking up the base CPU type.

What is a bit astonishing is the fall-through processing that seems to go on here, when an existing VM should have had its hardware nailed down when it was shut down.

I then realized that I might have killed the hosted engine right there. And no, /var/run/ovirt...vm.conf is long gone, so I guess it's time for a re-install.
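The NIC rename that Thomas describes can usually be worked around inside the guest by pinning the interface name to the MAC address. A minimal sketch, assuming the guest uses udev for interface naming; the file name follows the classic persistent-net convention, and the MAC shown is a placeholder for the VM's real one:

```
# /etc/udev/rules.d/70-persistent-net.rules (inside the guest)
# Pin the NIC to "eth0" by MAC address, so a machine-type change
# (e.g. i440FX -> Q35) no longer renames it. Replace the MAC below
# with the VM's actual one.
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:1a:4a:00:00:01", NAME="eth0"
```

Alternatively, booting the guest with `net.ifnames=0` on the kernel command line disables the predictable naming scheme and restores the old ethX names.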
For me one issue remains unclear: how identical do machines remain as they are moved from a 4.3 host to a 4.4 host?

In my view, a hypervisor's most basic social contract is to turn a machine into a file, and the file back into the very same machine, hopefully even for decades. Upgrades of the virtual hardware should be possible, but under the control of the user/orchestrator.

I am afraid that oVirt's dynamic reconstruction of the machine from database data doesn't always respect that social contract, and that needs at least documentation, if not fixing. The 4.3 to 4.4 migration is far from seamless already; this does not help.
participants (3)
- thomas@hoberg.net
- Vinícius Ferrão
- Yedidyah Bar David