Hosted engine restore went very wrong

I really feel like an idiot. I tried to move our hosted engine from our Default datacenter to our Ceph Datacenter. I ran into problems, which were correctly addressed; see: https://lists.ovirt.org/archives/list/users@ovirt.org/thread/ZFCLFWRN6XR6KMH... In this case it was a race condition. I was able to bring the engine back to our Default Cluster. Then I tried the move to our Ceph Datacenter again, and I got the error "The target Data Center does not contain the Virtual Disk" twice yesterday. Because it was late, I decided to continue the next morning.

I took a new backup of the engine, copied it over to the new node of the Ceph Datacenter and started hosted-engine --deploy. But I FORGOT TO SHUT DOWN the other engine! Oh man. The deploy script errored out with:

[ ERROR ] fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
[ ERROR ] fatal: [localhost -> engine.infra.solutions.work]: FAILED! => {"changed": false, "msg": "There was a failure deploying the engine on the local engine VM. The system may not be provisioned accord

Then I realised something was different this time. I shut down and undefined the Local Engine. The node is now in a degraded state. Is it possible to start the deployment again on a degraded node? I started the old engine again, but I'm not able to reach the login page. Any idea what to do next?
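A guard run just before hosted-engine --deploy might catch the "old engine still running" mistake described above. The following is only a minimal sketch: the function name engine_is_reachable is hypothetical, the FQDN is the one that appears in the deploy log, and treating an open TCP port 443 as "the engine is still up" is an assumption, not something the deploy script itself checks.

```python
import socket


def engine_is_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # FQDN taken from the deploy log; port 443 is an assumption.
    fqdn = "engine.infra.solutions.work"
    if engine_is_reachable(fqdn):
        raise SystemExit(f"{fqdn} still answers; shut down the old engine before deploying")
    print(f"{fqdn} is not answering; the old engine appears to be down")
```

A closed port only suggests that the engine is down. It does not prove that the VM is stopped, so checking the VM state on the old hosts is still advisable.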

First I rebooted the host I tried to deploy to. The node status is now "OK" again. I will try to deploy on that node again.

Deployment looked good, but the race condition (which I faced five times out of seven deployments) is back again.

[ INFO ] TASK [ovirt.hosted_engine_setup : Add HE disks]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Register disk details]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Set default graphics protocols]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Check if FIPS is enabled]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Select graphic protocols]
[ INFO ] skipping: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Add VM]
[ ERROR ] Error: Fault reason is "Operation Failed". Fault detail is "[Cannot attach Virtual Disk. The target Data Center does not contain the Virtual Disk.]". HTTP response code is 409.
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Cannot attach Virtual Disk. The target Data Center does not contain the Virtual Disk.]\". HTTP response code is 409."}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
[ INFO ] Stage: Clean up
[ INFO ] Cleaning temporary resources
[ INFO ] TASK [ovirt.hosted_engine_setup : Execute just a specific set of steps]
[ INFO ] ok: [localhost]

Will try again, but yesterday I got this error two times in a row.
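When comparing several failed attempts, it can help to pull the failing task and the HTTP code out of the deploy output automatically. The sketch below assumes only the "[ INFO ]" / "[ ERROR ]" / "TASK [...]" layout visible in the output above. The helper names split_entries and summarize_failure are hypothetical and not part of hosted-engine or Ansible.

```python
import re
from typing import List, Optional, Tuple

# Level markers as they appear in the deploy output; \s* tolerates extra padding.
MARKER = re.compile(r"\[ (INFO|ERROR)\s*\]")
TASK = re.compile(r"TASK \[(.+?)\]")
HTTP_CODE = re.compile(r"HTTP response code is (\d+)")


def split_entries(log_text: str) -> List[Tuple[str, str]]:
    """Split raw output into (level, text) pairs, one per marker."""
    parts = MARKER.split(log_text)
    # parts is [prefix, level1, text1, level2, text2, ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def summarize_failure(log_text: str) -> Optional[dict]:
    """Return the last task seen before the first ERROR entry, or None if no error."""
    last_task = None
    for level, text in split_entries(log_text):
        if level == "INFO":
            match = TASK.search(text)
            if match:
                last_task = match.group(1)
        else:
            code = HTTP_CODE.search(text)
            return {
                "task": last_task,
                "http_code": int(code.group(1)) if code else None,
                "message": text,
            }
    return None
```

Because it splits on the level markers rather than on newlines, it works both on the raw console output and on output pasted as a single line.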

The second deployment still triggers "The target Data Center does not contain the Virtual Disk". Trying again, this time deleting the old data in the Gluster volume and changing the number of engine CPUs from 4 to 2. Hope this gives the race condition some different gravity.

Still no success. The Ansible script errors out again at "The target Data Center does not contain the Virtual Disk". Trying a fourth time.

Meanwhile I was able to remove the old wrecked engine with the LocalHostedEngine on Port 6900. It was stable enough to re-install that node with engine undeployment selected. So I'm now able to bring back the engine to the old Default Datacenter and NFS.
participants (1): Andreas Elvers