
Hello,

On Mon, Feb 7, 2022 at 2:25 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Mon, Feb 7, 2022 at 1:27 PM Gilboa Davara <gilboad@gmail.com> wrote:
Hello,
On Mon, Feb 7, 2022 at 8:45 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Sun, Feb 6, 2022 at 5:09 PM Gilboa Davara <gilboad@gmail.com> wrote:
Unlike my predecessor, I not only lost my engine VM, I also lost the vdsm services on all hosts.
All seem to be hitting the same issue: the certs under /etc/pki/vdsm/certs and /etc/pki/ovirt* all expired a couple of days ago. As such, the hosted engine cannot go into global maintenance mode.
What do you mean by that? What happens if you run 'hosted-engine --set-maintenance --mode=global'?
It failed, stating the cluster is not in global maintenance mode.
Please clarify, and/or share relevant logs, if you have them.
Sadly enough, no. When I zapped the old engine VM and hosts configuration, I forgot to save the logs. (In my defense, it was 4am...) That said, the fix proposed in BZ#1700460 (letting the user skip the global maintenance check) might have saved my cluster.
You had a semi-working existing HE cluster. You ran engine-backup on it, took a backup, while it was _not_ in global maintenance.
It was rather odd. One of the hosts was still active and running the HE engine. After I updated the Apache certs, I could connect to the WebUI, but the WebUI failed to access the nodes, spewing SSL handshake errors. I then proceeded to replace the host certs, which seemed to work (e.g. vdsm-client Host getCapabilities worked, hosted-engine --vm-status worked and I could see all 3 hosts), but the engine still failed to communicate with the hosts. Hence, even though I had a working cluster and engine, and I could get the cluster into global maintenance mode, engine-setup --offline continued to spew 'not in global maintenance mode' errors.

At this stage I decided to simply zap the hosted engine and run ovirt-hosted-engine-cleanup on the hosts. As my brain was half dead, I decided to do a fresh deployment and not use the daily backup.
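For reference, this is roughly how I checked the expiry dates on the hosts (standard vdsm cert path; adjust if your layout differs):

    # prints the notAfter date of the vdsm cert
    openssl x509 -noout -enddate -in /etc/pki/vdsm/certs/vdsmcert.pem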
That's ok and expected.
Then you took one of the hosts and evacuated it (or just used a new one), (re)installed the OS (or somehow cleaned it up), and ran 'hosted-engine --deploy --import-from-file' with the backup you took. This failed? Where exactly, and with what error?
I didn't use the backup. A clean hosted-engine --deploy failed due to a qemu-6.1 failure (I believe it's a known BZ#). Once I remembered to downgrade it to 6.0, everything worked as advertised (minus one export domain; see another email).
If it's the engine-setup running inside the engine VM, failing with the same error as when running 'engine-setup' (perhaps with --offline) manually, then this shouldn't happen at this point:

- engine-backup --mode=restore sets the vdc option 'DbJustRestored' in the db
- engine-setup checks this and sets its own env[JUST_RESTORED] accordingly

preventing engine-setup --offline from running.

(Understandable, given two of 3 hosts were offline due to certificate issues...)
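If you want to double-check that flag after a restore, something like this should show it (untested, and assumes the engine database is the local 'engine' db with the usual vdc_options table):

    sudo -u postgres psql engine -c \
        "SELECT option_value FROM vdc_options WHERE option_name = 'DbJustRestored';"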
Actually just a few days ago I pushed a patch for:
https://bugzilla.redhat.com/show_bug.cgi?id=1700460
But:
If you really have a problem where you can't set global maintenance, using this is a risk - HA might intervene in the middle and shut down the VM. So either make sure global maintenance does work, or stop all HA services on all hosts.
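Stopping the HA services should be something like this, on each host (these are the hosted-engine HA service names on recent versions; verify on yours):

    systemctl stop ovirt-ha-agent ovirt-ha-broker

and start them again once you are done.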
Two questions:

1. Is there any automated method to renew the vdsm certificates?
You mean, without an engine?
I think that if you have a functional engine one way or another, you can automate this somehow; I didn't check. Try checking e.g. the python sdk examples - there might be something there you can base on.
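For example, the REST API exposes an 'enrollcertificate' action on hosts, which re-enrolls the host certificate; a rough, untested sketch with curl (the engine FQDN, credentials and host id are placeholders, and the host should be in maintenance first):

    curl -s --cacert /etc/pki/ovirt-engine/ca.pem \
        -u 'admin@internal:PASSWORD' \
        -H 'Content-Type: application/xml' -H 'Accept: application/xml' \
        -X POST -d '<action/>' \
        'https://engine.example.com/ovirt-engine/api/hosts/HOST_ID/enrollcertificate'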
2. Assuming the previous answer is "no", and assuming I'm somewhat versed in using openssl, how can I manually renew them?
I'd rather not try to invent from memory how this is supposed to work, and doing this methodically and verifying before replying is quite an effort.
If this is really what you want, I suggest something like:
1. Set up a test env with an engine and one host
2. Backup (or use git on) /etc on both
3. Renew the host cert from the UI
4. Check what changed
You should find, IMO, that the key(s) on the host didn't change. I guess you might also find CSRs on one or both of them. So basically it should be something like this (a rough sketch follows below):

1. Create a CSR on the host for the existing key (one or more, not sure).
2. Copy and sign this on the engine using pki-enroll-request.sh (I think you can find examples for it scattered around, perhaps even in the main guides).
3. Copy back the generated certs to the host.
4. Perhaps restart one or more services there (vdsm, imageio?, ovn, etc.).
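Very roughly, and from memory (so verify key paths, request/cert locations and the subject format on your own setup before trusting this):

    # On the host: create a CSR for the existing vdsm key
    openssl req -new -key /etc/pki/vdsm/keys/vdsmkey.pem \
        -subj '/O=YOUR-ORG/CN=host1.example.com' -out /tmp/host1.req

    # On the engine: put the CSR where pki-enroll-request.sh expects it, then sign
    cp host1.req /etc/pki/ovirt-engine/requests/host1.req
    /usr/share/ovirt-engine/bin/pki-enroll-request.sh \
        --name=host1 --subject='/O=YOUR-ORG/CN=host1.example.com'

    # Back on the host: install the signed cert and restart vdsm
    cp host1.cer /etc/pki/vdsm/certs/vdsmcert.pem
    systemctl restart vdsmd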
You can check the code in /usr/share/ovirt-engine/ansible-runner-service-project/project to see how it's done when initiated from the UI.
Good luck and best regards,
I more or less found a document stating the above somewhere in the middle of the night. Tried it. Got the WebUI working again. However, for the life of me, I couldn't get the hosts to talk to the engine (even though I could use openssl s_client -showcerts -connect against each host and got valid certs).

In the end, at around 4am, I decided to take the brute-force route: clean the hosts, upgrade them to CentOS Stream, and redeploy the engine again (third attempt, after a sufficient amount of coffee reminded me that qemu-6.1 is broken and needed to be downgraded before trying to deploy the HE...).

Either way, when I finish importing the VMs, I'll open an RFE to add a BIG-WARNING-IN-BOLD-LETTERS in the WebUI to notify the admin that the certificates are about to expire.
You should have already received them, no?
https://bugzilla.redhat.com/show_bug.cgi?id=1258585
Best regards,
--
Didi