Hello,

On Mon, Feb 7, 2022 at 8:45 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Sun, Feb 6, 2022 at 5:09 PM Gilboa Davara <gilboad@gmail.com> wrote:
>
> Unlike my predecessor, I not only lost my vmengine, I also lost the vdsm services on all hosts.
> All seem to be hitting the same issue: the certs under /etc/pki/vdsm/certs and /etc/pki/ovirt* all expired a couple of days ago.
> As such, the hosted engine cannot go into global maintenance mode,

What do you mean by that? What happens if you 'hosted-engine
--set-maintenance --mode=global'?

Failed, stating the cluster is not in global maintenance mode.
(Understandable, given that two of the three hosts were offline due to certificate issues...)

 

> preventing engine-setup --offline from running.

Actually just a few days ago I pushed a patch for:

https://bugzilla.redhat.com/show_bug.cgi?id=1700460

But:

If you really have a problem that you can't set global maintenance,
using this is a risk - HA might intervene in the middle and shut down
the VM. So either make sure global maintenance does work, or stop
all HA services on all hosts.

> Two questions:
> 1. Is there any automated method to renew the vdsm certificates?

You mean, without an engine?

I think that if you have a functional engine one way or another,
you can automate this somehow, though I didn't check. Try looking at
e.g. the Python SDK examples - there might be something there that
you can base this on.

> 2. Assuming the previous answer is "no", and assuming I'm somewhat versed in using openssl, how can I manually renew them?

I'd rather not try to invent from memory how this is supposed to work,
and doing this methodically and verifying before replying is quite
an effort.

If this is really what you want, I suggest something like:

1. Set up a test env with an engine and one host
2. Back up (or use git on) /etc on both
3. Renew the host cert from the UI
4. Check what changed

You should find, IMO, that the key(s) on the host didn't
change. I guess you might also find CSRs on one or both of them.
So basically it should be something like:
1. Create a CSR on the host for the existing key (one or more,
not sure).
2. Copy and sign this on the engine using pki-enroll-request.sh
(I think you can find examples for it scattered around, perhaps
even in the main guides)
3. Copy back the generated certs to the host
4. Perhaps restart one or more services there (vdsm, imageio?,
ovn, etc.)

You can check the code in
/usr/share/ovirt-engine/ansible-runner-service-project/project
to see how it's done when initiated from the UI.
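
For what it's worth, step 1 of the manual procedure above could be
sketched roughly like this in Python. I did not verify it against a
real host: the key path is an assumption (it is not mentioned anywhere
in this thread), the host name and organization are made-up
placeholders, and the subject should really be copied from the
existing cert (openssl x509 -noout -subject -in <cert> prints it).
The sketch only builds and prints the openssl command, it does not
run it.

```python
import shlex
from typing import List

# Assumption: the usual VDSM key location. It is not stated in this
# thread, so verify it on your own host before running anything.
ASSUMED_VDSM_KEY = "/etc/pki/vdsm/keys/vdsmkey.pem"


def csr_command(key_path: str, csr_path: str, subject: str) -> List[str]:
    """Build the openssl argv that creates a CSR for an existing key.

    Reusing the key matches the expectation that a renewal leaves the
    key(s) on the host unchanged.
    """
    if not subject.startswith("/"):
        raise ValueError("subject must look like /O=.../CN=...")
    return [
        "openssl", "req", "-new",
        "-key", key_path,
        "-out", csr_path,
        "-subj", subject,
    ]


# Hypothetical host name and organization, for illustration only.
cmd = csr_command(
    ASSUMED_VDSM_KEY,
    "/tmp/host1.example.org.req",
    "/O=example.org/CN=host1.example.org",
)
print(shlex.join(cmd))
```

Signing the resulting request on the engine (step 2) and copying the
cert back (step 3) are deliberately left out, since I'd rather not
guess the exact pki-enroll-request.sh arguments from memory.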

Good luck and best regards,

I more or less found a document stating the above somewhere in the middle of the night.
Tried it.
Got the WebUI working again.
However, for the life of me I couldn't get the hosts to talk to the engine (even though I could use openssl s_client -showcerts -connect host and got valid certs).
In the end, around 4am, I decided to take the brute-force route: clean the hosts, upgrade them to -streams, and redeploy the engine (3rd attempt, after a sufficient amount of coffee reminded me that qemu-6.1 is broken and needed to be downgraded before trying to deploy the HE...).
Either way, when I finish importing the VMs, I'll open an RFE to add BIG-WARNING-IN-BOLD-LETTERS in the WebUI to notify the admin that the certificates are about to expire.
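
Until something like that exists, a small check along these lines could be run periodically on each host. This is only a sketch, not what the engine does: it assumes the expiry date is obtained separately with openssl x509 -noout -enddate -in <cert>, which prints a line such as "notAfter=Feb  5 12:00:00 2022 GMT", and the 30-day threshold is an arbitrary choice of mine.

```python
import ssl
import time
from typing import Optional

WARN_DAYS = 30.0  # arbitrary warning threshold, adjust to taste


def parse_enddate_line(line: str) -> str:
    """Strip the 'notAfter=' prefix printed by `openssl x509 -enddate`."""
    prefix = "notAfter="
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError("expected a line starting with 'notAfter='")
    return line[len(prefix):]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return days left before the certificate expires (negative if expired).

    not_after uses the certificate time format, e.g.
    'Jan  5 09:34:43 2018 GMT'.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0


def needs_warning(not_after: str, now: Optional[float] = None) -> bool:
    """True when the certificate is expired or expires within WARN_DAYS."""
    return days_until_expiry(not_after, now) <= WARN_DAYS
```

Feeding it the enddate of every cert under /etc/pki/vdsm/certs and /etc/pki/ovirt* would have flagged the problem well before the hosts went offline.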

Thanks for the help!

- Gilboa

 
--
Didi