Certificate expiration w/o warning on all clients. Cluster in zombie state

Hello all,
Even though I do my best to keep track of the certificate issue dates across my different clusters, I somehow missed the vdsm certificate expiration in one of my clusters. Now I have an active cluster with multiple nodes (self-hosted / gluster storage), and the vdsm service is down on all nodes (due to certificate expiration). Hence, I cannot get the cluster into global maintenance mode (the vdsms are down), and I cannot access my engine (to renew the engine certificates / re-enroll hosts). How can I manually renew the host certificates?
Thanks, Gilboa

On Sun, Dec 25, 2022 at 12:36 PM Gilboa Davara <gilboad@gmail.com> wrote:
P.S. CentOS Stream 8 engine and host, oVirt v4.5.3 (I think).
- Gilboa

On Sun, Dec 25, 2022 at 12:37 PM Gilboa Davara <gilboad@gmail.com> wrote:
P.S. CentOS 8 Streams engine and host, ovirt v4.5.3 (I think).
- Gilboa
Managed to find an old email in this group (that I saved...):
https://lists.ovirt.org/archives/list/users@ovirt.org/message/56QU2AD7YUX2VZ...
This got the nodes working... but the engine (GRRR) still cannot connect to the nodes (I assume it has expired certs as well); hence, it cannot detect that the cluster is in global maintenance mode, and I cannot run engine-setup.
Added an issue: https://github.com/oVirt/ovirt-engine/issues/784
- Gilboa

OK.
Managed to get the engine up and running. But now it fails to communicate with the nodes :/
... But at least I have an engine running...

*** DISCLAIMER ***
The following may eat your data, burn your house and possibly start WW3.
Use it only if:
A. This is the last ditch attempt to save your cluster.
B. You feel brave.

As this problem literally plagues every single oVirt user, I'm posting this in an effort to create a what-to-do-when-your-certs-expire handbook.

Managed to get the engine and nodes up using a combination of data from 4 different sources.
A. Create a new local CA following the instructions here:
https://myhomelab.gr/linux/2019/12/13/local-ca-setup.html
NOTE: You need to add "keyUsage = keyEncipherment, dataEncipherment, digitalSignature" to opensslsan.cnf.
B. Use the newly created CA to generate (and deploy) the apache.p12 cert(s), following the instructions here:
https://myhomelab.gr/linux/2020/01/20/replacing_ovirt_ssl.html
... and here:
https://rhv.bradmin.org/ovirt-engine/docs/Administration_Guide/appe-Red_Hat_...
C. Rebuild the host certs using the instructions below:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/56QU2AD7YUX2VZ...

Once you restart the engine and host services, hosted-engine --vm-status between the hosts looks OK (all nodes are at 3400) and I can log in to the engine.
*However*, the engine still refuses to talk to the hosts, citing:

2022-12-26 08:53:14,727+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = gilboa-home-hv1-dev.localdomain, VdsIdAndVdsVDSCommandParametersBase:{hostId='43ddfcd5-4bd1-4731-bf30-4fedce22f3ab', vds='Host[gilboa-home-hv1-dev.localdomain,43ddfcd5-4bd1-4731-bf30-4fedce22f3ab]'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: SSL session is invalid
2022-12-26 08:53:17,744+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
2022-12-26 08:53:17,748+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [] Unable to RefreshCapabilities: VDSNetworkException: VDSGenericException: VDSNetworkException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
2022-12-26 08:53:18,187+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-96) [] Unable to RefreshCapabilities: ClientConnectionException: SSL session is invalid
2022-12-26 08:53:18,188+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-96) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = gilboa-home-hv1-dev.localdomain, VdsIdAndVdsVDSCommandParametersBase:{hostId='43ddfcd5-4bd1-4731-bf30-4fedce22f3ab', vds='Host[gilboa-home-hv1-dev.localdomain,43ddfcd5-4bd1-4731-bf30-4fedce22f3ab]'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: SSL session is invalid
2022-12-26 08:53:18,348+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM gilboa-home-hv2-srv.localdomain command Get Host Capabilities failed: Message timeout which can be caused by communication issues
2022-12-26 08:53:18,348+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [] Unable to RefreshCapabilities: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

- Gilboa

On Sun, Dec 25, 2022 at 5:13 PM Gilboa Davara <gilboad@gmail.com> wrote:
Managed to find an old email in this group (that I saved...)
https://lists.ovirt.org/archives/list/users@ovirt.org/message/56QU2AD7YUX2VZ...
This got the nodes working... but the engine (GRRR) still cannot connect to the nodes (I assume it has expired certs as well), hence, it cannot detect the cluster is in global maintenance mode, and cannot run engine-setup.
Add issue https://github.com/oVirt/ovirt-engine/issues/784
- Gilboa
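
A note on the PKIX error in the log quoted earlier in this thread: "unable to find valid certification path to requested target" is what Java reports when it cannot build a chain from the peer's certificate to a CA it trusts, which would fit hosts presenting freshly regenerated vdsm certificates. Below is a minimal sketch for checking this by hand. It assumes openssl is installed and that the engine CA lives at /etc/pki/ovirt-engine/ca.pem (a common location, but not stated in this thread); the function names are made up for illustration and are not part of oVirt or vdsm.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting certificates; not part of oVirt or vdsm.

# chains_to_ca CA_PEM CERT_PEM
# Returns 0 when openssl can verify CERT_PEM with CA_PEM as a trusted CA.
# CAs in openssl's default CApath are consulted as well, and an expired
# certificate fails verification too.
chains_to_ca() {
    openssl verify -CAfile "$1" "$2" >/dev/null 2>&1
}

# show_issuer_and_dates CERT_PEM
# Prints the issuer, subject and validity window of a certificate.
show_issuer_and_dates() {
    openssl x509 -in "$1" -noout -issuer -subject -dates
}

# Example (paths are assumptions; copy the host cert to the engine first):
#   show_issuer_and_dates /tmp/vdsmcert.pem
#   if chains_to_ca /etc/pki/ovirt-engine/ca.pem /tmp/vdsmcert.pem; then
#       echo "host cert chains to the engine CA"
#   else
#       echo "host cert does NOT chain to the engine CA"
#   fi
```

If the host certificate does not chain to the engine CA, the engine has no reason to trust it, and the host certificate would need to be re-issued by the engine (the "enroll certificate" step discussed later in the thread).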

Forgot to add: Re-enrolling the certificates fails, as the engine cannot connect to the nodes...
- Gilboa
On Mon, Dec 26, 2022 at 8:58 AM Gilboa Davara <gilboad@gmail.com> wrote:
OK.
Managed to get the engine up and running. But now it fails to communicate with the nodes :/ ... But at least I have an engine running...

No worries, we all came across this issue. As long as the hosted engine is running on Gluster, you can shut it down and bring it up on any of the other nodes. Now, in order to bring a node back up in the cluster, you will have to manually replace the vdsm cert on each node, followed by re-enrolling the certificate.

The steps are:

To check whether the cert has expired:
# openssl x509 -in /etc/pki/vdsm/certs/vdsmcert.pem -noout -dates

1. Back up the vdsm folder
# cd /etc/pki
# mv vdsm vdsm.orig
# mkdir vdsm ; chown vdsm:kvm vdsm
# cd vdsm
# mkdir libvirt-vnc certs keys libvirt-spice libvirt-migrate
# chown vdsm:kvm libvirt-vnc certs keys libvirt-spice libvirt-migrate

2. Regenerate cert & keys
# vdsm-tool configure --module certificates

3. Copy the certs to their destination locations
# chmod 440 /etc/pki/vdsm/keys/vdsmkey.pem
# chown root /etc/pki/vdsm/certs/*pem
# chmod 644 /etc/pki/vdsm/certs/*pem

# cp /etc/pki/vdsm/certs/cacert.pem /etc/pki/vdsm/libvirt-spice/ca-cert.pem
# cp /etc/pki/vdsm/keys/vdsmkey.pem /etc/pki/vdsm/libvirt-spice/server-key.pem
# cp /etc/pki/vdsm/certs/vdsmcert.pem /etc/pki/vdsm/libvirt-spice/server-cert.pem

# cp /etc/pki/vdsm/certs/cacert.pem /etc/pki/vdsm/libvirt-vnc/ca-cert.pem
# cp /etc/pki/vdsm/keys/vdsmkey.pem /etc/pki/vdsm/libvirt-vnc/server-key.pem
# cp /etc/pki/vdsm/certs/vdsmcert.pem /etc/pki/vdsm/libvirt-vnc/server-cert.pem

# cp -p /etc/pki/vdsm/certs/cacert.pem /etc/pki/vdsm/libvirt-migrate/ca-cert.pem
# cp -p /etc/pki/vdsm/keys/vdsmkey.pem /etc/pki/vdsm/libvirt-migrate/server-key.pem
# cp -p /etc/pki/vdsm/certs/vdsmcert.pem /etc/pki/vdsm/libvirt-migrate/server-cert.pem

# chown root:qemu /etc/pki/vdsm/libvirt-migrate/server-key.pem

# cp -p /etc/pki/vdsm.orig/keys/libvirt_password /etc/pki/vdsm/keys/

# mv /etc/pki/libvirt/clientcert.pem /etc/pki/libvirt/clientcert.pem.orig
# mv /etc/pki/libvirt/private/clientkey.pem /etc/pki/libvirt/private/clientkey.pem.orig
# mv /etc/pki/CA/cacert.pem /etc/pki/CA/cacert.pem.orig

# cp -p /etc/pki/vdsm/certs/vdsmcert.pem /etc/pki/libvirt/clientcert.pem
# cp -p /etc/pki/vdsm/keys/vdsmkey.pem /etc/pki/libvirt/private/clientkey.pem
# cp -p /etc/pki/vdsm/certs/cacert.pem /etc/pki/CA/cacert.pem

4. Cross-check the backup folder /etc/pki/vdsm.orig against /etc/pki/vdsm: refer to /etc/pki/vdsm.orig/*/ and set the correct owner & group permissions in /etc/pki/vdsm/*/

5. Restart services (make sure both services are up)
# systemctl restart vdsmd libvirtd

6. Reboot the node, manually confirm that the host has been rebooted, and put the host in maintenance mode

7. Enroll certificate (DO NOT re-install), then exit maintenance mode

Cheers from Singapore.
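
The "check whether the cert has expired" command in the steps above prints the dates for a human to read. As a rough sketch of how the same check could become an early warning, the helpers below compute the number of whole days left before notAfter. This assumes GNU date (for -d) and openssl; the function names and the 30-day threshold in the example are illustrative only, not something from this thread.

```shell
#!/bin/sh
# Hypothetical helpers; not part of vdsm or oVirt.

# days_left NOTAFTER NOW_EPOCH
# Prints the whole days between NOW_EPOCH and the date string NOTAFTER
# (the format openssl prints, e.g. "Dec 25 10:00:00 2022 GMT").
# Negative output means the date is already in the past. Needs GNU date.
days_left() {
    end_epoch=$(date -u -d "$1" +%s) || return 1
    echo $(( (end_epoch - $2) / 86400 ))
}

# cert_days_left CERT_PEM
# Reads notAfter from a certificate and prints the days left from now.
cert_days_left() {
    not_after=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
    [ -n "$not_after" ] || return 1
    days_left "$not_after" "$(date -u +%s)"
}

# Example (threshold is arbitrary):
#   left=$(cert_days_left /etc/pki/vdsm/certs/vdsmcert.pem) || exit 1
#   [ "$left" -ge 30 ] || echo "vdsm cert expires in $left day(s)"
```

Run from cron on each host, something like this would flag a certificate well before vdsm stops accepting it.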

One important note:
ln -sf /etc/pki/vdsm/libvirt-vnc/server-key.pem /etc/pki/vdsm/libvirt-migrate/client-key.pem
ln -sf /etc/pki/vdsm/libvirt-vnc/server-cert.pem /etc/pki/vdsm/libvirt-migrate/client-cert.pem
Enrol will fail if client-*.pem doesn't exist and/or is not a symbolic link.
- Gilboa
On Tue, Dec 27, 2022 at 5:29 AM dhanaraj.ramesh--- via Users <users@ovirt.org> wrote:
No worries, we all came across this issue. As long as the hosted engine is running on Gluster, you can shut it down and bring it up on any of the other nodes. Now, in order to bring a node back up in the cluster, you will have to manually replace the vdsm cert on each node, followed by re-enrolling the certificate.
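
Gilboa's note about the client-*.pem symlinks can be wrapped in a small function that also checks the result, since enrolment reportedly fails when the entries are missing or are not symbolic links. This is only a sketch: the function name is made up, and the optional BASE argument exists purely so it can be tried on a scratch directory before touching /etc/pki/vdsm.

```shell
#!/bin/sh
# Hypothetical helper; not part of vdsm.

# link_migrate_client_certs [BASE]
# Creates the two libvirt-migrate client symlinks pointing at the
# libvirt-vnc server key/cert, then confirms that each one is a
# symbolic link whose target exists. BASE defaults to /etc/pki/vdsm.
link_migrate_client_certs() {
    base=${1:-/etc/pki/vdsm}
    ln -sf "$base/libvirt-vnc/server-key.pem" \
        "$base/libvirt-migrate/client-key.pem" || return 1
    ln -sf "$base/libvirt-vnc/server-cert.pem" \
        "$base/libvirt-migrate/client-cert.pem" || return 1
    for f in client-key.pem client-cert.pem; do
        if [ ! -L "$base/libvirt-migrate/$f" ] || [ ! -e "$base/libvirt-migrate/$f" ]; then
            echo "problem with $base/libvirt-migrate/$f" >&2
            return 1
        fi
    done
}

# Example:
#   link_migrate_client_certs            # operates on /etc/pki/vdsm
#   link_migrate_client_certs /tmp/pki   # dry run on a scratch copy
```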

On Sun, Dec 25, 2022 at 5:15 PM Gilboa Davara <gilboad@gmail.com> wrote:
Managed to find an old email in this group (that I saved...) https://lists.ovirt.org/archives/list/users@ovirt.org/message/56QU2AD7YUX2VZ...
This got the nodes working... but the engine (GRRR) still cannot connect to the nodes (I assume it has expired certs as well), hence, it cannot detect the cluster is in global maintenance mode, and cannot run engine-setup.
Sorry, I do not follow. Is your immediate obstacle that engine-setup refuses to continue, saying "Hosted Engine HA is in Global Maintenance mode."?
You can cause it to ignore this test by passing 'OVESETUP_CONFIG/continueSetupOnHEVM=bool:True' (in the answer file or via --otopi-environment).
We recently added an option, 'engine-setup --show-environment-documentation', exactly for this env key; see also: https://bugzilla.redhat.com/show_bug.cgi?id=1700460
Best regards,
-- Didi
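
For readers who want to try Didi's suggestion, here is a small sketch of the two ways of passing that key. The key itself and --otopi-environment come from the message above; the [environment:default] section header and the --config-append option are my understanding of how engine-setup answer files are normally laid out and loaded, so treat them as assumptions and check 'engine-setup --help' on your version. The function name and file path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper; writes an answer file, does not run engine-setup.

# write_he_answer_file PATH
# Writes a minimal answer file that sets the key named in this thread.
# The [environment:default] header is assumed, not confirmed in the thread.
write_he_answer_file() {
    cat > "$1" <<'EOF'
[environment:default]
OVESETUP_CONFIG/continueSetupOnHEVM=bool:True
EOF
}

# Example, via an answer file (path is arbitrary):
#   write_he_answer_file /root/he-continue.conf
#   engine-setup --config-append=/root/he-continue.conf
#
# Example, directly on the command line:
#   engine-setup --otopi-environment="OVESETUP_CONFIG/continueSetupOnHEVM=bool:True"
```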

On Tue, Dec 27, 2022 at 8:39 AM Yedidyah Bar David <didi@redhat.com> wrote:
Sorry, I do not follow. Is your immediate obstacle being that engine-setup refuses to continue, saying "Hosted Engine HA is in Global Maintenance mode."?
You can cause it to ignore this test by passing 'OVESETUP_CONFIG/continueSetupOnHEVM=bool:True' (in the answer file or --otopi-environment).
We recently added an option 'engine-setup --show-environment-documentation', exactly for this env key, see also: https://bugzilla.redhat.com/show_bug.cgi?id=1700460
(BTW, I now see that I warned there against trying to parse the output, as it might change in the future, and that I have indeed already "broken" it: https://github.com/oVirt/otopi/pull/22 . If anyone volunteers to enhance this, either by adding some override to otopi's call to textwrap.wrap or perhaps some '--json' option or whatever, great!)
-- Didi

Hello,
On Tue, Dec 27, 2022 at 8:40 AM Yedidyah Bar David <didi@redhat.com> wrote:
Sorry, I do not follow. Is your immediate obstacle being that engine-setup refuses to continue, saying "Hosted Engine HA is in Global Maintenance mode."?
You can cause it to ignore this test by passing 'OVESETUP_CONFIG/continueSetupOnHEVM=bool:True' (in the answer file or --otopi-environment).
We recently added an option 'engine-setup --show-environment-documentation', exactly for this env key, see also:
https://bugzilla.redhat.com/show_bug.cgi?id=1700460
Best regards, -- Didi
I actually managed to bypass the check by editing he.py and deleting the "raise" statement, preventing hosted-engine from bombing out because it wasn't able to connect to the nodes.
From there I managed to renew the certificates (see second mail), and even connected two of the 3 nodes successfully (I had to create new temporary vdsm certificates, get them semi-connected to the engine, and then "re-enroll certificates" from the UI).
Once I had a limping cluster up, I shut everything down cleanly, and... redeployed the cluster from scratch (with all the failed attempts, my HE was completely busted).
That said, I wonder if having to short circuit the environment variable isn't a bit over-complicated, given the considerable number of cert related issues. But thanks for the heads-up.
Q: I'm willing to try and document all the steps I did in my semi-successful attempt to save my cluster. That said, I'd rather not document wrong / broken steps. Can anyone @RH review my writeup?
- Gilboa

On Tue, Dec 27, 2022 at 6:18 PM Gilboa Davara <gilboad@gmail.com> wrote:
Hello,
I actually managed to bypass the check by editing he.py and deleting the "raise" statement, preventing hosted-engine from bombing out because it wasn't able to connect to the nodes. From there I managed to renew the certificates (see second mail), and even connected two of the 3 nodes successfully (I had to create new temporary vdsm certificates, get them semi-connected to the engine, and then "re-enroll certificates" from the UI). Once I had a limping cluster up, I shut everything down cleanly, and... redeployed the cluster from scratch (with all the failed attempts, my HE was completely busted). That said, I wonder if having to short circuit the environment variable isn't a bit over-complicated, given the considerable number of cert related issues.
I do not think it's "over complicated" in any technical sense - just one command line to copy/paste from somewhere. I'd say it's mainly that knowing that this is the solution to your exact problem is the hard thing.
But thanks for the heads-up.
Q: I'm willing to try and document all the steps I did in my semi-successful attempt to save my cluster.
I think that would be great.
That said, I'd rather not document wrong / broken steps. Can anyone @RH review my writeup?
Sure! But consider how you intend to publish it. If as something like a blog post (on ovirt.org or your own blog or whatever), that's less "authoritative" and understandably more local/specific. If you consider integrating it into the official guides, that's more delicate.
-- Didi
Participants (3): dhanaraj.ramesh@yahoo.com, Gilboa Davara, Yedidyah Bar David