
On Fri, Feb 18, 2022 at 19:03, Joseph Gelinas <joseph@gelinas.cc> wrote:

Hi,

The certificates on our oVirt stack recently expired. All the VMs are still up, but I can't put the cluster into global maintenance via ovirt-engine, or do anything via ovirt-engine for that matter. I just get event logs about cert validity:

    VDSM ovirt-1.xxxxx.com command Get Host Capabilities failed: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
    VDSM ovirt-2.xxxxx.com command Get Host Capabilities failed: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
    VDSM ovirt-3.xxxxx.com command Get Host Capabilities failed: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed

Under Compute -> Hosts, all hosts are in status Unassigned. The Default data center is in status Non Responsive.

I have tried a couple of solutions for regenerating the certificates without much luck, and have copied the originals back into place:

https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.3/htm...
https://access.redhat.com/solutions/2409751

I have seen things saying that running engine-setup will generate new certs. However, the engine doesn't think the cluster is in global maintenance, so it won't run. I believe I can get around that check with `engine-setup --otopi-environment=OVESETUP_CONFIG/continueSetupOnHEVM=bool:True`, but is that the right thing to do? Will it deploy the certs onto the hosts as well so things communicate properly? It looks like one is supposed to put a node into maintenance and re-enroll it after running engine-setup, but will the engine even be able to put the nodes into maintenance, given that I can't do anything with them now?

Appreciate any ideas.
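P.S. For anyone following along: which certificates have actually expired is easy to confirm with plain openssl. The paths below are the default oVirt locations; adjust if your deployment differs.

    # On the engine VM:
    openssl x509 -noout -enddate -in /etc/pki/ovirt-engine/ca.pem
    # On each host (the certificate VDSM presents):
    openssl x509 -noout -enddate -in /etc/pki/vdsm/certs/vdsmcert.pem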

On Feb 18, 2022, at 14:20, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:

To set the engine into maintenance mode, you can ssh to any hypervisor and run 'hosted-engine --set-maintenance --mode=global', wait 1 minute, and then run 'hosted-engine --vm-status' to validate.

Best Regards,
Strahil Nikolov
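P.S. Spelled out as a sequence, that is roughly:

    hosted-engine --set-maintenance --mode=global
    sleep 60
    hosted-engine --vm-status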

On Fri, Feb 18, 2022 at 21:48, Joseph Gelinas <joseph@gelinas.cc> wrote:

I may be in maintenance mode; I did try to set it at the beginning of this, but engine-setup doesn't see it. At this point my nodes say they can't connect to the HA daemon, or have stale data.

    [root@ovirt-1 ~]# hosted-engine --set-maintenance --mode=global
    Cannot connect to the HA daemon, please check the logs.

    [root@ovirt-3 ~]# hosted-engine --set-maintenance --mode=global
    Cannot connect to the HA daemon, please check the logs.

    [root@ovirt-2 ~]# hosted-engine --set-maintenance --mode=global
    [root@ovirt-2 ~]# hosted-engine --vm-status

    !! Cluster is in GLOBAL MAINTENANCE mode !!

    --== Host ovirt-1.xxxxxx.com (id: 1) status ==--

    Host ID                : 1
    Host timestamp         : 6750990
    Score                  : 0
    Engine status          : unknown stale-data
    Hostname               : ovirt-1.xxxxxx.com
    Local maintenance      : False
    stopped                : True
    crc32                  : 5290657b
    conf_on_shared_storage : True
    local_conf_timestamp   : 6750950
    Status up-to-date      : False
    Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=6750990 (Thu Feb 17 22:17:53 2022)
        host-id=1
        score=0
        vm_conf_refresh_time=6750950 (Thu Feb 17 22:17:12 2022)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True

    --== Host ovirt-3.xxxxxx.com (id: 2) status ==--

    Host ID                : 2
    Host timestamp         : 6731526
    Score                  : 0
    Engine status          : unknown stale-data
    Hostname               : ovirt-3.xxxxxx.com
    Local maintenance      : False
    stopped                : True
    crc32                  : 12c6b5c9
    conf_on_shared_storage : True
    local_conf_timestamp   : 6731486
    Status up-to-date      : False
    Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=6731526 (Thu Feb 17 15:29:37 2022)
        host-id=2
        score=0
        vm_conf_refresh_time=6731486 (Thu Feb 17 15:28:57 2022)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True

    --== Host ovirt-2.xxxxxx.com (id: 3) status ==--

    Host ID                : 3
    Host timestamp         : 6829853
    Score                  : 3400
    Engine status          : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
    Hostname               : ovirt-2.xxxxxx.com
    Local maintenance      : False
    stopped                : False
    crc32                  : 0779c0b8
    conf_on_shared_storage : True
    local_conf_timestamp   : 6829853
    Status up-to-date      : True
    Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=6829853 (Fri Feb 18 19:25:17 2022)
        host-id=3
        score=3400
        vm_conf_refresh_time=6829853 (Fri Feb 18 19:25:17 2022)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

    !! Cluster is in GLOBAL MAINTENANCE mode !!

ovirt-ha-agent on ovirt-1 and ovirt-3 just keeps trying to restart:

    MainThread::ERROR::2022-02-18 19:34:36,910::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
    MainThread::INFO::2022-02-18 19:34:36,910::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
    MainThread::INFO::2022-02-18 19:34:47,268::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.5 started
    MainThread::INFO::2022-02-18 19:34:47,280::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
    MainThread::ERROR::2022-02-18 19:35:47,629::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
        return action(he)
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
        return he.start_monitoring()
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 436, in start_monitoring
        self._initialize_vdsm()
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 595, in _initialize_vdsm
        logger=self._log
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
        __vdsm_json_rpc_connect(logger, timeout)
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 415, in __vdsm_json_rpc_connect
        timeout=VDSM_MAX_RETRY * VDSM_DELAY
    RuntimeError: Couldn't connect to VDSM within 60 seconds

ovirt-2's ovirt-hosted-engine-ha/agent.log has entries detecting global maintenance, though `systemctl status ovirt-ha-agent` shows Python exception errors from yesterday:

    MainThread::INFO::2022-02-18 19:39:10,452::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
    MainThread::INFO::2022-02-18 19:39:10,524::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400)

    Feb 17 18:49:12 ovirt-2.us1.vricon.com python3[1324125]: detected unhandled Python exception in '/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py'
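(The restart loop above is easiest to watch with the standard systemd tooling, nothing oVirt-specific beyond the unit names, e.g.:

    systemctl --no-pager status ovirt-ha-agent ovirt-ha-broker vdsmd
    journalctl -u ovirt-ha-agent --since today

in case anyone wants to reproduce the view.)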

On Feb 18, 2022, at 16:35, Strahil Nikolov via Users <users@ovirt.org> wrote:

ovirt-2 is 'state=GlobalMaintenance', but the other 2 nodes are unknown. Try to start ovirt-ha-broker & ovirt-ha-agent.

Also, you may try to move the hosted-engine to ovirt-2 and try again.

Best Regards,
Strahil Nikolov
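P.S. Something like this, on ovirt-1 and ovirt-3:

    systemctl start ovirt-ha-broker ovirt-ha-agent
    systemctl --no-pager status ovirt-ha-broker ovirt-ha-agent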

On Sat, Feb 19, 2022 at 0:25, Joseph Gelinas <joseph@gelinas.cc> wrote:

Unfortunately ovirt-ha-broker & ovirt-ha-agent are just in continual restart loops on ovirt-1 & ovirt-3 (ovirt-engine is currently on ovirt-3).

broker.log:

    MainThread::ERROR::2022-02-18 22:08:58,101::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
    MainThread::INFO::2022-02-18 22:08:58,453::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.5 started
    MainThread::INFO::2022-02-18 22:09:00,456::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
    MainThread::INFO::2022-02-18 22:09:00,456::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
    MainThread::INFO::2022-02-18 22:09:00,457::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
    MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
    MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
    MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
    MainThread::INFO::2022-02-18 22:09:00,460::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
    MainThread::INFO::2022-02-18 22:09:00,460::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
    MainThread::INFO::2022-02-18 22:09:00,460::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
    MainThread::WARNING::2022-02-18 22:10:00,788::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Couldn't connect to VDSM within 60 seconds
    MainThread::ERROR::2022-02-18 22:10:00,788::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Couldn't connect to VDSM within 60 seconds
    MainThread::ERROR::2022-02-18 22:10:00,789::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
        self._storage_broker_instance = self._get_storage_broker()
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
        return storage_broker.StorageBroker()
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
        self._backend.connect()
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 370, in connect
        connection = util.connect_vdsm_json_rpc(logger=self._logger)
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
        __vdsm_json_rpc_connect(logger, timeout)
      File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 415, in __vdsm_json_rpc_connect
        timeout=VDSM_MAX_RETRY * VDSM_DELAY
    RuntimeError: Couldn't connect to VDSM within 60 seconds

vdsm.log:

    2022-02-18 22:14:43,939+0000 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:726)
    2022-02-18 22:14:44,071+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48832 (protocoldetector:61)
    2022-02-18 22:14:44,074+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
    2022-02-18 22:14:44,442+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48836 (protocoldetector:61)
    2022-02-18 22:14:44,445+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
    2022-02-18 22:14:45,077+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48838 (protocoldetector:61)
    2022-02-18 22:14:45,435+0000 INFO (periodic/2) [vdsm.api] START repoStats(domains=()) from=internal, task_id=2dd417e7-0f4f-4a09-a1af-725f267af135 (api:48)
    2022-02-18 22:14:45,435+0000 INFO (periodic/2) [vdsm.api] FINISH repoStats return={} from=internal, task_id=2dd417e7-0f4f-4a09-a1af-725f267af135 (api:54)
    2022-02-18 22:14:45,438+0000 WARN (periodic/2) [root] Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished? (api:194)
    2022-02-18 22:14:45,447+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48840 (protocoldetector:61)
    2022-02-18 22:14:45,449+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
    2022-02-18 22:14:46,082+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48842 (protocoldetector:61)
    2022-02-18 22:14:46,084+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
    2022-02-18 22:14:46,452+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48844 (protocoldetector:61)
    2022-02-18 22:14:46,455+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
    2022-02-18 22:14:47,087+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48846 (protocoldetector:61)
    2022-02-18 22:14:47,089+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
    2022-02-18 22:14:47,457+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48848 (protocoldetector:61)
    2022-02-18 22:14:47,459+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
    2022-02-18 22:14:48,092+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48850 (protocoldetector:61)
    2022-02-18 22:14:48,094+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
    2022-02-18 22:14:48,461+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48852 (protocoldetector:61)
    2022-02-18 22:14:48,464+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
    2022-02-18 22:14:48,941+0000 INFO (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList(options=None) from=internal, task_id=75ef5d5f-c56b-4595-95c8-3dc64caa3a83 (api:48)
    2022-02-18 22:14:48,942+0000 INFO (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=75ef5d5f-c56b-4595-95c8-3dc64caa3a83 (api:54)
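Presumably the repeating SSLError on ::1 is VDSM rejecting its own local clients because of the expired cert. Something like this (plain openssl; 54321 is VDSM's default port) should show what cert VDSM is actually presenting:

    openssl s_client -connect localhost:54321 </dev/null 2>/dev/null | openssl x509 -noout -dates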

On Sat, Feb 19, 2022 at 2:11, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:

Based on this one, it looks like there is a way to init via sanlock (https://lists.ovirt.org/pipermail/users/2016-June/073652.html). It's something like this:

    sanlock direct init -s <sd_uuid>:0:/rhev/mnt/glusterSD/<host>_<volume>/<sd_uuid>/dom_md/ids:0

sd_uuid should be the directory (gluster) in /rhev/mnt/glusterSD/host_volume. Read the whole topic before taking your actions. As far as I know, nothing should be using the storage domain (stop ovirt-ha-broker & ovirt-ha-agent on all nodes).

Best Regards,
Strahil Nikolov
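P.S. Put together, and keeping to the mount path above, the whole sequence would look roughly like this (treat it as last-resort surgery on the 'ids' lockspace):

    # On ALL nodes first, so nothing touches the storage domain:
    systemctl stop ovirt-ha-agent ovirt-ha-broker
    # Then on one node; <sd_uuid> is the UUID-named directory under the mount:
    sanlock direct init -s <sd_uuid>:0:/rhev/mnt/glusterSD/<host>_<volume>/<sd_uuid>/dom_md/ids:0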

Strahil Nikolov <hunter86_bg@yahoo.com> wrote:

Disregard my previous e-mail... it's for another topic.
On Feb 18, 2022, at 16:35, Strahil Nikolov via Users <users@ovirt.org> wrote:
ovirt-2 is 'state=GlobalMaintenance', but the other 2 nodes are unknown. Try to start ovirt-ha-broker & ovirt-ha-agent.
Also, you may try to move the hosted-engine to ovirt-2 and try again.
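For example (a minimal sketch, assuming the standard oVirt systemd unit names):

    systemctl restart ovirt-ha-broker ovirt-ha-agent
    systemctl status ovirt-ha-broker ovirt-ha-agent
    # then, after a minute or so, re-check from any node:
    hosted-engine --vm-status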
Best Regards, Strahil Nikolov
On Fri, Feb 18, 2022 at 21:48, Joseph Gelinas <joseph@gelinas.cc> wrote:
I may be in maintenance mode; I did try to set it at the beginning of this, but engine-setup doesn't see it. At this point my nodes say they can't connect to the HA daemon, or have stale data.
[root@ovirt-1 ~]# hosted-engine --set-maintenance --mode=global
Cannot connect to the HA daemon, please check the logs.

[root@ovirt-3 ~]# hosted-engine --set-maintenance --mode=global
Cannot connect to the HA daemon, please check the logs.

[root@ovirt-2 ~]# hosted-engine --set-maintenance --mode=global
[root@ovirt-2 ~]# hosted-engine --vm-status
!! Cluster is in GLOBAL MAINTENANCE mode !!
--== Host ovirt-1.xxxxxx.com (id: 1) status ==--
Host ID                            : 1
Host timestamp                     : 6750990
Score                              : 0
Engine status                      : unknown stale-data
Hostname                           : ovirt-1.xxxxxx.com
Local maintenance                  : False
stopped                            : True
crc32                              : 5290657b
conf_on_shared_storage             : True
local_conf_timestamp               : 6750950
Status up-to-date                  : False
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=6750990 (Thu Feb 17 22:17:53 2022)
    host-id=1
    score=0
    vm_conf_refresh_time=6750950 (Thu Feb 17 22:17:12 2022)
    conf_on_shared_storage=True
    maintenance=False
    state=AgentStopped
    stopped=True
--== Host ovirt-3.xxxxxx.com (id: 2) status ==--
Host ID                            : 2
Host timestamp                     : 6731526
Score                              : 0
Engine status                      : unknown stale-data
Hostname                           : ovirt-3.xxxxxx.com
Local maintenance                  : False
stopped                            : True
crc32                              : 12c6b5c9
conf_on_shared_storage             : True
local_conf_timestamp               : 6731486
Status up-to-date                  : False
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=6731526 (Thu Feb 17 15:29:37 2022)
    host-id=2
    score=0
    vm_conf_refresh_time=6731486 (Thu Feb 17 15:28:57 2022)
    conf_on_shared_storage=True
    maintenance=False
    state=AgentStopped
    stopped=True
--== Host ovirt-2.xxxxxx.com (id: 3) status ==--
Host ID                            : 3
Host timestamp                     : 6829853
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : ovirt-2.xxxxxx.com
Local maintenance                  : False
stopped                            : False
crc32                              : 0779c0b8
conf_on_shared_storage             : True
local_conf_timestamp               : 6829853
Status up-to-date                  : True
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=6829853 (Fri Feb 18 19:25:17 2022)
    host-id=3
    score=3400
    vm_conf_refresh_time=6829853 (Fri Feb 18 19:25:17 2022)
    conf_on_shared_storage=True
    maintenance=False
    state=GlobalMaintenance
    stopped=False
!! Cluster is in GLOBAL MAINTENANCE mode !!
Ovirt-ha-agent on 1&3 just keeps trying to restart:
MainThread::ERROR::2022-02-18 19:34:36,910::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2022-02-18 19:34:36,910::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2022-02-18 19:34:47,268::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.5 started
MainThread::INFO::2022-02-18 19:34:47,280::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::ERROR::2022-02-18 19:35:47,629::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 436, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 595, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 415, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 60 seconds
Ovirt-2's ovirt-hosted-engine-ha/agent.log has entries detecting global maintenance, though `systemctl status ovirt-ha-agent` shows Python exception errors from yesterday.
MainThread::INFO::2022-02-18 19:39:10,452::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
MainThread::INFO::2022-02-18 19:39:10,524::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400)
Feb 17 18:49:12 ovirt-2.us1.vricon.com python3[1324125]: detected unhandled Python exception in '/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py'
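The full traceback behind that journal line should still be retrievable from the journal, e.g. (a sketch; adjust the date to match):

    journalctl --since "2022-02-17" --no-pager | grep -B 2 -A 20 vdsm_helper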

I believe I ran `hosted-engine --deploy` on ovirt-1 to see if there was an option to reenroll that way, but when it prompted and asked if this was really what I wanted to do, I pressed Ctrl-D or said no, and it ran something anyway, so I pressed Ctrl-C to get out of it; maybe that is what messed up vdsm on that node. Not sure about ovirt-3. Is there a way to fix that?
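One way to check whether vdsm itself is still healthy on a node is something like this (a sketch; vdsm-client ships alongside vdsm, and it may simply fail with the same SSL error, which is diagnostic in itself):

    systemctl status vdsmd
    vdsm-client Host getCapabilities | head
    journalctl -u vdsmd --since today --no-pager | tail -n 50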

Is your issue with the host certificates or the engine? You can try to set a node in maintenance (or at least try that) and then try to reenroll the certificate from the UI.

Best Regards,
Strahil Nikolov

Both, I guess. The host certificates expired on the 15th; the console certificate expires on the 23rd. Right now, since the engine sees the hosts as unassigned, I don't get the option to set hosts to maintenance mode, and if I try to set Enable Global Maintenance I get the message: "Cannot edit VM Cluster. Operation can be performed only when Host status is Up."
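One way to confirm exactly which certificates have expired is to read the dates straight off the files (a sketch, assuming the default oVirt PKI locations):

    # on each host:
    openssl x509 -enddate -noout -in /etc/pki/vdsm/certs/vdsmcert.pem
    # on the engine VM (engine CA-signed service certs):
    openssl x509 -enddate -noout -in /etc/pki/ovirt-engine/certs/engine.cer
    openssl x509 -enddate -noout -in /etc/pki/ovirt-engine/certs/apache.cer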

Do you have the option to use 'Install' -> enroll certificate (or whatever the entry in the UI is)?

Best Regards, Strahil Nikolov

On Sun, Feb 20, 2022 at 8:05, Joseph Gelinas <joseph@gelinas.cc> wrote: Both, I guess. The host certificates expired on the 15th; the console certificate expires on the 23rd. Right now, since the engine sees the hosts as Unassigned, I don't get the option to set hosts to maintenance mode, and if I try to set Enable Global Maintenance I get the message: "Cannot edit VM Cluster. Operation can be performed only when Host status is Up."
On Feb 19, 2022, at 14:55, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Is your issue with the host certificates or the engine?
You can try to set a node in maintenance (or at least try that) and then try to reenroll the certificate from the UI.
Best Regards, Strahil Nikolov
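When the Administration Portal won't offer the action, host maintenance can normally also be requested through the engine REST API. A sketch only: the engine FQDN, credentials, and host UUID below are placeholders, -k is there purely because the engine certificate itself has expired, and with the hosts Non Responsive it may well fail with the same error the UI gives:

curl -k -u 'admin@internal:PASSWORD' -X POST \
  -H 'Content-Type: application/xml' -H 'Accept: application/xml' \
  -d '<action/>' \
  'https://engine.example.com/ovirt-engine/api/hosts/HOST_UUID/deactivate'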
On Sat, Feb 19, 2022 at 9:48, Joseph Gelinas <joseph@gelinas.cc> wrote: I believe I ran `hosted-engine --deploy` on ovirt-1 to see if there was an option to reenroll that way, but when it prompted to confirm that was really what I wanted, I hit Ctrl-D (or said no) and it ran something anyway, so I hit Ctrl-C to get out of it; maybe that is what messed up VDSM on that node. Not sure about ovirt-3. Is there a way to fix that?
On Feb 18, 2022, at 17:21, Joseph Gelinas <joseph@gelinas.cc> wrote:
Unfortunately ovirt-ha-broker & ovirt-ha-agent are just in continual restart loops on ovirt-1 & ovirt-3 (ovirt-engine is currently on ovirt-3).
The output for broker.log:
MainThread::ERROR::2022-02-18 22:08:58,101::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
MainThread::INFO::2022-02-18 22:08:58,453::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.5 started
MainThread::INFO::2022-02-18 22:09:00,456::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2022-02-18 22:09:00,456::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2022-02-18 22:09:00,457::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2022-02-18 22:09:00,460::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2022-02-18 22:09:00,460::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2022-02-18 22:09:00,460::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
MainThread::WARNING::2022-02-18 22:10:00,788::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Couldn't connect to VDSM within 60 seconds
MainThread::ERROR::2022-02-18 22:10:00,788::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Couldn't connect to VDSM within 60 seconds
MainThread::ERROR::2022-02-18 22:10:00,789::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
    self._storage_broker_instance = self._get_storage_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
    return storage_broker.StorageBroker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
    self._backend.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 370, in connect
    connection = util.connect_vdsm_json_rpc(logger=self._logger)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 415, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 60 seconds
vdsm.log:
2022-02-18 22:14:43,939+0000 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:726)
2022-02-18 22:14:44,071+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48832 (protocoldetector:61)
2022-02-18 22:14:44,074+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:44,442+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48836 (protocoldetector:61)
2022-02-18 22:14:44,445+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:45,077+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48838 (protocoldetector:61)
2022-02-18 22:14:45,435+0000 INFO (periodic/2) [vdsm.api] START repoStats(domains=()) from=internal, task_id=2dd417e7-0f4f-4a09-a1af-725f267af135 (api:48)
2022-02-18 22:14:45,435+0000 INFO (periodic/2) [vdsm.api] FINISH repoStats return={} from=internal, task_id=2dd417e7-0f4f-4a09-a1af-725f267af135 (api:54)
2022-02-18 22:14:45,438+0000 WARN (periodic/2) [root] Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished? (api:194)
2022-02-18 22:14:45,447+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48840 (protocoldetector:61)
2022-02-18 22:14:45,449+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:46,082+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48842 (protocoldetector:61)
2022-02-18 22:14:46,084+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:46,452+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48844 (protocoldetector:61)
2022-02-18 22:14:46,455+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:47,087+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48846 (protocoldetector:61)
2022-02-18 22:14:47,089+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:47,457+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48848 (protocoldetector:61)
2022-02-18 22:14:47,459+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:48,092+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48850 (protocoldetector:61)
2022-02-18 22:14:48,094+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:48,461+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48852 (protocoldetector:61)
2022-02-18 22:14:48,464+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:48,941+0000 INFO (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList(options=None) from=internal, task_id=75ef5d5f-c56b-4595-95c8-3dc64caa3a83 (api:48)
2022-02-18 22:14:48,942+0000 INFO (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=75ef5d5f-c56b-4595-95c8-3dc64caa3a83 (api:54)
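Those repeated SSLHandshakeDispatcher errors are what an expired VDSM certificate looks like from the server side. A quick check, assuming the default VDSM certificate path and 54321 as VDSM's usual port (adjust both if your install differs):

openssl x509 -noout -subject -enddate -in /etc/pki/vdsm/certs/vdsmcert.pem
openssl s_client -connect localhost:54321 </dev/null 2>/dev/null | openssl x509 -noout -enddate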
On Feb 18, 2022, at 16:35, Strahil Nikolov via Users <users@ovirt.org> wrote:
ovirt-2 is 'state=GlobalMaintenance', but the other 2 nodes are unknown. Try to start ovirt-ha-broker & ovirt-ha-agent.
Also, you may try to move the hosted-engine to ovirt-2 and try again.
Best Regards, Strahil Nikolov
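For completeness, a sketch of that suggestion on each affected host, using the service names as they appear in this thread; with VDSM unreachable because of the certificates, expect them to keep restart-looping until the certs are fixed:

systemctl restart ovirt-ha-broker ovirt-ha-agent
systemctl status ovirt-ha-broker ovirt-ha-agent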
On Fri, Feb 18, 2022 at 21:48, Joseph Gelinas <joseph@gelinas.cc> wrote: I may be in maintenance mode; I did try to set it at the beginning of this, but engine-setup doesn't see it. At this point my nodes say they can't connect to the HA daemon, or have stale data.
[root@ovirt-1 ~]# hosted-engine --set-maintenance --mode=global
Cannot connect to the HA daemon, please check the logs.

[root@ovirt-3 ~]# hosted-engine --set-maintenance --mode=global
Cannot connect to the HA daemon, please check the logs.

[root@ovirt-2 ~]# hosted-engine --set-maintenance --mode=global
[root@ovirt-2 ~]# hosted-engine --vm-status
!! Cluster is in GLOBAL MAINTENANCE mode !!
--== Host ovirt-1.xxxxxx.com (id: 1) status ==--
Host ID                            : 1
Host timestamp                     : 6750990
Score                              : 0
Engine status                      : unknown stale-data
Hostname                           : ovirt-1.xxxxxx.com
Local maintenance                  : False
stopped                            : True
crc32                              : 5290657b
conf_on_shared_storage             : True
local_conf_timestamp               : 6750950
Status up-to-date                  : False
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=6750990 (Thu Feb 17 22:17:53 2022)
    host-id=1
    score=0
    vm_conf_refresh_time=6750950 (Thu Feb 17 22:17:12 2022)
    conf_on_shared_storage=True
    maintenance=False
    state=AgentStopped
    stopped=True
--== Host ovirt-3.xxxxxx.com (id: 2) status ==--
Host ID                            : 2
Host timestamp                     : 6731526
Score                              : 0
Engine status                      : unknown stale-data
Hostname                           : ovirt-3.xxxxxx.com
Local maintenance                  : False
stopped                            : True
crc32                              : 12c6b5c9
conf_on_shared_storage             : True
local_conf_timestamp               : 6731486
Status up-to-date                  : False
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=6731526 (Thu Feb 17 15:29:37 2022)
    host-id=2
    score=0
    vm_conf_refresh_time=6731486 (Thu Feb 17 15:28:57 2022)
    conf_on_shared_storage=True
    maintenance=False
    state=AgentStopped
    stopped=True
--== Host ovirt-2.xxxxxx.com (id: 3) status ==--
Host ID                            : 3
Host timestamp                     : 6829853
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : ovirt-2.xxxxxx.com
Local maintenance                  : False
stopped                            : False
crc32                              : 0779c0b8
conf_on_shared_storage             : True
local_conf_timestamp               : 6829853
Status up-to-date                  : True
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=6829853 (Fri Feb 18 19:25:17 2022)
    host-id=3
    score=3400
    vm_conf_refresh_time=6829853 (Fri Feb 18 19:25:17 2022)
    conf_on_shared_storage=True
    maintenance=False
    state=GlobalMaintenance
    stopped=False
!! Cluster is in GLOBAL MAINTENANCE mode !!

Right, I don't have those options, because the hosts are listed as Unassigned. I can't migrate the engine, and I can't put anything into maintenance so that the Installation menu becomes available.

No. I don't have any of the options under Installation.

Take a backup of the engine, if you haven't done so already. Then, with the virsh alias, try to migrate:

ssh root@<target_host> 'uptime'
virsh migrate --live HostedEngine qemu+ssh://<target_host>/system

Best Regards, Strahil Nikolov
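Before migrating, it is worth confirming where the HostedEngine VM is actually running; on an oVirt host, read-only virsh should work without the SASL credentials a full virsh session prompts for:

virsh -r list --all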
On Feb 20, 2022, at 07:52, Strahil Nikolov via Users <users@ovirt.org> wrote:
Do you have the option to use 'Install' -> enroll certificate (or whatever is the entry in UI ) ?
Best Regards, Strahil Nikolov
On Sun, Feb 20, 2022 at 8:05, Joseph Gelinas <joseph@gelinas.cc> wrote: Both I guess. The host certificates expired on the 15th the console expires on the 23. Right now since the engine sees the hosts as unassigned I don't get the option to set hosts to maintenance mode and if I try to set Enable Global Maintenance I get the message: "Cannot edit VM Cluster. Operation can be performed only when Hoist status is Up."
On Feb 19, 2022, at 14:55, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Is your issue with the host certificates or the engine ?
You can try to set a node in maintenance (or at least try that) and then try to reenroll the certificate from the UI.
Best Regards, Strahil Nikolov
On Sat, Feb 19, 2022 at 9:48, Joseph Gelinas <joseph@gelinas.cc> wrote: I believe I ran `hosted-engine --deploy` on ovirt-1 to see if there was an option to reenroll that way, but when it prompted and asked if it was really what I wanted to do I ctrl-D or said no and it ran something anyways, so I ctrl-C out of it and maybe that is what messed up vdsm on that node. Not sure about ovirt-3, is there a way to fix that?
On Feb 18, 2022, at 17:21, Joseph Gelinas <joseph@gelinas.cc> wrote:
Unfortunately ovirt-ha-broker & ovirt-ha-agent are just in continual restart loops on ovirt-1 & ovirt-3 (ovirt-engine is currently on ovirt-3).
The output for broker.log:
MainThread::ERROR::2022-02-18 22:08:58,101::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker MainThread::INFO::2022-02-18 22:08:58,453::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.5 started MainThread::INFO::2022-02-18 22:09:00,456::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors MainThread::INFO::2022-02-18 22:09:00,456::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free MainThread::INFO::2022-02-18 22:09:00,457::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network MainThread::INFO::2022-02-18 22:09:00,460::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain MainThread::INFO::2022-02-18 22:09:00,460::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load MainThread::INFO::2022-02-18 22:09:00,460::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors MainThread::WARNING::2022-02-18 22:10:00,788::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Couldn't connect to VDSM within 60 seconds MainThread::ERROR::2022-02-18 22:10:00,788::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Couldn't connect to VDSM within 60 seconds MainThread::ERROR::2022-02-18 22:10:00,789::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run self._storage_broker_instance = self._get_storage_broker() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker return storage_broker.StorageBroker() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__ self._backend.connect() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 370, in connect connection = util.connect_vdsm_json_rpc(logger=self._logger) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc __vdsm_json_rpc_connect(logger, timeout) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 415, in __vdsm_json_rpc_connect timeout=VDSM_MAX_RETRY * VDSM_DELAY RuntimeError: Couldn't connect to VDSM within 60 seconds
vdsm.log:
2022-02-18 22:14:43,939+0000 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:726)
2022-02-18 22:14:44,071+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48832 (protocoldetector:61)
2022-02-18 22:14:44,074+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:44,442+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48836 (protocoldetector:61)
2022-02-18 22:14:44,445+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:45,077+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48838 (protocoldetector:61)
2022-02-18 22:14:45,435+0000 INFO (periodic/2) [vdsm.api] START repoStats(domains=()) from=internal, task_id=2dd417e7-0f4f-4a09-a1af-725f267af135 (api:48)
2022-02-18 22:14:45,435+0000 INFO (periodic/2) [vdsm.api] FINISH repoStats return={} from=internal, task_id=2dd417e7-0f4f-4a09-a1af-725f267af135 (api:54)
2022-02-18 22:14:45,438+0000 WARN (periodic/2) [root] Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished? (api:194)
2022-02-18 22:14:45,447+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48840 (protocoldetector:61)
2022-02-18 22:14:45,449+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:46,082+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48842 (protocoldetector:61)
2022-02-18 22:14:46,084+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:46,452+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48844 (protocoldetector:61)
2022-02-18 22:14:46,455+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:47,087+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48846 (protocoldetector:61)
2022-02-18 22:14:47,089+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:47,457+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48848 (protocoldetector:61)
2022-02-18 22:14:47,459+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:48,092+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48850 (protocoldetector:61)
2022-02-18 22:14:48,094+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:48,461+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48852 (protocoldetector:61)
2022-02-18 22:14:48,464+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:48,941+0000 INFO (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList(options=None) from=internal, task_id=75ef5d5f-c56b-4595-95c8-3dc64caa3a83 (api:48)
2022-02-18 22:14:48,942+0000 INFO (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=75ef5d5f-c56b-4595-95c8-3dc64caa3a83 (api:54)
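(Those repeating SSLHandshakeDispatcher errors on connections from ::1 are consistent with the host-side certificates themselves having expired, so even local clients can't complete the TLS handshake with VDSM. A quick way to check expiry dates on a host; these are the usual VDSM/libvirt cert locations on oVirt 4.x hosts, adjust if your layout differs:)

  for f in /etc/pki/vdsm/certs/vdsmcert.pem \
           /etc/pki/vdsm/libvirt-spice/server-cert.pem \
           /etc/pki/libvirt/clientcert.pem; do
      echo "== $f =="
      openssl x509 -in "$f" -noout -enddate   # prints notAfter=<expiry date>
  done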
On Feb 18, 2022, at 16:35, Strahil Nikolov via Users <users@ovirt.org> wrote:
ovirt-2 is 'state=GlobalMaintenance', but the other two nodes are unknown. Try to start ovirt-ha-broker & ovirt-ha-agent.
Also, you may try to move the hosted-engine to ovirt-2 and try again.
Best Regards, Strahil Nikolov
On Fri, Feb 18, 2022 at 21:48, Joseph Gelinas <joseph@gelinas.cc> wrote: I may be in maintenance mode; I did try to set it at the beginning of this, but engine-setup doesn't see it. At this point my nodes say they can't connect to the HA daemon, or have stale data.
[root@ovirt-1 ~]# hosted-engine --set-maintenance --mode=global
Cannot connect to the HA daemon, please check the logs.

[root@ovirt-3 ~]# hosted-engine --set-maintenance --mode=global
Cannot connect to the HA daemon, please check the logs.

[root@ovirt-2 ~]# hosted-engine --set-maintenance --mode=global
[root@ovirt-2 ~]# hosted-engine --vm-status
!! Cluster is in GLOBAL MAINTENANCE mode !!
--== Host ovirt-1.xxxxxx.com (id: 1) status ==--

Host ID                 : 1
Host timestamp          : 6750990
Score                   : 0
Engine status           : unknown stale-data
Hostname                : ovirt-1.xxxxxx.com
Local maintenance       : False
stopped                 : True
crc32                   : 5290657b
conf_on_shared_storage  : True
local_conf_timestamp    : 6750950
Status up-to-date       : False
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=6750990 (Thu Feb 17 22:17:53 2022)
        host-id=1
        score=0
        vm_conf_refresh_time=6750950 (Thu Feb 17 22:17:12 2022)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True

--== Host ovirt-3.xxxxxx.com (id: 2) status ==--

Host ID                 : 2
Host timestamp          : 6731526
Score                   : 0
Engine status           : unknown stale-data
Hostname                : ovirt-3.xxxxxx.com
Local maintenance       : False
stopped                 : True
crc32                   : 12c6b5c9
conf_on_shared_storage  : True
local_conf_timestamp    : 6731486
Status up-to-date       : False
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=6731526 (Thu Feb 17 15:29:37 2022)
        host-id=2
        score=0
        vm_conf_refresh_time=6731486 (Thu Feb 17 15:28:57 2022)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True

--== Host ovirt-2.xxxxxx.com (id: 3) status ==--

Host ID                 : 3
Host timestamp          : 6829853
Score                   : 3400
Engine status           : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                : ovirt-2.xxxxxx.com
Local maintenance       : False
stopped                 : False
crc32                   : 0779c0b8
conf_on_shared_storage  : True
local_conf_timestamp    : 6829853
Status up-to-date       : True
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=6829853 (Fri Feb 18 19:25:17 2022)
        host-id=3
        score=3400
        vm_conf_refresh_time=6829853 (Fri Feb 18 19:25:17 2022)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False
!! Cluster is in GLOBAL MAINTENANCE mode !!
ovirt-ha-agent on ovirt-1 & ovirt-3 just keeps trying to restart:
MainThread::ERROR::2022-02-18 19:34:36,910::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2022-02-18 19:34:36,910::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2022-02-18 19:34:47,268::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.5 started
MainThread::INFO::2022-02-18 19:34:47,280::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::ERROR::2022-02-18 19:35:47,629::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 436, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 595, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 415, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 60 seconds
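(Once VDSM itself is reachable again, e.g. after the host certs have been renewed, restarting the stack in order usually clears these loops; a sketch using the standard oVirt service names:)

  systemctl restart vdsmd
  systemctl restart ovirt-ha-broker
  systemctl restart ovirt-ha-agent
  systemctl status ovirt-ha-agent --no-pager   # should stay up instead of looping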
Ovirt-2's ovirt-hosted-engine-ha/agent.log has entries detecting global maintenance, though `systemctl status ovirt-ha-agent` shows Python exception errors from yesterday.
MainThread::INFO::2022-02-18 19:39:10,452::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
MainThread::INFO::2022-02-18 19:39:10,524::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400)
Feb 17 18:49:12 ovirt-2.xxxxxx.com python3[1324125]: detected unhandled Python exception in '/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py'
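(The full traceback for that vdsm_helper.py exception should be recoverable from the journal, e.g.:)

  journalctl --since "2022-02-17" | grep -B 2 -A 20 vdsm_helper.py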

We got the engine moved to ovirt-2 and were able to resolve some of the certificate errors; however, we are still in a state where two of the hosts have status Unassigned, and we cannot put them into maintenance mode or otherwise do anything with them via the oVirt engine web interface. We are seeing issues in vdsm.log, agent.log, and broker.log on the two hosts in question; output attached. Appreciate any ideas on what to do next.
On Feb 21, 2022, at 01:01, Strahil Nikolov via Users <users@ovirt.org> wrote:
Take a backup of the engine, if you haven't already.
Then, with the virsh alias, try to migrate:
ssh root@<target_host> 'uptime'
virsh migrate --live HostedEngine qemu+ssh://<target_host>/system
Best Regards, Strahil Nikolov
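(For reference, a sketch of that procedure: the backup runs inside the engine VM, the migration from the host currently running it. The authfile path is the usual hosted-engine one, and <target_host> is a placeholder; adjust both for your setup.)

  # On the engine VM: take a full backup first
  engine-backup --mode=backup --scope=all \
      --file=/root/engine-$(date +%F).backup --log=/root/engine-backup.log

  # On the host running the engine VM: virsh needs the vdsm SASL credentials
  alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'
  ssh root@<target_host> 'uptime'   # confirm SSH to the target works first
  virsh migrate --live HostedEngine qemu+ssh://<target_host>/system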
On Sun, Feb 20, 2022 at 17:18, Joseph Gelinas <joseph@gelinas.cc> wrote: No. I don't have any of the options under Installation.
On Feb 20, 2022, at 07:52, Strahil Nikolov via Users <users@ovirt.org> wrote:
Do you have the option to use 'Install' -> enroll certificate (or whatever the entry in the UI is)?
Best Regards, Strahil Nikolov
On Sun, Feb 20, 2022 at 8:05, Joseph Gelinas <joseph@gelinas.cc> wrote: Both, I guess. The host certificates expired on the 15th; the console cert expires on the 23rd. Right now, since the engine sees the hosts as Unassigned, I don't get the option to set hosts to maintenance mode, and if I try to set Enable Global Maintenance I get the message: "Cannot edit VM Cluster. Operation can be performed only when Host status is Up."
On Feb 19, 2022, at 14:55, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Is your issue with the host certificates or the engine?
You can try to set a node in maintenance (or at least try that) and then try to reenroll the certificate from the UI.
Best Regards, Strahil Nikolov

Can you get downtime for a host? If yes, you can shut down the VMs from 'virsh' (or migrate them away) and then, from the UI, mark that the host was rebooted. Then oVirt should allow you to set the host to maintenance. Best Regards, Strahil Nikolov
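(A sketch of that, using the same virsh alias as above; <vm-name> and <target_host> are placeholders:)

  virsh list --all                      # see what is running on this host
  virsh shutdown <vm-name>              # graceful shutdown, repeat per VM
  # or move a guest instead of stopping it:
  virsh migrate --live <vm-name> qemu+ssh://<target_host>/system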

Did you manage to move the engine VM to the only node that's in global maintenance? Best Regards, Strahil Nikolov
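(To confirm where the engine VM currently runs, either of these should do; the grep is just a convenience:)

  hosted-engine --vm-status | grep -E 'Hostname|Engine status'
  virsh list | grep HostedEngine        # run per host, with the virsh alias set up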

Is there a way to do so without the web frontend? I don't have the option to migrate it there.
On Feb 20, 2022, at 07:56, Strahil Nikolov via Users <users@ovirt.org> wrote:
Did you manage to move the engine VM to the only node that's in global maintenance ?
Best Regards, Strahil Nikolov
On Sun, Feb 20, 2022 at 8:05, Joseph Gelinas <joseph@gelinas.cc> wrote: Both I guess. The host certificates expired on the 15th the console expires on the 23. Right now since the engine sees the hosts as unassigned I don't get the option to set hosts to maintenance mode and if I try to set Enable Global Maintenance I get the message: "Cannot edit VM Cluster. Operation can be performed only when Hoist status is Up."
On Feb 19, 2022, at 14:55, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Is your issue with the host certificates or the engine ?
You can try to set a node in maintenance (or at least try that) and then try to reenroll the certificate from the UI.
Best Regards, Strahil Nikolov
On Sat, Feb 19, 2022 at 9:48, Joseph Gelinas <joseph@gelinas.cc> wrote: I believe I ran `hosted-engine --deploy` on ovirt-1 to see if there was an option to reenroll that way, but when it prompted and asked if it was really what I wanted to do I ctrl-D or said no and it ran something anyways, so I ctrl-C out of it and maybe that is what messed up vdsm on that node. Not sure about ovirt-3, is there a way to fix that?
On Feb 18, 2022, at 17:21, Joseph Gelinas <joseph@gelinas.cc> wrote:
Unfortunately ovirt-ha-broker & ovirt-ha-agent are just in continual restart loops on ovirt-1 & ovirt-3 (ovirt-engine is currently on ovirt-3).
The output for broker.log:
MainThread::ERROR::2022-02-18 22:08:58,101::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker MainThread::INFO::2022-02-18 22:08:58,453::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.5 started MainThread::INFO::2022-02-18 22:09:00,456::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors MainThread::INFO::2022-02-18 22:09:00,456::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free MainThread::INFO::2022-02-18 22:09:00,457::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge MainThread::INFO::2022-02-18 22:09:00,459::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network MainThread::INFO::2022-02-18 22:09:00,460::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain MainThread::INFO::2022-02-18 22:09:00,460::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load MainThread::INFO::2022-02-18 22:09:00,460::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors MainThread::WARNING::2022-02-18 22:10:00,788::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Couldn't connect to VDSM within 60 seconds MainThread::ERROR::2022-02-18 22:10:00,788::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Couldn't connect to VDSM within 60 seconds MainThread::ERROR::2022-02-18 22:10:00,789::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run self._storage_broker_instance = self._get_storage_broker() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker return storage_broker.StorageBroker() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__ self._backend.connect() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 370, in connect connection = util.connect_vdsm_json_rpc(logger=self._logger) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc __vdsm_json_rpc_connect(logger, timeout) File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 415, in __vdsm_json_rpc_connect timeout=VDSM_MAX_RETRY * VDSM_DELAY RuntimeError: Couldn't connect to VDSM within 60 seconds
vdsm.log:
2022-02-18 22:14:43,939+0000 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:726)
2022-02-18 22:14:44,071+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48832 (protocoldetector:61)
2022-02-18 22:14:44,074+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:44,442+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48836 (protocoldetector:61)
2022-02-18 22:14:44,445+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:45,077+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48838 (protocoldetector:61)
2022-02-18 22:14:45,435+0000 INFO (periodic/2) [vdsm.api] START repoStats(domains=()) from=internal, task_id=2dd417e7-0f4f-4a09-a1af-725f267af135 (api:48)
2022-02-18 22:14:45,435+0000 INFO (periodic/2) [vdsm.api] FINISH repoStats return={} from=internal, task_id=2dd417e7-0f4f-4a09-a1af-725f267af135 (api:54)
2022-02-18 22:14:45,438+0000 WARN (periodic/2) [root] Failed to retrieve Hosted Engine HA info, is Hosted Engine setup finished? (api:194)
2022-02-18 22:14:45,447+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48840 (protocoldetector:61)
2022-02-18 22:14:45,449+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:46,082+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48842 (protocoldetector:61)
2022-02-18 22:14:46,084+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:46,452+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48844 (protocoldetector:61)
2022-02-18 22:14:46,455+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:47,087+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48846 (protocoldetector:61)
2022-02-18 22:14:47,089+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:47,457+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48848 (protocoldetector:61)
2022-02-18 22:14:47,459+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:48,092+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48850 (protocoldetector:61)
2022-02-18 22:14:48,094+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:48,461+0000 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:48852 (protocoldetector:61)
2022-02-18 22:14:48,464+0000 ERROR (Reactor thread) [ProtocolDetector.SSLHandshakeDispatcher] ssl handshake: SSLError, address: ::1 (sslutils:269)
2022-02-18 22:14:48,941+0000 INFO (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList(options=None) from=internal, task_id=75ef5d5f-c56b-4595-95c8-3dc64caa3a83 (api:48)
2022-02-18 22:14:48,942+0000 INFO (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=75ef5d5f-c56b-4595-95c8-3dc64caa3a83 (api:54)
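Those repeated SSLError lines are local clients (the HA broker and agent among them) failing the TLS handshake against VDSM. A quick way to confirm the host certificate itself has expired, as a sketch assuming VDSM is listening on its default port 54321, is to pull the served certificate's validity dates:

echo | openssl s_client -connect localhost:54321 2>/dev/null | openssl x509 -noout -dates

If notAfter is in the past, every local client of VDSM will fail the handshake exactly as the log shows.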
On Feb 18, 2022, at 16:35, Strahil Nikolov via Users <users@ovirt.org> wrote:
ovirt-2 is 'state=GlobalMaintenance', but the other 2 nodes are unknown. Try to start ovirt-ha-broker & ovirt-ha-agent on them (example commands below).
Also, you may try to move the hosted-engine VM to ovirt-2 and try again.
Best Regards, Strahil Nikolov
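For reference, a minimal sketch of that restart, assuming the standard ovirt-hosted-engine-ha unit names; run on each affected host, broker first:

systemctl restart ovirt-ha-broker
systemctl restart ovirt-ha-agent
systemctl --no-pager status ovirt-ha-broker ovirt-ha-agent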
On Fri, Feb 18, 2022 at 21:48, Joseph Gelinas <joseph@gelinas.cc> wrote:
The cluster may already be in maintenance mode; I did try to set it at the beginning of all this, but engine-setup doesn't see it. At this point my nodes say they can't connect to the HA daemon, or have stale data.
[root@ovirt-1 ~]# hosted-engine --set-maintenance --mode=global
Cannot connect to the HA daemon, please check the logs.

[root@ovirt-3 ~]# hosted-engine --set-maintenance --mode=global
Cannot connect to the HA daemon, please check the logs.

[root@ovirt-2 ~]# hosted-engine --set-maintenance --mode=global
[root@ovirt-2 ~]# hosted-engine --vm-status
!! Cluster is in GLOBAL MAINTENANCE mode !!
--== Host ovirt-1.xxxxxx.com (id: 1) status ==--
Host ID                            : 1
Host timestamp                     : 6750990
Score                              : 0
Engine status                      : unknown stale-data
Hostname                           : ovirt-1.xxxxxx.com
Local maintenance                  : False
stopped                            : True
crc32                              : 5290657b
conf_on_shared_storage             : True
local_conf_timestamp               : 6750950
Status up-to-date                  : False
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=6750990 (Thu Feb 17 22:17:53 2022)
    host-id=1
    score=0
    vm_conf_refresh_time=6750950 (Thu Feb 17 22:17:12 2022)
    conf_on_shared_storage=True
    maintenance=False
    state=AgentStopped
    stopped=True
--== Host ovirt-3.xxxxxx.com (id: 2) status ==--
Host ID                            : 2
Host timestamp                     : 6731526
Score                              : 0
Engine status                      : unknown stale-data
Hostname                           : ovirt-3.xxxxxx.com
Local maintenance                  : False
stopped                            : True
crc32                              : 12c6b5c9
conf_on_shared_storage             : True
local_conf_timestamp               : 6731486
Status up-to-date                  : False
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=6731526 (Thu Feb 17 15:29:37 2022)
    host-id=2
    score=0
    vm_conf_refresh_time=6731486 (Thu Feb 17 15:28:57 2022)
    conf_on_shared_storage=True
    maintenance=False
    state=AgentStopped
    stopped=True
--== Host ovirt-2.xxxxxx.com (id: 3) status ==--
Host ID                            : 3
Host timestamp                     : 6829853
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : ovirt-2.xxxxxx.com
Local maintenance                  : False
stopped                            : False
crc32                              : 0779c0b8
conf_on_shared_storage             : True
local_conf_timestamp               : 6829853
Status up-to-date                  : True
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=6829853 (Fri Feb 18 19:25:17 2022)
    host-id=3
    score=3400
    vm_conf_refresh_time=6829853 (Fri Feb 18 19:25:17 2022)
    conf_on_shared_storage=True
    maintenance=False
    state=GlobalMaintenance
    stopped=False
!! Cluster is in GLOBAL MAINTENANCE mode !!
ovirt-ha-agent on hosts 1 & 3 just keeps trying to restart:
MainThread::ERROR::2022-02-18 19:34:36,910::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2022-02-18 19:34:36,910::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2022-02-18 19:34:47,268::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.5 started
MainThread::INFO::2022-02-18 19:34:47,280::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::ERROR::2022-02-18 19:35:47,629::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 436, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 595, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 415, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 60 seconds
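The traceback shows the agent dying because it cannot reach VDSM, which is rejecting connections over the expired TLS certificate, so restarting the agent alone cannot fix it. A quick check on the host (the certificate path below is the usual oVirt default; adjust if your install differs):

systemctl --no-pager status vdsmd
openssl x509 -noout -enddate -in /etc/pki/vdsm/certs/vdsmcert.pem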
Ovirt-2's ovirt-hosted-engine-ha/agent.log has entries detecting global maintenance, although `systemctl status ovirt-ha-agent` still shows Python exception errors from yesterday.
MainThread::INFO::2022-02-18 19:39:10,452::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
MainThread::INFO::2022-02-18 19:39:10,524::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400)
Feb 17 18:49:12 ovirt-2.us1.vricon.com python3[1324125]: detected unhandled Python exception in '/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py'
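To see the full tracebacks behind that one-line summary, the systemd journal on the host is the place to look, for example:

journalctl -u ovirt-ha-agent --since "2022-02-17" --no-pager
journalctl -u ovirt-ha-broker --since "2022-02-17" --no-pager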

I know this thread is old, but I recently hit a similar situation and was able to resolve it, so I'm sharing my steps in case they help someone else.

In my environment I had a 4-node cluster where the apache cert (and host certs) had expired; the hosted-engine cert was still valid. Two hosts (node B and node D) would not go into Global Maintenance because they could not connect to the HA agent, and running engine-setup from within the hosted engine would error out saying the cluster was not in Global Maintenance.

I was able to tell Hosted Engine to forget about those two hosts. When you run "hosted-engine --vm-status", note the host IDs of the hosts that aren't in Global Maintenance. To remove those hosts (the ones unable to receive up-to-date config) from the hosted-engine availability list, I executed the following from a host that IS showing Global Maintenance (node C):

hosted-engine --clean-metadata --host-id=<host_id> --force-clean

e.g.:

hosted-engine --clean-metadata --host-id=4 --force-clean

After that, "hosted-engine --vm-status" is consistent. In my case, however, engine-setup _still_ errored out saying Global Maintenance was not set. Knowing that the remaining hosts were in fact in global maintenance, I issued:

engine-setup --otopi-environment=OVESETUP_CONFIG/continueSetupOnHEVM=bool:True --offline

This allowed the setup to run properly, and I was able to renew the certs and configure the hosts accordingly. The hosts whose metadata I had cleared were automatically re-added once I exited global maintenance. The cluster was fully recovered while still serving guests.

-Andrew
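Condensed into one place, Andrew's recovery sequence looks like the sketch below. The host ID is a placeholder, and the final --set-maintenance command to leave global maintenance afterwards is the standard one, not quoted from his message:

# On a host that does report state=GlobalMaintenance:
hosted-engine --vm-status        # note the IDs of hosts stuck with stale data
hosted-engine --clean-metadata --host-id=<stuck_host_id> --force-clean

# On the hosted-engine VM:
engine-setup --otopi-environment=OVESETUP_CONFIG/continueSetupOnHEVM=bool:True --offline

# Once engine-setup finishes and the certs are renewed:
hosted-engine --set-maintenance --mode=none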
participants (3): Joseph Gelinas, labattz@gmail.com, Strahil Nikolov