ovirt-2 is 'state=GlobalMaintenance', but the other 2 nodes are unknown. Try to
start ovirt-ha-broker & ovirt-ha-agent.
Also, you may try to move the hosted-engine to ovirt-2 and try again.
Best Regards,
Strahil Nikolov
On Fri, Feb 18, 2022 at 21:48, Joseph Gelinas <joseph(a)gelinas.cc> wrote:
I may be in maintenance mode. I did try to set it at the beginning of this, but
engine-setup doesn't see it. At this point my nodes say they either can't connect to the
HA daemon or have stale data.
[root@ovirt-1 ~]# hosted-engine --set-maintenance --mode=global
Cannot connect to the HA daemon, please check the logs.
[root@ovirt-3 ~]# hosted-engine --set-maintenance --mode=global
Cannot connect to the HA daemon, please check the logs.
[root@ovirt-2 ~]# hosted-engine --set-maintenance --mode=global
[root@ovirt-2 ~]# hosted-engine --vm-status
!! Cluster is in GLOBAL MAINTENANCE mode !!
--== Host ovirt-1.xxxxxx.com (id: 1) status ==--
Host ID : 1
Host timestamp : 6750990
Score : 0
Engine status : unknown stale-data
Hostname : ovirt-1.xxxxxx.com
Local maintenance : False
stopped : True
crc32 : 5290657b
conf_on_shared_storage : True
local_conf_timestamp : 6750950
Status up-to-date : False
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=6750990 (Thu Feb 17 22:17:53 2022)
host-id=1
score=0
vm_conf_refresh_time=6750950 (Thu Feb 17 22:17:12 2022)
conf_on_shared_storage=True
maintenance=False
state=AgentStopped
stopped=True
--== Host ovirt-3.xxxxxx.com (id: 2) status ==--
Host ID : 2
Host timestamp : 6731526
Score : 0
Engine status : unknown stale-data
Hostname : ovirt-3.xxxxxx.com
Local maintenance : False
stopped : True
crc32 : 12c6b5c9
conf_on_shared_storage : True
local_conf_timestamp : 6731486
Status up-to-date : False
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=6731526 (Thu Feb 17 15:29:37 2022)
host-id=2
score=0
vm_conf_refresh_time=6731486 (Thu Feb 17 15:28:57 2022)
conf_on_shared_storage=True
maintenance=False
state=AgentStopped
stopped=True
--== Host ovirt-2.xxxxxx.com (id: 3) status ==--
Host ID : 3
Host timestamp : 6829853
Score : 3400
Engine status : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname : ovirt-2.xxxxxx.com
Local maintenance : False
stopped : False
crc32 : 0779c0b8
conf_on_shared_storage : True
local_conf_timestamp : 6829853
Status up-to-date : True
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=6829853 (Fri Feb 18 19:25:17 2022)
host-id=3
score=3400
vm_conf_refresh_time=6829853 (Fri Feb 18 19:25:17 2022)
conf_on_shared_storage=True
maintenance=False
state=GlobalMaintenance
stopped=False
!! Cluster is in GLOBAL MAINTENANCE mode !!
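With three hosts the status dump above is long, so here is a minimal sketch (not part of oVirt) that reduces pasted `hosted-engine --vm-status` text to a per-host dictionary and lists the hosts whose data is stale. The function names `parse_vm_status` and `stale_hosts` are hypothetical, and the sample text is a shortened copy of the output above with each header on a single line.

```python
import re

# Matches the per-host header, e.g. "--== Host ovirt-1.xxxxxx.com (id: 1) status ==--"
HOST_HEADER = re.compile(r"^--== Host\s+(\S+)\s+\(id:\s*(\d+)\)\s+status ==--$")
# Matches "Key : value" rows; keys never contain '=' so metadata lines are skipped.
FIELD = re.compile(r"^([^:=]+?)\s*:\s*(.*)$")


def parse_vm_status(text):
    """Return {hostname: {field: value}} from `hosted-engine --vm-status` text."""
    hosts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HOST_HEADER.match(line)
        if header:
            current = hosts.setdefault(header.group(1), {"id": header.group(2)})
            continue
        field = FIELD.match(line)
        if current is not None and field and field.group(2):
            current[field.group(1)] = field.group(2)
    return hosts


def stale_hosts(hosts):
    """Hostnames whose 'Status up-to-date' field is anything other than True."""
    return [name for name, f in hosts.items() if f.get("Status up-to-date") != "True"]


SAMPLE = """\
!! Cluster is in GLOBAL MAINTENANCE mode !!
--== Host ovirt-1.xxxxxx.com (id: 1) status ==--
Host ID : 1
Score : 0
Engine status : unknown stale-data
Status up-to-date : False
Extra metadata (valid at timestamp):
timestamp=6750990 (Thu Feb 17 22:17:53 2022)
--== Host ovirt-2.xxxxxx.com (id: 3) status ==--
Host ID : 3
Score : 3400
Engine status : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Status up-to-date : True
"""

if __name__ == "__main__":
    print(stale_hosts(parse_vm_status(SAMPLE)))
```

In this thread it would flag ovirt-1 and ovirt-3, the two hosts whose agents are stopped.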
Ovirt-ha-agent on 1&3 just keeps trying to restart:
MainThread::ERROR::2022-02-18 19:34:36,910::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2022-02-18 19:34:36,910::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2022-02-18 19:34:47,268::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.5 started
MainThread::INFO::2022-02-18 19:34:47,280::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::ERROR::2022-02-18 19:35:47,629::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 436, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 595, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 415, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 60 seconds
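To see how often the agent restarts without reading the whole log, a rough sketch like the following splits agent.log entries into fields and keeps only the ERROR records. The field layout is inferred from the lines quoted above rather than from any oVirt documentation, and `parse_line` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Layout inferred from the excerpt:
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
LOG_LINE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]+)\) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict for one log entry, or None for continuation lines."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None  # e.g. the indented lines of a traceback
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S,%f")
    record["lineno"] = int(record["lineno"])
    return record


SAMPLE = """\
MainThread::ERROR::2022-02-18 19:34:36,910::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2022-02-18 19:34:47,268::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.5 started
MainThread::ERROR::2022-02-18 19:35:47,629::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
"""

if __name__ == "__main__":
    records = [r for r in map(parse_line, SAMPLE.splitlines()) if r]
    for rec in records:
        if rec["level"] == "ERROR":
            print(rec["time"], rec["func"], rec["message"])
```

Traceback lines do not match the pattern and are dropped, so the output is one line per logged error.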
Ovirt-2's ovirt-hosted-engine-ha/agent.log has entries detecting global maintenance,
though `systemctl status ovirt-ha-agent` shows Python exception errors from yesterday.
MainThread::INFO::2022-02-18 19:39:10,452::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
MainThread::INFO::2022-02-18 19:39:10,524::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400)
Feb 17 18:49:12 ovirt-2.us1.vricon.com python3[1324125]: detected unhandled Python exception in '/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py'
On Feb 18, 2022, at 14:20, Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
To set the engine into maintenance mode you can ssh to any hypervisor and run:
'hosted-engine --set-maintenance --mode=global'
Wait 1 minute and run 'hosted-engine --vm-status' to validate.
Best Regards,
Strahil Nikolov
On Fri, Feb 18, 2022 at 19:03, Joseph Gelinas <joseph(a)gelinas.cc> wrote:
Hi,
The certificates on our oVirt stack recently expired. While all the VMs are still up, I
can't put the cluster into global maintenance via ovirt-engine, or do anything via
ovirt-engine for that matter. I just get event logs about cert validity.
VDSM ovirt-1.xxxxx.com command Get Host Capabilities failed: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
VDSM ovirt-2.xxxxx.com command Get Host Capabilities failed: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
VDSM ovirt-3.xxxxx.com command Get Host Capabilities failed: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
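A "validity check failed" error like the ones above usually means a certificate is outside its validity window. As a minimal sketch, assuming the `openssl` CLI is installed, the following reads a certificate's `notAfter` date via `openssl x509 -noout -enddate` and reports the days remaining. The helper names are hypothetical, the dates in the usage lines are made up for illustration, and the certificate path is whatever you pass in (on oVirt hosts the VDSM certificate is commonly /etc/pki/vdsm/certs/vdsmcert.pem, but treat that path as an assumption).

```python
import calendar
import ssl
import subprocess
import time

PREFIX = "notAfter="


def parse_enddate(line):
    """Turn 'notAfter=Feb 17 00:00:00 2022 GMT' into a UTC epoch timestamp."""
    line = line.strip()
    if not line.startswith(PREFIX):
        raise ValueError("expected a line starting with %r" % PREFIX)
    # ssl.cert_time_to_seconds understands the date format openssl prints.
    return ssl.cert_time_to_seconds(line[len(PREFIX):])


def days_until_expiry(line, now=None):
    """Days until the certificate expires; negative if it already has."""
    now = time.time() if now is None else now
    return (parse_enddate(line) - now) / 86400.0


def read_enddate(cert_path):
    """Ask openssl for the notAfter line of a PEM certificate file."""
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate", "-in", cert_path],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6, as on these hosts
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Made-up dates: a cert that expired one day before "now".
    now = calendar.timegm((2022, 2, 18, 0, 0, 0))
    print(days_until_expiry("notAfter=Feb 17 00:00:00 2022 GMT", now=now))
```

Running `days_until_expiry(read_enddate(path))` against each certificate on the engine and the hosts would show which ones have actually lapsed.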
Under Compute -> Hosts, all are status Unassigned. Default data center is status Non
Responsive.
I have tried a couple of solutions to regenerate the certificates without much luck and
have copied the originals back in place.
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.3/...
https://access.redhat.com/solutions/2409751
I have seen posts saying that running engine-setup will generate new certs. However, the
engine doesn't think the cluster is in global maintenance, so it won't run. I believe I
can get around the check with `engine-setup
--otopi-environment=OVESETUP_CONFIG/continueSetupOnHEVM=bool:True`, but is that the right
thing to do? Will it deploy the certs onto the hosts as well so things communicate
properly? It looks like one is supposed to put a node into maintenance and re-enroll it
after running engine-setup, but will it even be able to put the nodes into maintenance,
given that I can't do anything with them now?
Appreciate any ideas.
_______________________________________________
Users mailing list -- users(a)ovirt.org
To unsubscribe send an email to users-leave(a)ovirt.org
Privacy Statement:
https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct:
https://www.ovirt.org/community/about/community-guidelines/
List Archives:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/QCFPKQ3OKPO...