Hi all,
I had set a specific email address for alerts during deploy, and then I wanted to change it.
I did the following:
On one of the hosts I ran:
hosted-engine --set-shared-config destination-emails alerts@domain.com --type=broker
systemctl restart ovirt-ha-broker.service
I had to do the above since changing the email from the GUI did not have any effect.
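To verify the change actually landed on the shared storage (assuming the matching --get-shared-config option is available in your hosted-engine version), the value can be read back with:

hosted-engine --get-shared-config destination-emails --type=broker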
After the above, the emails are received at the new address, but the cluster seems to have trouble recognizing the state of the engine: I am flooded with "EngineMaybeAway-EngineUnexpectedlyDown" notification emails.
I have also restarted ovirt-ha-agent.service on each host.
I put the cluster into global maintenance and then disabled global maintenance again.
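For clarity, that was done with the standard maintenance-mode commands:

hosted-engine --set-maintenance --mode=global
hosted-engine --set-maintenance --mode=none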
In the host agent logs I see:
MainThread::ERROR::2018-02-18 11:12:20,751::hosted_engine::720::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) cannot get lock on host id 1: host already holds lock on a different host id
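If the sanlock state is useful for diagnosing the host-id conflict, I can dump the lockspaces on that host with the standard sanlock CLI:

sanlock client status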
Another host logs:
MainThread::INFO::2018-02-18 11:20:23,692::states::682::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Sun Feb 18 11:15:13 2018
MainThread::INFO::2018-02-18 11:20:23,692::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUnexpectedlyDown (score: 0)
The engine status on the three hosts is:
hosted-engine --vm-status
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : v0
Host ID : 1
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 0
stopped : False
Local maintenance : False
crc32 : cfd15dac
local_conf_timestamp : 4721144
Host timestamp : 4721144
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=4721144 (Sun Feb 18 11:20:33 2018)
host-id=1
score=0
vm_conf_refresh_time=4721144 (Sun Feb 18 11:20:33 2018)
conf_on_shared_storage=True
maintenance=False
state=EngineUnexpectedlyDown
stopped=False
timeout=Tue Feb 24 15:29:44 1970
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : v1
Host ID : 2
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score : 0
stopped : False
Local maintenance : False
crc32 : 5cbcef4c
local_conf_timestamp : 2499416
Host timestamp : 2499416
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=2499416 (Sun Feb 18 11:20:46 2018)
host-id=2
score=0
vm_conf_refresh_time=2499416 (Sun Feb 18 11:20:46 2018)
conf_on_shared_storage=True
maintenance=False
state=EngineUnexpectedlyDown
stopped=False
timeout=Thu Jan 29 22:18:42 1970
--== Host 3 status ==--
conf_on_shared_storage : True
Status up-to-date : False
Hostname : v2
Host ID : 3
Engine status : unknown stale-data
Score : 3400
stopped : False
Local maintenance : False
crc32 : f064d529
local_conf_timestamp : 2920612
Host timestamp : 2920611
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=2920611 (Sun Feb 18 10:47:31 2018)
host-id=3
score=3400
vm_conf_refresh_time=2920612 (Sun Feb 18 10:47:32 2018)
conf_on_shared_storage=True
maintenance=False
state=GlobalMaintenance
stopped=False
Putting each host into maintenance and then activating it again does not resolve the issue. It seems I should have avoided defining an email address during deploy and only set it later from the GUI.
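One option I found in the hosted-engine documentation, but have not dared to run yet, is cleaning the HA metadata host by host with the agent stopped, along these lines (the host id placeholder is mine):

systemctl stop ovirt-ha-agent.service
hosted-engine --clean-metadata --host-id=<host id>
systemctl start ovirt-ha-agent.service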
How can one recover from this situation?
Thanx,
Alex