I have a three-node oVirt cluster in which one node reports stale data for the hosted engine while
the other two nodes do not.
Output of `hosted-engine --vm-status` on a good node:
```
!! Cluster is in GLOBAL MAINTENANCE mode !!
--== Host ovirt2.low.mdds.tcs-sec.com (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt2.low.mdds.tcs-sec.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : f91f57e4
local_conf_timestamp               : 9915242
Host timestamp                     : 9915241
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=9915241 (Fri Mar 27 14:38:14 2020)
        host-id=1
        score=3400
        vm_conf_refresh_time=9915242 (Fri Mar 27 14:38:14 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

--== Host ovirt1.low.mdds.tcs-sec.com (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : ovirt1.low.mdds.tcs-sec.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 48f9c0fc
local_conf_timestamp               : 9218845
Host timestamp                     : 9218845
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=9218845 (Fri Mar 27 14:38:22 2020)
        host-id=2
        score=3400
        vm_conf_refresh_time=9218845 (Fri Mar 27 14:38:22 2020)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False

--== Host ovirt3.low.mdds.tcs-sec.com (id: 3) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : ovirt3.low.mdds.tcs-sec.com
Host ID                            : 3
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 620c8566
local_conf_timestamp               : 1208310
Host timestamp                     : 1208310
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1208310 (Mon Dec 16 21:14:24 2019)
        host-id=3
        score=3400
        vm_conf_refresh_time=1208310 (Mon Dec 16 21:14:24 2019)
        conf_on_shared_storage=True
        maintenance=False
        state=GlobalMaintenance
        stopped=False
!! Cluster is in GLOBAL MAINTENANCE mode !!
```
I tried the steps in
https://access.redhat.com/discussions/3511881, but `hosted-engine
--vm-status` on the node with stale data shows:
```
The hosted engine configuration has not been retrieved from shared storage. Please ensure
that ovirt-ha-agent is running and the storage server is reachable.
```
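To rule out the obvious first, I checked the HA services on that node. This is roughly what I ran
(a quick sketch from memory; output omitted, nothing here beyond standard systemd commands):
```
# On ovirt3, the node reporting stale data:
systemctl status ovirt-ha-broker ovirt-ha-agent   # both keep restarting (see below)
systemctl restart ovirt-ha-broker ovirt-ha-agent  # broker first, since the agent seems to depend on it
journalctl -u ovirt-ha-broker -n 50               # recent broker messages, same as the log snippet below
```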
On the stale node, ovirt-ha-agent and ovirt-ha-broker are continually restarting. Since the agent
appears to depend on the broker, I looked at the broker log first; it contains this snippet,
repeating roughly every 3 seconds:
```
MainThread::INFO::2020-03-27 15:01:06,584::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
MainThread::INFO::2020-03-27 15:01:06,584::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2020-03-27 15:01:06,585::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2020-03-27 15:01:06,585::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2020-03-27 15:01:06,585::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2020-03-27 15:01:06,587::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-03-27 15:01:06,587::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2020-03-27 15:01:06,587::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2020-03-27 15:01:06,588::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2020-03-27 15:01:06,588::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2020-03-27 15:01:06,590::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2020-03-27 15:01:06,590::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2020-03-27 15:01:06,590::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
MainThread::INFO::2020-03-27 15:01:06,678::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage
MainThread::INFO::2020-03-27 15:01:06,678::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2020-03-27 15:01:06,717::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2020-03-27 15:01:06,732::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::WARNING::2020-03-27 15:01:08,940::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: [Errno 5] Input/output error: '/rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/hosted-engine.lockspace'
```
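Given the I/O error on the lockspace file, I assume the next thing to check is the health of the
gluster-backed storage domain on that node. This is the sketch I intend to follow (the paths come
from the error above; the volume name `engine` is my assumption based on the mount point, so
adjust if that's wrong):
```
# On ovirt3: is the hosted-engine storage domain even readable?
df -h /rhev/data-center/mnt/glusterSD/ovirt2:_engine
ls -l /rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/

# On any gluster node: pending heals or split-brain entries on the (assumed) engine volume?
gluster volume status engine
gluster volume heal engine info
gluster volume heal engine info split-brain
```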
I rebooted the stale node yesterday, but it still reports stale data from December of last year
(see the timestamps above).
What is the recommended way to recover from this?
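In case it helps, this is what I was planning to try next, based on my (possibly wrong) reading of
the discussion linked above: clear the stale metadata slot for host 3 and let the agent rebuild it.
The exact flags are my own reconstruction, so please correct me if this is not the right procedure:
```
# Cluster is already in global maintenance; on ovirt3, stop the HA services:
systemctl stop ovirt-ha-agent ovirt-ha-broker

# From one of the healthy nodes, drop the stale metadata for host id 3:
hosted-engine --clean-metadata --host-id=3 --force-clean

# Back on ovirt3, bring the HA services up again and watch the status:
systemctl start ovirt-ha-broker ovirt-ha-agent
hosted-engine --vm-status

# If the lockspace itself is corrupted, I understand there is also
# "hosted-engine --reinitialize-lockspace" (with all agents stopped),
# but I'd rather not touch that without advice.
```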
(This came to my attention when warnings concerning space on the /var/log partition began
popping up.)
Thank you,
Randall