On March 27, 2020 5:06:16 PM GMT+02:00, "Wood, Randall"
<rwood(a)forcepoint.com> wrote:
I have a three-node oVirt cluster where one node has stale data for the
hosted engine, but the other two nodes do not:
Output of `hosted-engine --vm-status` on a good node:
```
!! Cluster is in GLOBAL MAINTENANCE mode !!
--== Host ovirt2.low.mdds.tcs-sec.com (id: 1) status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : ovirt2.low.mdds.tcs-sec.com
Host ID : 1
Engine status : {"health": "good",
"vm": "up",
"detail": "Up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : f91f57e4
local_conf_timestamp : 9915242
Host timestamp : 9915241
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=9915241 (Fri Mar 27 14:38:14 2020)
host-id=1
score=3400
vm_conf_refresh_time=9915242 (Fri Mar 27 14:38:14 2020)
conf_on_shared_storage=True
maintenance=False
state=GlobalMaintenance
stopped=False
--== Host ovirt1.low.mdds.tcs-sec.com (id: 2) status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : ovirt1.low.mdds.tcs-sec.com
Host ID : 2
Engine status : {"reason": "vm not running on this
host", "health": "bad", "vm": "down",
"detail": "unknown"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 48f9c0fc
local_conf_timestamp : 9218845
Host timestamp : 9218845
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=9218845 (Fri Mar 27 14:38:22 2020)
host-id=2
score=3400
vm_conf_refresh_time=9218845 (Fri Mar 27 14:38:22 2020)
conf_on_shared_storage=True
maintenance=False
state=GlobalMaintenance
stopped=False
--== Host ovirt3.low.mdds.tcs-sec.com (id: 3) status ==--
conf_on_shared_storage : True
Status up-to-date : False
Hostname : ovirt3.low.mdds.tcs-sec.com
Host ID : 3
Engine status : unknown stale-data
Score : 3400
stopped : False
Local maintenance : False
crc32 : 620c8566
local_conf_timestamp : 1208310
Host timestamp : 1208310
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=1208310 (Mon Dec 16 21:14:24 2019)
host-id=3
score=3400
vm_conf_refresh_time=1208310 (Mon Dec 16 21:14:24 2019)
conf_on_shared_storage=True
maintenance=False
state=GlobalMaintenance
stopped=False
!! Cluster is in GLOBAL MAINTENANCE mode !!
```
I tried the steps in
https://access.redhat.com/discussions/3511881, but
`hosted-engine --vm-status` on the node with stale data shows:
```
The hosted engine configuration has not been retrieved from shared
storage. Please ensure that ovirt-ha-agent is running and the storage
server is reachable.
```
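The basic sanity checks on the stale node would look roughly like this (a sketch; the service names are the standard oVirt HA systemd units):
```
# Are the HA services actually staying up? (they keep restarting here)
systemctl status ovirt-ha-broker ovirt-ha-agent

# Recent logs for both units, to see why they restart
journalctl -u ovirt-ha-broker -u ovirt-ha-agent --since "10 minutes ago"

# Is the hosted-engine gluster storage domain mounted at all?
mount | grep glusterSD
```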
On the stale node, ovirt-ha-agent and ovirt-ha-broker are continually
restarting. Since the agent seems to depend on the broker, here is the
relevant snippet from the broker log, repeating roughly every 3 seconds:
```
MainThread::INFO::2020-03-27 15:01:06,584::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.3.6 started
MainThread::INFO::2020-03-27 15:01:06,584::monitor::40::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2020-03-27 15:01:06,585::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2020-03-27 15:01:06,585::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2020-03-27 15:01:06,585::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2020-03-27 15:01:06,587::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-03-27 15:01:06,587::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2020-03-27 15:01:06,587::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2020-03-27 15:01:06,588::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2020-03-27 15:01:06,588::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2020-03-27 15:01:06,589::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2020-03-27 15:01:06,590::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2020-03-27 15:01:06,590::monitor::49::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2020-03-27 15:01:06,590::monitor::50::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
MainThread::INFO::2020-03-27 15:01:06,678::storage_backends::373::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage
MainThread::INFO::2020-03-27 15:01:06,678::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2020-03-27 15:01:06,717::storage_server::356::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2020-03-27 15:01:06,732::storage_server::413::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::WARNING::2020-03-27 15:01:08,940::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: [Errno 5] Input/output error: '/rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/hosted-engine.lockspace'
```
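A direct check of that path outside the broker would show whether the I/O error comes from the gluster mount itself. A rough sketch (the volume name "engine" is only inferred from the ovirt2:_engine mount point):
```
# Path taken from the broker log above; reading it directly shows whether
# the EIO comes from the gluster mount rather than from the HA services
ls -l /rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/
cat /rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/hosted-engine.lockspace > /dev/null

# If the gluster volume is named "engine", check for pending heals
# and split-brain entries on it
gluster volume heal engine info
gluster volume heal engine info split-brain
```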
I restarted the stale node yesterday, but it still shows stale data
from December of last year.
What is the recommended way for me to try to recover from this?
(This came to my attention when warnings concerning space on the
/var/log partition began popping up.)
Thank you,
Randall
Hey Randall,
This is the key:
Can't connect vdsm storage: [Errno 5] Input/output error:
'/rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/hosted-engine.lockspace'
Go to that folder and check the links. You can actually remove them and the broker will recreate them.
Sometimes (when using Gluster) there can be a split-brain; in that case, just remove
the links on the offending brick and the broker will be able to access or recreate the
link.
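Spelled out as a rough sketch on the stale node (the path is the one from the broker log, and the volume name "engine" is assumed from the ovirt2:_engine mount; verify both before removing anything):
```
# Stop the flapping HA services, remove the broken links so the broker can
# recreate them, then start the services again
systemctl stop ovirt-ha-agent ovirt-ha-broker
cd /rhev/data-center/mnt/glusterSD/ovirt2:_engine/182a4a94-743f-4941-89c1-dc2008ae1cf5/ha_agent/
ls -l   # hosted-engine.lockspace and hosted-engine.metadata are normally symlinks here
rm -f hosted-engine.lockspace hosted-engine.metadata
systemctl start ovirt-ha-broker ovirt-ha-agent
hosted-engine --vm-status

# If gluster reports a split-brain on these files, remove the copies on the
# offending brick (on that gluster server's brick path), not on the fuse mount
gluster volume heal engine info split-brain
```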
Best Regards,
Strahil Nikolov