Hi,
We have a six-node oVirt setup with compatibility version 4.3, running on CentOS 7.6.1810.
Recently we found some very interesting log entries on a few of our nodes.
/var/log/ovirt-hosted-engine-ha/broker.log:
MainThread::INFO::2020-08-27 12:51:50,279::storage_backends::345::ovirt_hosted_engine_ha.lib.storage_backends::(connect) Connecting the storage
MainThread::INFO::2020-08-27 12:51:50,280::storage_server::349::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::warning::2020-08-27 12:51:50,284::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: 'NoneType' object has no attribute 'startswith'
/var/log/messages:
Aug 27 12:52:11 ildeco25 vdsm[6909]: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?
We inherited this setup from the previous team, but I'm fairly certain the
Hosted Engine setup is finished.
Also, that vdsm error line is not present on all hosts.
Should we worry about the ovirt-ha-broker service not running on some hosts?
host24 Active: inactive (dead)
host25 Active: failed (Result: start-limit) since Thu 2020-08-27 12:56:13 CEST; 2s ago
host26 Active: active (running) since Sun 2020-08-23 22:55:01 CEST; 3 days ago
host27 Active: inactive (dead)
host28 Active: failed (Result: start-limit) since Thu 2020-08-27 12:56:18 CEST; 4s ago
host29 Active: inactive (dead)
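For reference, the status above was gathered roughly like this (a quick loop; it assumes passwordless SSH from an admin box, and the host names are of course ours):

    # print the Active: line of ovirt-ha-broker for every host
    for h in host24 host25 host26 host27 host28 host29; do
        echo "== $h =="
        ssh "$h" "systemctl status ovirt-ha-broker | grep 'Active:'"
    done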
Currently the hosted engine VM is running on host26.
I'd be very grateful for any help.
Kind regards,
Michal Bielejewski
It would be interesting to know how the previous team got to six nodes: I don't
remember seeing any documentation on how to do that easily...
However, this state of affairs also seems to be quite normal whenever I reboot a
single-node HCI setup: I've seen it with two systems now, one running 4.3.11 on
CentOS 7.8, the other 4.4.1 on CentOS 8.2.
What seems to happen in my case is some sort of race condition or timeout:
ovirt-ha-broker, ovirt-ha-agent and vdsmd all fail in various ways because
glusterd isn't showing perfect connectivity between all storage nodes (and in my
case it even fails to be perfect when there is only one node...)
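A quick way to see that connectivity problem is the standard Gluster CLI (nothing oVirt-specific here):

    gluster peer status          # every peer should report "Peer in Cluster (Connected)"
    gluster volume status all    # every brick and self-heal daemon should show "Y" in the Online column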
I tend to restart glusterd carefully on any node that is shown as disconnected or
not up (gluster volume status all), and once that looks perfect and any gluster
heals are through, I restart ovirt-ha-broker, ovirt-ha-agent and vdsmd, nice and
slow and in no particular order; I just have a look via systemctl status <name>
to see whether they stop complaining or stopping.
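In shell terms the sequence is roughly this (a sketch; "engine" stands in for whatever your Gluster volume is actually called):

    systemctl restart glusterd                  # only on nodes shown as disconnected / not up
    gluster volume heal engine info             # wait until every brick reports "Number of entries: 0"
    systemctl restart ovirt-ha-broker ovirt-ha-agent vdsmd
    systemctl status ovirt-ha-broker ovirt-ha-agent vdsmd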
In the meantime I check with hosted-engine --vm-status on all nodes to see
whether that "Is the Hosted Engine setup finished?" message disappears; with a
bit of patience, the HA score tends to come back. You might also want to make
sure that none of the nodes are in local maintenance and that the data center as
a whole is not in global maintenance.
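Both conditions are visible from the same command, so something like this on any one host will do (standard hosted-engine CLI):

    hosted-engine --vm-status                     # look for the "!! Cluster is in GLOBAL MAINTENANCE mode !!" banner and the per-host score
    hosted-engine --set-maintenance --mode=none   # clears maintenance again; run it on a host that was set to local maintenance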
Let me tell you, I pulled out a lot of hair when I started with oVirt, because I
tend to expect an immediate reaction to any command I give. But there is so much
automation going on in the background that commands are really more like a bit of
grease on the cogs of a giant gearbox, and most of the time it just works
automagically.