I have a very similar problem after updating one of the two nodes to version 4.3.1. The updated node, node77-02, lost its connection to the gluster volume named DATA, but not to the volume with the hosted engine.
Mar 18 13:40:00 node77-02 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed scanning for OVF_STORE due to Command Volume.getInfo with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': '2ee71105-1810-46eb-9388-cc6caccf9fac', 'volumeID': u'224e4b80-2744-4d7f-bd9f-43eb8fe6cf11', 'imageID': u'43b75b50-cad4-411f-8f51-2e99e52f4c77'} failed:#012(code=201, message=Volume does not exist: (u'224e4b80-2744-4d7f-bd9f-43eb8fe6cf11',))
Mar 18 13:40:00 node77-02 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
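In case it helps with debugging: the failing Volume.getInfo call can be replayed by hand from the affected node to see whether the OVF_STORE volume is really missing or just unreachable from node77-02. This is only a sketch, assuming the vdsm-client package is installed; the UUIDs are the ones from the journal entry above:

  vdsm-client Volume getInfo \
      storagepoolID=00000000-0000-0000-0000-000000000000 \
      storagedomainID=2ee71105-1810-46eb-9388-cc6caccf9fac \
      imageID=43b75b50-cad4-411f-8f51-2e99e52f4c77 \
      volumeID=224e4b80-2744-4d7f-bd9f-43eb8fe6cf11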
The HostedEngine VM works fine on all nodes, but node77-02 fails with the following error in the web UI:
ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5a5cca91-01f8-01af-0297-00000000025f, msdUUID=7d5de684-58ff-4fbc-905d-3048fc55b2b1'
node77-02 vdsm.log:
2019-03-18 13:51:46,287+0300 WARN (jsonrpc/7) [storage.StorageServer.MountConnection] gluster server u'msk-gluster-facility.xxxx' is not in bricks ['node-msk-gluster203', 'node-msk-gluster205', 'node-msk-gluster201'], possibly mounting duplicate servers (storageServer:317)
2019-03-18 13:51:46,287+0300 INFO (jsonrpc/7) [storage.Mount] mounting msk-gluster-facility.ipt.fsin.uis:/data at /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data (mount:204)
2019-03-18 13:51:46,474+0300 ERROR (jsonrpc/7) [storage.HSM] Could not connect to storageServer (hsm:2415)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2412, in connectStorageServer
conObj.connect()
File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 179, in connect
six.reraise(t, v, tb)
File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 171, in connect
self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 207, in mount
cgroup=cgroup)
File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
return callMethod()
File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
**kwargs)
File "<string>", line 2, in mount
File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
raise convert_to_error(kind, result)
MountError: (1, ';Running scope as unit run-10121.scope.\nMount failed. Please check the log file for more details.\n')
------------------------------
2019-03-18 13:51:46,830+0300 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='fe81642e-2421-4169-a08b-51467e8f01fe') Unexpected error (task:875)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
return fn(*args, **kargs)
File "<string>", line 2, in connectStoragePool
File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
ret = func(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1035, in connectStoragePool
spUUID, hostID, msdUUID, masterVersion, domainsMap)
File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1097, in _connectStoragePool
res = pool.connect(hostID, msdUUID, masterVersion)
File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 700, in connect
self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1274, in __rebuild
self.setMasterDomain(msdUUID, masterVersion)
File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1495, in setMasterDomain
raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: u'spUUID=5a5cca91-01f8-01af-0297-00000000025f, msdUUID=7d5de684-58ff-4fbc-905d-3048fc55b2b1'
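The MountError higher up looks like the real cause: the glusterfs mount of the data volume never succeeds, so connectStoragePool cannot find the master domain. What I would check by hand (a rough sketch, using the server and volume names from the log above; the mount point /mnt/test is just an example):

  # on one of the gluster nodes: is the volume started, are all bricks up?
  gluster volume info data
  gluster volume status data

  # on node77-02: try the same mount manually and look at the gluster client log
  mkdir -p /mnt/test
  mount -t glusterfs msk-gluster-facility.ipt.fsin.uis:/data /mnt/test
  tail -n 50 /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log
  umount /mnt/test

The warning that the mount server is not in the brick list ['node-msk-gluster203', 'node-msk-gluster205', 'node-msk-gluster201'] may also be worth a look, in case the server name used for mounting no longer matches the peers of the volume.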
I set global maintenance again, defined the HostedEngine from the old XML (taken from an old vdsm log), defined the network, and powered it off.
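Roughly, that was the following (the paths are just examples; note that virsh on an oVirt node asks for the libvirt SASL credentials):

  hosted-engine --set-maintenance --mode=global
  virsh define /root/HostedEngine.xml          # XML recovered from an old vdsm.log
  virsh net-define /root/vdsm-ovirtmgmt.xml    # the VM network definition, if it is missing
  virsh shutdown HostedEngine                  # power the VM off again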
I set the OVF update period to 5 minutes, but it took several hours until the OVF_STORE volumes were updated. Once that happened, I restarted ovirt-ha-agent and ovirt-ha-broker on both nodes, then powered off the HostedEngine and undefined it from ovirt1.
Then I set maintenance to 'none' and the VM started on ovirt1.
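For reference, that boils down to something like this (the OVF interval is set on the engine VM, the rest on the hosts; I believe the config key is OvfUpdateIntervalInMinutes, but double-check it with engine-config -l):

  # on the engine VM
  engine-config -s OvfUpdateIntervalInMinutes=5
  systemctl restart ovirt-engine

  # on both hosts, once the OVF_STORE volumes have been refreshed
  systemctl restart ovirt-ha-agent ovirt-ha-broker

  # on ovirt1, after powering the engine VM off
  virsh undefine HostedEngine
  hosted-engine --set-maintenance --mode=none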
To test a failure, I removed global maintenance and powered off the HostedEngine from inside the VM (via ssh). It was brought back up on the other node.
To test a failure of ovirt2, I put ovirt1 into local maintenance, then removed it (mode 'none'), shut the VM down again via ssh, and it started again on ovirt1.
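The failover tests themselves were nothing more than the following (run on the hosts, except the poweroff, which goes to the engine VM over ssh; 'engine' stands for the engine VM's FQDN):

  hosted-engine --set-maintenance --mode=none    # make sure HA is active
  ssh root@engine 'poweroff'                     # shut the engine down from inside
  hosted-engine --vm-status                      # watch the agents bring it up on another host

  hosted-engine --set-maintenance --mode=local   # on ovirt1, to simulate that host being unavailable
  hosted-engine --set-maintenance --mode=none    # and back to normal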
It seems to be working: I have since shut down the Engine several times and it managed to start without issues.
I'm not sure this is related, but I had noticed that ovirt2 was out of sync on the vdsm-ovirtmgmt network; that was easily fixed via the UI.