Hi,
I am running a 3-node oVirt cluster with a hosted engine. Unfortunately, I had a small
issue on the NFS server that provides the shared storage for the cluster. During the
outage, all VMs were paused and the cluster itself (the hosted engine) went down.
After I restored the NFS service (which took 2 minutes), the cluster did not recover. The HA agent
no longer seems able to make sense of the lockspace, and it fails on all nodes.
There are several errors in the logs, but the main one (I think) is in broker.log:
MainThread::WARNING::2022-05-01
22:06:06,085::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__)
Can't connect vdsm storage: Command Image.prepare with args {'imageID':
'c1b7e131-bbde-416b-b2a0-de746a039dfd', 'storagepoolID':
'00000000-0000-0000-0000-000000000000', 'volumeID':
'b221ea37-2c59-49bf-89f7-83766fb53717', 'storagedomainID':
'e3b467ec-fdfc-4c7a-9725-8a6d1fe18c6d'} failed:
(code=201, message=Volume does not exist:
(u'b221ea37-2c59-49bf-89f7-83766fb53717',))
That is odd, because the volume does seem to exist:
ls -l /rhev/data-center/mnt/nas.fritz.box:_mnt_HD_HD__a2_hosted__engine__nas/e3b467ec-fdfc-4c7a-9725-8a6d1fe18c6d/images/c1b7e131-bbde-416b-b2a0-de746a039dfd
total 1049608
-rw-rw----. 1 vdsm kvm 1073741824 May 1 15:35 b221ea37-2c59-49bf-89f7-83766fb53717
-rw-rw----. 1 vdsm kvm 1048576 Dec 21 2020 b221ea37-2c59-49bf-89f7-83766fb53717.lease
-rw-rw-rw-. 1 vdsm kvm 329 Dec 21 2020 b221ea37-2c59-49bf-89f7-83766fb53717.meta
When I try to reinitialize the lockspace (after stopping the agent, etc.), I get:
----
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 60, in connect
    self.sock.connect(base64.b16decode(self.host))
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 2] No such file or directory
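A note on reading this traceback: Errno 2 (ENOENT) on a socket connect usually means that the UNIX socket file at the target path does not exist, which would be consistent with the broker not running or not having created its socket. The following is only a minimal Python 3 sketch that reproduces the same errno. The socket path and the function name are made up for illustration and are not the real broker socket or any oVirt API.

```python
import errno
import os
import socket
import tempfile


def try_unix_connect(path):
    """Try to connect to a UNIX stream socket; return 0 on success, else the errno."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()


# Hypothetical path inside a fresh temporary directory, so no socket file exists there.
with tempfile.TemporaryDirectory() as tmpdir:
    missing_socket = os.path.join(tmpdir, "no-such-broker.socket")
    result = try_unix_connect(missing_socket)
    print(result, errno.errorcode.get(result))
```

If the same check against the real socket path also returns ENOENT, the problem is probably on the listening side (the service that should own the socket) rather than in the client that raised the error.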
Is there a way to do this manually and also recreate the lockspace volume?
I already tried to recreate the lockspace manually with:
sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/nas.fritz.box:_mnt_HD_HD__a2_hosted__engine__nas/e3b467ec-fdfc-4c7a-9725-8a6d1fe18c6d/ha_agent/hosted-engine.lockspace:0
which resulted in:
init done -19
There was no further information, and nothing changed.
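The -19 may be a negated errno value. This is an assumption, since the sanlock output itself does not say so. If it holds, the value can be decoded with a small Python sketch like the one below; the helper name is made up for illustration.

```python
import errno
import os


def describe_errno(rv):
    """Return (symbolic name, message) for rv, assuming rv is a negated errno value."""
    code = abs(rv)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


name, message = describe_errno(-19)
print(name, message)
```

On Linux, errno 19 is ENODEV ("No such device"), which would suggest that sanlock could not open or use the lockspace path given on the command line, rather than that the lockspace contents were bad.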
With kind regards,
Joost