Lockspace volume not recognized after NFS outage

Hi,

I am running a 3-node oVirt cluster with a hosted engine. Unfortunately I had a small issue on my NFS server, which provides the shared storage for the cluster. During the outage all VMs went into pause, and the cluster itself (hosted engine) went down. After the NFS service was restored (it took 2 minutes), the cluster did not recover. It seems the HA agent can't make sense of the lockspace anymore. The agent fails on all nodes.

There are several errors in the logs, but the main one is (I think) in broker.log:

    MainThread::WARNING::2022-05-01 22:06:06,085::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Command Image.prepare with args {'imageID': 'c1b7e131-bbde-416b-b2a0-de746a039dfd', 'storagepoolID': '00000000-0000-0000-0000-000000000000', 'volumeID': 'b221ea37-2c59-49bf-89f7-83766fb53717', 'storagedomainID': 'e3b467ec-fdfc-4c7a-9725-8a6d1fe18c6d'} failed: (code=201, message=Volume does not exist: (u'b221ea37-2c59-49bf-89f7-83766fb53717',))

Which is weird, because the volume seems to exist:

    ls -l /rhev/data-center/mnt/nas.fritz.box:_mnt_HD_HD__a2_hosted__engine__nas/e3b467ec-fdfc-4c7a-9725-8a6d1fe18c6d/images/c1b7e131-bbde-416b-b2a0-de746a039dfd
    total 1049608
    -rw-rw----. 1 vdsm kvm 1073741824 May  1 15:35 b221ea37-2c59-49bf-89f7-83766fb53717
    -rw-rw----. 1 vdsm kvm    1048576 Dec 21  2020 b221ea37-2c59-49bf-89f7-83766fb53717.lease
    -rw-rw-rw-. 1 vdsm kvm        329 Dec 21  2020 b221ea37-2c59-49bf-89f7-83766fb53717.meta

When I try to reinitialize the lockspace (stopping the agent etc.) I get:

    ----
      File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 60, in connect
        self.sock.connect(base64.b16decode(self.host))
      File "/usr/lib64/python2.7/socket.py", line 224, in meth
        return getattr(self._sock,name)(*args)
    socket.error: [Errno 2] No such file or directory

Is there a way to do this manually and also recreate the lockspace volume?

I already tried to recreate the lockspace manually with:

    sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/nas.fritz.box:_mnt_HD_HD__a2_hosted__engine__nas/e3b467ec-fdfc-4c7a-9725-8a6d1fe18c6d/ha_agent/hosted-engine.lockspace:0

which resulted in:

    init done -19

There was no further info and nothing changed.

With kind regards,
Joost
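The two numeric codes quoted above (`[Errno 2]` from the socket traceback and `-19` from `sanlock direct init`) can be turned into symbolic names with Python's standard `errno` module. The following is only a minimal sketch: the helper name `describe_errno` is hypothetical, and treating the sanlock result as a negated errno value is an assumption, not something the thread confirms.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value; the sign is ignored.

    Hypothetical helper for illustration. It assumes a negative result such as
    the one printed by `sanlock direct init` is a negated errno value.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


# Values taken from the output quoted in the message above.
for label, code in [("socket.error", 2), ("sanlock direct init", -19)]:
    name, message = describe_errno(code)
    print("%s: %d -> %s (%s)" % (label, code, name, message))
```

On Linux, 2 maps to ENOENT and 19 maps to ENODEV. For a Unix socket connect, ENOENT means the socket path did not exist at the time of the call; what ENODEV means for the lockspace file is not established by anything in this thread.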

On Mon, 2 May 2022 at 17:22, <joustie@gmail.com> wrote:
> Hi,
> I am running a 3 node oVirt cluster with a hosted-engine.
> [cut]
> When I try to reinitialize the lockspace (stopping the agent etc) I get:
> ----
>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 60, in connect
>     self.sock.connect(base64.b16decode(self.host))
>   File "/usr/lib64/python2.7/socket.py", line 224, in meth
>     return getattr(self._sock,name)(*args)
> socket.error: [Errno 2] No such file or directory

Hi,

The last oVirt release using Python 2.7 was oVirt 4.3, which went EOL 2 years ago. As a project we no longer consider reports against old releases. Please upgrade as soon as practical, and if the problem still reproduces, please let us know and we'll be happy to help get it fixed.

Thanks,
--
Sandro Bonazzola
MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
Red Hat EMEA <https://www.redhat.com/>
sbonazzo@redhat.com

participants (2): joustie@gmail.com, Sandro Bonazzola