Hello,
I have a 3-node HCI cluster with GlusterFS, running oVirt 4.4.9.5-1. In the
last 2 weeks I have experienced 2 outages where the HE and all/some VMs were
restarted. While digging through the logs I can see that sanlock cannot
renew its leases, which leads to the VMs being killed, as is described very
well in [1].
It looks to me like a hardware issue with one of the hosts, but I cannot
find which one.
For example, today's outage restarted VMs on hosts 1 and 2, but not on host 3.
Sanlock logs: there are these lines in /var/log/messages on host 2
(ovirt-hci02):
Jan 13 08:27:25 ovirt-hci02 sanlock[1263]: 2022-01-13 08:27:25 1416706
[341378]: s7 delta_renew read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/10.0.4.11:_vms/6de5ae6d-c7cc-4292-bdbf-10495a38837b/dom_md/ids
Jan 13 08:28:59 ovirt-hci02 sanlock[1263]: 2022-01-13 08:28:59 1416800
[341257]: write_sectors delta_leader offset 1024 rv -202
/rhev/data-center/mnt/glusterSD/10.0.4.11:_engine/816a3d0b-2e10-4900-b3cb-4a9b5cd0dd5d/dom_md/ids
Jan 13 08:29:27 ovirt-hci02 sanlock[1263]: 2022-01-13 08:29:27 1416828
[4189968]: write_sectors delta_leader offset 1024 rv -202
/rhev/data-center/mnt/glusterSD/10.0.4.11:_engine/816a3d0b-2e10-4900-b3cb-4a9b5cd0dd5d/dom_md/ids
but not on hosts 1 and 3. Could this indicate a storage-related problem on
host 1?
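
To compare the hosts, my plan is to pull the sanlock errors out of
/var/log/messages on each of them with a small script along these lines
(delta_renew and write_sectors are the messages from the excerpt above;
"renewal error" and "check_our_lease" are other messages sanlock prints
around lease expiry, as far as I understand):

#!/usr/bin/env python3
# Scan a log file for sanlock lease-renewal problems and print their
# timestamps, so the three hosts can be compared side by side.
import re
import sys

LOG = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"

# Capture sanlock's own timestamp, then look for the error keywords.
pat = re.compile(r"sanlock\[\d+\]: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
                 r".*(delta_renew|write_sectors|renewal error|check_our_lease)")

with open(LOG, errors="replace") as f:
    for line in f:
        m = pat.search(line)
        if m:
            print(m.group(1), line.strip()[:160])

The idea is to run it on all three hosts and see which one started failing
first, and whether the failures line up in time.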
Could you please suggest a further/better debugging approach?
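
One thing I was considering is measuring plain write+fsync latency against
the gluster mount from each host, e.g. with a quick probe like this (the
mount path is my vms domain from the logs above; the 1 MiB block, 30
iterations and the 1 s "slow" threshold are arbitrary choices on my part):

#!/usr/bin/env python3
# Rough write+fsync latency probe against a storage-domain mount.
import os
import time

MNT = "/rhev/data-center/mnt/glusterSD/10.0.4.11:_vms"
path = os.path.join(MNT, "latency_probe.tmp")
block = b"\0" * (1 << 20)  # 1 MiB per write

for i in range(30):
    t0 = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    os.write(fd, block)
    os.fsync(fd)  # force the write down to gluster before timing stops
    os.close(fd)
    dt = time.monotonic() - t0
    print(f"write {i:02d}: {dt * 1000:8.1f} ms"
          + ("  <-- slow" if dt > 1.0 else ""))
    time.sleep(1)

os.unlink(path)

If one host consistently shows multi-second spikes while the others do not,
I would read that as pointing at its disks or network, but I am not sure
this is the right way to narrow it down.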
Thanks a lot,
Jiri
[1] https://www.ovirt.org/develop/developer-guide/vdsm/sanlock.html