
Hello,

I have a 3-node HCI cluster with GlusterFS, running oVirt 4.4.9.5-1. In the last 2 weeks I have experienced 2 outages in which the HE and all or some VMs were restarted. Digging through the logs, I can see that sanlock cannot renew its leases, which leads to the VMs being killed, as is described very well in [1].

It looks to me like a hardware issue on one of the hosts, but I cannot find which one. For example, today's outage restarted VMs on hosts 1 and 2 but not on host 3. There are these sanlock lines in /var/log/messages on host 2 (ovirt-hci02):

Jan 13 08:27:25 ovirt-hci02 sanlock[1263]: 2022-01-13 08:27:25 1416706 [341378]: s7 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/10.0.4.11:_vms/6de5ae6d-c7cc-4292-bdbf-10495a38837b/dom_md/ids
Jan 13 08:28:59 ovirt-hci02 sanlock[1263]: 2022-01-13 08:28:59 1416800 [341257]: write_sectors delta_leader offset 1024 rv -202 /rhev/data-center/mnt/glusterSD/10.0.4.11:_engine/816a3d0b-2e10-4900-b3cb-4a9b5cd0dd5d/dom_md/ids
Jan 13 08:29:27 ovirt-hci02 sanlock[1263]: 2022-01-13 08:29:27 1416828 [4189968]: write_sectors delta_leader offset 1024 rv -202 /rhev/data-center/mnt/glusterSD/10.0.4.11:_engine/816a3d0b-2e10-4900-b3cb-4a9b5cd0dd5d/dom_md/ids

but not on hosts 1 and 3. Could this indicate a storage-related problem on host 1? Could you please suggest a further or better debugging approach?

Thanks a lot,
Jiri

[1] https://www.ovirt.org/develop/developer-guide/vdsm/sanlock.html
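
P.S. To try to narrow down which host's mount goes slow, I am thinking of running a small probe like the sketch below on all three hosts in parallel and comparing its timestamps against the sanlock messages at the next outage. It is not sanlock's exact I/O pattern, just a rough timing of a similar read (an O_DIRECT read of the first block of the lease file); the path is copied from the logs above, and the 5-second interval and 1-second warning threshold are arbitrary values I picked myself.

#!/usr/bin/env python3
"""Rough read-latency probe for the sanlock 'ids' lease file.

Run the same script on all hosts in parallel; whichever host starts
reporting slow or failed reads first is the one to look at.
"""
import mmap
import os
import time

# Path copied from the sanlock log lines above; adjust per storage domain.
IDS_PATH = ("/rhev/data-center/mnt/glusterSD/10.0.4.11:_vms/"
            "6de5ae6d-c7cc-4292-bdbf-10495a38837b/dom_md/ids")
BLOCK = 4096        # O_DIRECT needs an aligned, block-sized buffer
INTERVAL = 5        # seconds between probes (my arbitrary choice)
WARN_AT = 1.0       # flag reads slower than 1 s (sanlock's timeout is 10 s)

def probe_once(path, buf):
    """Time a single O_DIRECT read of the first block of the lease file."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        start = time.monotonic()
        os.preadv(fd, [buf], 0)    # read BLOCK bytes at offset 0
        return time.monotonic() - start
    finally:
        os.close(fd)

def main():
    # An anonymous mmap gives a page-aligned buffer, which O_DIRECT requires.
    buf = mmap.mmap(-1, BLOCK)
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            elapsed = probe_once(IDS_PATH, buf)
            mark = "  <-- SLOW" if elapsed > WARN_AT else ""
            print("%s read %.0f ms%s" % (stamp, elapsed * 1000, mark),
                  flush=True)
        except OSError as err:
            # O_DIRECT may also fail outright on some FUSE/gluster configs.
            print("%s read FAILED: %s" % (stamp, err), flush=True)
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()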