Hi,
thanks for the reply.
On 1/13/22 at 16:04, Vojtech Juranek wrote:
Hi,
> Hello,
>
> I have a 3-node HCI cluster with glusterfs, oVirt 4.4.9.5-1. In the last 2
> weeks I have experienced 2 outages where the HE and all/some VMs were
> restarted. While digging in the logs I can see that sanlock cannot renew
> its leases, which leads to the VMs being killed, as is very well described
> in [1].
>
> It looks to me like a hardware issue with one of the hosts, but I cannot
> find which one.
When you check the sanlock logs (/var/log/sanlock.log) around the time of the
outage, you should be able to see which of the hosts failed to renew its
sanlock leases. It could be only some of them (which points to an issue with
those host(s)) or all of them (in which case a network or storage issue is
more likely).
It looks like all of the hosts had renewal issues; the first was ovirt-hci02:
Jan 13 08:27:25 ovirt-hci02 sanlock[1263]: 2022-01-13 08:27:25 1416706
[341378]: s7 delta_renew read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/10.0.4.11:_vms/6de5ae6d-c7cc-4292-bdbf-10495a38837b/dom_md/ids
Jan 13 08:27:25 ovirt-hci02 sanlock[1263]: 2022-01-13 08:27:25 1416706
[341378]: s7 renewal error -202 delta_length 10 last_success 1416676
Jan 13 08:27:31 ovirt-hci01 sanlock[1375]: 2022-01-13 08:27:31 1420170
[766769]: s7 delta_renew long write time 20 sec
Jan 13 08:27:32 ovirt-hci03 sanlock[1457]: 2022-01-13 08:27:32 1412428
[761241]: s6 delta_renew long write time 11 sec
Jan 13 08:27:32 ovirt-hci01 sanlock[1375]: 2022-01-13 08:27:32 1420171
[764099]: s6 delta_renew long write time 18 sec
Jan 13 08:27:42 ovirt-hci02 sanlock[1263]: 2022-01-13 08:27:42 1416723
[341376]: s6 delta_renew long write time 21 sec
Jan 13 08:27:42 ovirt-hci02 sanlock[1263]: 2022-01-13 08:27:42 1416723
[341376]: s6 renewed 1416702 delta_length 22 too long
Jan 13 08:27:42 ovirt-hci01 sanlock[1375]: 2022-01-13 08:27:42 1420181
[766769]: s7 delta_renew long write time 11 sec
Jan 13 08:27:44 ovirt-hci03 sanlock[1457]: 2022-01-13 08:27:44 1412440
[761233]: s5 delta_renew long write time 30 sec
Jan 13 08:27:44 ovirt-hci03 sanlock[1457]: 2022-01-13 08:27:44 1412440
[761233]: s5 renewed 1412410 delta_length 30 too long
...
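A rough sketch of how an excerpt like the one above could be summarized per host, in case it helps to compare the three nodes. It assumes each log entry sits on a single line (as in /var/log/messages, not wrapped as in this mail) and that the prefix format matches the lines shown here. The function names and the "failed"/"slow" buckets are my own labels, not anything sanlock defines.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the excerpt in this thread:
# "Jan 13 08:27:25 ovirt-hci02 sanlock[1263]: 2022-01-13 08:27:25 1416706 [341378]: s7 <message>"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+sanlock\[\d+\]:\s+"
    r"\d{4}-\d{2}-\d{2}\s+(?P<time>[\d:]+)\s+\d+\s+\[\d+\]:\s+"
    r"(?:(?P<space>s\d+)\s+)?(?P<msg>.*)$"
)


def classify(msg):
    """Bucket a sanlock message; the bucket names are illustrative only."""
    if "renewal error" in msg or "read timeout" in msg:
        return "failed"
    if "long write time" in msg or "too long" in msg:
        return "slow"
    return "other"


def summarize(lines):
    """Return {host: {bucket: [(time, lockspace), ...]}} for renewal problems."""
    summary = defaultdict(lambda: defaultdict(list))
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a sanlock line in the assumed format
        kind = classify(m.group("msg"))
        if kind != "other":
            summary[m.group("host")][kind].append((m.group("time"), m.group("space")))
    return summary
```

Feeding it the concatenated sanlock lines from all three hosts, sorted by timestamp, gives a quick view of which host hit a hard renewal failure first and which ones only saw slow writes.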
It is a good point that it could be a network issue, but I have no proof of
it. The switches (10GE) were not restarted, and there are no errors on the
interfaces, no log entries, and no excessive traffic on any interface...
I am also confused by the line "...read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/10.0.4.11:_vms/6de5ae6d-c7cc-4292-bdbf-10495a38837b/dom_md/ids".
If I understand it correctly, it logs just the mount info, not the exact IP of
the gluster host from which ovirt-hci02 could not read the lock data, right?
Also, if you only want to find out which hosts weren't able to renew their
leases, it's even easier: they are the hosts whose VMs were killed. If a host
runs HA VMs and is not able to renew its leases, sanlock will kill the VMs
running on that host.
VMs were killed on ovirt-hci01 and ovirt-hci02 (I believe ovirt-hci01
was also hosting the HE at that time). It looks like VMs were first killed (or
rather, started to be killed) on ovirt-hci02 at 8:28:15, then on ovirt-hci01
at 8:28:51.
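As a sanity check on the timing, here is a small calculation based on the "renewal error -202 ... last_success 1416676" line from ovirt-hci02. It assumes sanlock's default I/O timeout of 10 seconds (which matches the "read timeout 10 sec" in the log) and that a lease is treated as failed 8 x io_timeout = 80 seconds after the last successful renewal. Both values are my assumptions about the defaults, not something I verified on these hosts, and the helper name is made up for illustration.

```python
from datetime import datetime, timedelta

IO_TIMEOUT = 10                # seconds; assumed sanlock default
RENEWAL_FAIL = 8 * IO_TIMEOUT  # assumed: lease fails 80 s after the last success


def lease_deadline(wall_clock, monotonic_now, last_success, fail_after=RENEWAL_FAIL):
    """Estimate the wall-clock time at which the lease would expire,
    provided no later renewal succeeds.

    wall_clock    -- timestamp of the log line, "YYYY-MM-DD HH:MM:SS"
    monotonic_now -- monotonic counter printed right after the timestamp
    last_success  -- value of the last_success field in the same line
    """
    logged_at = datetime.strptime(wall_clock, "%Y-%m-%d %H:%M:%S")
    age = monotonic_now - last_success  # seconds since the last good renewal
    return logged_at + timedelta(seconds=fail_after - age)


# Values from: "2022-01-13 08:27:25 1416706 [341378]: s7 renewal error -202
# delta_length 10 last_success 1416676"
print(lease_deadline("2022-01-13 08:27:25", 1416706, 1416676))
# prints 2022-01-13 08:28:15
```

Under those assumptions the estimated deadline falls at 8:28:15, the same time the VMs on ovirt-hci02 started to be killed. That would be consistent with s7 never renewing successfully after 08:26:55 on that host, although the elided part of the log would be needed to confirm it.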
Cheers,
Jiri
Vojta
> For example, today's outage restarted VMs on hosts 1 and 2 but not 3.
> Sanlock logs:
>
> There are these lines in /var/log/messages on host 2 (ovirt-hci02):
>
> Jan 13 08:27:25 ovirt-hci02 sanlock[1263]: 2022-01-13 08:27:25 1416706
> [341378]: s7 delta_renew read timeout 10 sec offset 0
> /rhev/data-center/mnt/glusterSD/10.0.4.11:_vms/6de5ae6d-c7cc-4292-bdbf-10495a38837b/dom_md/ids
> Jan 13 08:28:59 ovirt-hci02 sanlock[1263]: 2022-01-13 08:28:59 1416800
> [341257]: write_sectors delta_leader offset 1024 rv -202
> /rhev/data-center/mnt/glusterSD/10.0.4.11:_engine/816a3d0b-2e10-4900-b3cb-4a9b5cd0dd5d/dom_md/ids
> Jan 13 08:29:27 ovirt-hci02 sanlock[1263]: 2022-01-13 08:29:27 1416828
> [4189968]: write_sectors delta_leader offset 1024 rv -202
> /rhev/data-center/mnt/glusterSD/10.0.4.11:_engine/816a3d0b-2e10-4900-b3cb-4a9b5cd0dd5d/dom_md/ids
>
> but not on hosts 1 and 3. Could this indicate a storage-related
> problem on host 1?
>
> Could you please suggest a further/better debugging approach?
>
> Thanks a lot,
>
> Jiri
>
> [1] https://www.ovirt.org/develop/developer-guide/vdsm/sanlock.html