On Wed, Feb 16, 2022 at 10:10 AM Pablo Olivera <p.olivera(a)telfy.com> wrote:
Hi community,
We're dealing with an issue where we occasionally get random reboots
on any of our hosts.
We're using oVirt 4.4.3 in production with about 60 VMs distributed
over 5 hosts. We have a virtualized engine and DRBD storage mounted
over NFS.
The infrastructure is interconnected by a Cisco 9000 switch.
The last random reboot was yesterday, February 14th, at 03:03 PM (it
appears as 15:03 in the logs due to our time configuration) on the
host 'nodo1'.
At the moment of the reboot, the switch log shows a link-down event
on the port where the host is connected.
I'm attaching the engine log and the 'nodo1' host log in case you can
help us find the cause of these random reboots.
According to /var/log/messages:
1. Sanlock could not renew the lease for 80 seconds:
Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655257
[2017]: s1 check_our_lease failed 80
2. In this case sanlock must terminate the processes holding a lease
on that storage - I guess that pid 6398 is vdsm.
Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655257
[2017]: s1 kill 6398 sig 15 count 1
Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655258
[2017]: s1 kill 6398 sig 15 count 2
...
Feb 14 15:03:36 nodo1 sanlock[2017]: 2022-02-14 15:03:36 1655288
[2017]: s1 kill 6398 sig 15 count 32
3. Terminating pid 6398 stopped here, and we see:
Feb 14 15:03:36 nodo1 wdmd[2033]: test failed rem 19 now 1655288 ping
1655237 close 1655247 renewal 1655177 expire 1655257 client 2017
sanlock_a5c35d19-4c34-4571-ac77-1b10de484426:1
4. So it looks like wdmd rebooted the host (the timestamps are
recapped in the sketch after this list):
Feb 14 15:08:09 nodo1 kernel: Linux version
4.18.0-193.28.1.el8_2.x86_64 (mockbuild(a)kbuilder.bsys.centos.org) (gcc
version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Oct 22
00:20:22 UTC 2020
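For reference, a minimal sketch (mine, only recapping the numbers
quoted above) showing how the monotonic timestamps in the sanlock and
wdmd messages line up:

    # Monotonic timestamps taken from the sanlock/wdmd lines above.
    renewal    = 1655177  # wdmd "renewal": last successful lease renewal
    expire     = 1655257  # wdmd "expire": renewal + 80
    first_kill = 1655257  # sanlock "kill 6398 sig 15 count 1"
    last_kill  = 1655288  # sanlock "kill 6398 sig 15 count 32" / wdmd "test failed"

    print(expire - renewal)        # 80 -> matches "check_our_lease failed 80"
    print(last_kill - first_kill)  # 31 -> 32 SIGTERM attempts, about one per second

So sanlock started signalling the process exactly when the lease
expired, and had only reached 32 SIGTERM attempts when the host went
down.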
This is strange, since sanlock should try to kill pid 6398 40 times,
and then switch to SIGKILL. The watchdog should not reboot the host
before sanlock finishes its attempts to kill the process.
David, do you think this is expected? Do we have an issue in sanlock?
It is possible that sanlock will not be able to terminate a process if
the process is blocked on inaccessible storage. This seems to be the
case here.
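A possible way to confirm this the next time it happens (my sketch,
not something from the thread): a process blocked in uninterruptible
sleep shows state 'D' in /proc/<pid>/stat, and pending signals are not
delivered until the blocked I/O returns.

    # Minimal sketch: report the kernel state of a pid (here the pid from
    # the sanlock log). 'D' means uninterruptible sleep - typically stuck
    # in NFS I/O in this scenario - so SIGTERM/SIGKILL cannot take effect
    # until the storage call returns.
    def proc_state(pid):
        with open(f"/proc/{pid}/stat") as f:
            # /proc/<pid>/stat format: "pid (comm) state ...";
            # comm may contain spaces, so split on the closing paren.
            return f.read().rsplit(") ", 1)[1].split()[0]

    print(proc_state(6398))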
In the vdsm log we see that the storage is indeed inaccessible:
2022-02-14 15:03:03,149+0100 WARN (check/loop) [storage.check]
Checker
'/rhev/data-center/mnt/newstoragedrbd.andromeda.com:_var_nfsshare_data/a5c35d19-4c34-4571-ac77-1b10de484426/dom_md/metadata'
is blocked for 60.00 seconds (check:282)
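As a quick check from outside vdsm, timing a small read of that
metadata file shows the same thing (a rough illustration only, using
the path from the warning above; it is not how vdsm implements its
checker):

    # Time a small read of the storage domain metadata. On healthy NFS
    # this returns in milliseconds; on blocked storage it hangs, matching
    # the "blocked for 60.00 seconds" warning above.
    import time

    PATH = ("/rhev/data-center/mnt/newstoragedrbd.andromeda.com:"
            "_var_nfsshare_data/a5c35d19-4c34-4571-ac77-1b10de484426/"
            "dom_md/metadata")

    start = time.monotonic()
    with open(PATH, "rb") as f:
        f.read(4096)
    print(f"metadata read took {time.monotonic() - start:.2f} seconds")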
But we don't see any termination request - so this host is not the SPM.
I guess this host was running the hosted engine vm, which uses a
storage lease. If you lose access to storage, sanlock will kill the
hosted engine vm so the system can start it elsewhere. If the hosted
engine vm is stuck on storage, sanlock cannot kill it, and the
watchdog will reboot the host.
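To see which process holds the hosted engine lease on a host, running
"sanlock client status" as root lists the lockspaces and resources
together with the registered pids - for example, wrapped in a trivial
script:

    # Trivial wrapper around "sanlock client status" (requires root and
    # the sanlock package): the output includes the registered pids and
    # the lockspaces/resources (leases) they hold.
    import subprocess

    status = subprocess.run(["sanlock", "client", "status"],
                            capture_output=True, text=True, check=True)
    print(status.stdout)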
Nir