On Thu, Oct 18, 2018 at 3:43 PM fsoyer <fsoyer@systea.fr> wrote:

Hi,
I forgot to look in the /var/log/messages file on the host ! What a shame :/
Here is the messages file at the time of the error : https://gist.github.com/fsoyer/4d1247d4c3007a8727459efd23d89737
At the sasme time, the second host as no particular messages in its log.
Does anyone have an idea of the source problem ?

The problem started when sanlock could not renew storage leases held by some processes:

Oct 16 11:01:46 victor sanlock[904]: 2018-10-16 11:01:46 2945585 [4167]: s3 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/ids

Oct 16 11:01:46 victor sanlock[904]: 2018-10-16 11:01:46 2945585 [4167]: s3 renewal error -202 delta_length 25 last_success 2945539

After 80 seconds, the vms are terminated by sanlock:

Oct 16 11:02:19 victor sanlock[904]: 2018-10-16 11:02:18 2945617 [904]: s1 check_our_lease failed 80

Oct 16 11:02:19 victor sanlock[904]: 2018-10-16 11:02:18 2945617 [904]: s1 kill 13823 sig 15 count 1

But process 13823 cannot be killed, since it is blocked on storage, so sanlock send many more

TERM signals:

Oct 16 11:02:33 victor sanlock[904]: 2018-10-16 11:02:33 2945633 [904]: s1 kill 13823 sig 15 count 17

The VM finally dies after 17 retries:

Oct 16 11:02:33 victor sanlock[904]: 2018-10-16 11:02:33 2945633 [904]: dead 13823 ci 10 count 17

We can see the same flow for other processes (HA VMs?)

This allows the system to start the HA VM

on another host, which is what we see in the events log in the first message.

Trying to restart VM npi2 on Host victor.local.systea.fr
16 oct. 2018 11:02:33
Highly Available VM npi2 failed. It will be restarted automatically.
16 oct. 2018 11:02:33
VM npi2 is down with error. Exit message: VM has been terminated on the host.

If the VMs were not started successfully on the other hosts, maybe the storage domain

used for VM lease is not accessible?

It is recommended to choose the same storage domain used by the other VM disks for

the VM lease.

Also check that all storage domains are accessible - if they are not you will have warnings

in /var/log/vdsm/vdsm.log.

Nir

--
Cordialement,

Frank

Le Mardi, Octobre 16, 2018 13:25 CEST, "fsoyer" <fsoyer@systea.fr> a écrit:

Hi all,
this morning, some of my VMs were restarted unexpectidly. The events in GUI say :
16 oct. 2018 11:03:50
Trying to restart VM patjoub1 on Host ginger.local.systea.fr
16 oct. 2018 11:03:26
Trying to restart VM op2drugs1 on Host victor.local.systea.fr
16 oct. 2018 11:03:23
Trying to restart VM npi2 on Host ginger.local.systea.fr
16 oct. 2018 11:02:54
Trying to restart VM op2drugs1 on Host victor.local.systea.fr
16 oct. 2018 11:02:54
Trying to restart VM patjoub1 on Host ginger.local.systea.fr
16 oct. 2018 11:02:53
Highly Available VM op2drugs1 failed. It will be restarted automatically.
16 oct. 2018 11:02:53
Failed to restart VM patjoub1 on Host victor.local.systea.fr
16 oct. 2018 11:02:53
VM op2drugs1 is down with error. Exit message: VM has been terminated on the host.
16 oct. 2018 11:02:53
VM patjoub1 is down with error. Exit message: Failed to acquire lock: Aucun espace disponible sur le périphérique.
16 oct. 2018 11:02:47
Trying to restart VM npi2 on Host ginger.local.systea.fr
16 oct. 2018 11:02:46
Failed to restart VM npi2 on Host victor.local.systea.fr
16 oct. 2018 11:02:46
VM npi2 is down with error. Exit message: Failed to acquire lock: Aucun espace disponible sur le périphérique.
16 oct. 2018 11:02:38
Trying to restart VM patjoub1 on Host victor.local.systea.fr
16 oct. 2018 11:02:37
Highly Available VM patjoub1 failed. It will be restarted automatically.
16 oct. 2018 11:02:37
VM patjoub1 is down with error. Exit message: VM has been terminated on the host.
16 oct. 2018 11:02:36
VM patjoub1 is not responding.
16 oct. 2018 11:02:36
VM altern8 is not responding.
16 oct. 2018 11:02:36
VM Sogov3 is not responding.
16 oct. 2018 11:02:36
VM cerbere3 is not responding.
16 oct. 2018 11:02:36
VM Mint19 is not responding.
16 oct. 2018 11:02:35
VM cerbere4 is not responding.
16 oct. 2018 11:02:35
VM zabbix is not responding.
16 oct. 2018 11:02:34
Trying to restart VM npi2 on Host victor.local.systea.fr
16 oct. 2018 11:02:33
Highly Available VM npi2 failed. It will be restarted automatically.
16 oct. 2018 11:02:33
VM npi2 is down with error. Exit message: VM has been terminated on the host.
16 oct. 2018 11:02:20
VM cerbere3 is not responding.
16 oct. 2018 11:02:20
VM logcollector is not responding.
16 oct. 2018 11:02:20
VM HostedEngine is not responding.
with engine. log : https://gist.github.com/fsoyer/e3b74b4693006736b4f737b642aed0ef
searching for "Failed to acquire lock" I see a post about sanlock.log. Here it is at the time of the restart : https://gist.github.com/fsoyer/8d6952e85623a12f09317652aa4babd7
(hope that you can display this gists)

First question : there is all the days those message "delta_renew long write time". What does this mean ? Even if I suspect some storage problem, I don't see latency on it (configuration described bellow).
Second question : what append that force some VMs (not all, and not and the sams host !) to restart ? Where and what must I search for ?
Thanks

Configuration
2 DELL R620 as ovirt hosts (4.2.8-2) with hosted-engine, also members of a gluster 3.12.13-1 cluster with an arbiter (1 DELL R310, non-ovirt). The DATAs and ENGINE storages are on gluster volumes. Around 11am, I do not see any specific messages in glusterd.log or glfsheal-*.log. Gluster is on a separate network (2*1G bond mode 4=aggegation) than ovirmgmt (2*1G bond mode 1=failover).

--
Regards,

Frank

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/XFFJT4NORIELIOAGPHU4CUPC67KY3MMP/