VMs unexpectedly restarted

Hi all,

This morning some of my VMs were restarted unexpectedly. The events in the GUI say:

16 oct. 2018 11:03:50 Trying to restart VM patjoub1 on Host ginger.local.systea.fr
16 oct. 2018 11:03:26 Trying to restart VM op2drugs1 on Host victor.local.systea.fr
16 oct. 2018 11:03:23 Trying to restart VM npi2 on Host ginger.local.systea.fr
16 oct. 2018 11:02:54 Trying to restart VM op2drugs1 on Host victor.local.systea.fr
16 oct. 2018 11:02:54 Trying to restart VM patjoub1 on Host ginger.local.systea.fr
16 oct. 2018 11:02:53 Highly Available VM op2drugs1 failed. It will be restarted automatically.
16 oct. 2018 11:02:53 Failed to restart VM patjoub1 on Host victor.local.systea.fr
16 oct. 2018 11:02:53 VM op2drugs1 is down with error. Exit message: VM has been terminated on the host.
16 oct. 2018 11:02:53 VM patjoub1 is down with error. Exit message: Failed to acquire lock: Aucun espace disponible sur le périphérique. [No space left on device]
16 oct. 2018 11:02:47 Trying to restart VM npi2 on Host ginger.local.systea.fr
16 oct. 2018 11:02:46 Failed to restart VM npi2 on Host victor.local.systea.fr
16 oct. 2018 11:02:46 VM npi2 is down with error. Exit message: Failed to acquire lock: Aucun espace disponible sur le périphérique. [No space left on device]
16 oct. 2018 11:02:38 Trying to restart VM patjoub1 on Host victor.local.systea.fr
16 oct. 2018 11:02:37 Highly Available VM patjoub1 failed. It will be restarted automatically.
16 oct. 2018 11:02:37 VM patjoub1 is down with error. Exit message: VM has been terminated on the host.
16 oct. 2018 11:02:36 VM patjoub1 is not responding.
16 oct. 2018 11:02:36 VM altern8 is not responding.
16 oct. 2018 11:02:36 VM Sogov3 is not responding.
16 oct. 2018 11:02:36 VM cerbere3 is not responding.
16 oct. 2018 11:02:36 VM Mint19 is not responding.
16 oct. 2018 11:02:35 VM cerbere4 is not responding.
16 oct. 2018 11:02:35 VM zabbix is not responding.
16 oct. 2018 11:02:34 Trying to restart VM npi2 on Host victor.local.systea.fr
16 oct. 2018 11:02:33 Highly Available VM npi2 failed. It will be restarted automatically.
16 oct. 2018 11:02:33 VM npi2 is down with error. Exit message: VM has been terminated on the host.
16 oct. 2018 11:02:20 VM cerbere3 is not responding.
16 oct. 2018 11:02:20 VM logcollector is not responding.
16 oct. 2018 11:02:20 VM HostedEngine is not responding.

Engine log: https://gist.github.com/fsoyer/e3b74b4693006736b4f737b642aed0ef

Searching for "Failed to acquire lock" I found a post pointing to sanlock.log. Here it is at the time of the restart: https://gist.github.com/fsoyer/8d6952e85623a12f09317652aa4babd7 (I hope you can display these gists.)

First question: every day I see those "delta_renew long write time" messages. What do they mean? Even though I suspect some storage problem, I don't see any latency on it (configuration described below).

Second question: what happened that forced some VMs (not all of them, and not all on the same host!) to restart? Where and for what should I search?

Thanks

Configuration: 2 DELL R620 as oVirt hosts (4.2.8-2) with hosted-engine, also members of a Gluster 3.12.13-1 cluster with an arbiter (1 DELL R310, non-oVirt). The DATA and ENGINE storage domains are on Gluster volumes. Around 11am I do not see any specific messages in glusterd.log or glfsheal-*.log. Gluster is on a separate network (2x1G bond, mode 4 = aggregation) from ovirtmgmt (2x1G bond, mode 1 = failover).

--
Regards,
Frank
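For reference, one quick way to see the kind of latency sanlock complains about is to time a direct read of the "ids" lease file it renews on each storage domain. This is only a sketch: the mount path should be taken from the host's own mount table, and <DOMAIN_UUID> below is a placeholder for the storage domain UUID.

  # time an O_DIRECT read of the sanlock lease file on the DATA02 mount,
  # bypassing the page cache so the Gluster round trip is actually measured
  time dd if=/rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/<DOMAIN_UUID>/dom_md/ids \
          of=/dev/null iflag=direct bs=1M count=1

Reads that regularly take several seconds would be consistent with the "delta_renew long write time" warnings.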

Hi,

I forgot to look in the /var/log/messages file on the host! What a shame :/

Here is the messages file at the time of the error: https://gist.github.com/fsoyer/4d1247d4c3007a8727459efd23d89737

At the same time, the second host has no particular messages in its log. Does anyone have an idea of the source of the problem?

--
Regards,
Frank

And I add the log of one of the restarted VMs: https://gist.github.com/fsoyer/b63daa0653d91a59ffc65f2b6ad263f6

--
Regards,
Frank

On Thu, Oct 18, 2018 at 3:43 PM fsoyer <fsoyer@systea.fr> wrote:
Hi, I forgot to look in the /var/log/messages file on the host! What a shame :/ Here is the messages file at the time of the error: https://gist.github.com/fsoyer/4d1247d4c3007a8727459efd23d89737 At the same time, the second host has no particular messages in its log. Does anyone have an idea of the source of the problem?
The problem started when sanlock could not renew the storage leases held by some processes:

Oct 16 11:01:46 victor sanlock[904]: 2018-10-16 11:01:46 2945585 [4167]: s3 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02/ffc53fd8-c5d1-4070-ae51-2e91835cd937/dom_md/ids
Oct 16 11:01:46 victor sanlock[904]: 2018-10-16 11:01:46 2945585 [4167]: s3 renewal error -202 delta_length 25 last_success 2945539

After 80 seconds, the VMs are terminated by sanlock:

Oct 16 11:02:19 victor sanlock[904]: 2018-10-16 11:02:18 2945617 [904]: s1 check_our_lease failed 80
Oct 16 11:02:19 victor sanlock[904]: 2018-10-16 11:02:18 2945617 [904]: s1 kill 13823 sig 15 count 1

But process 13823 cannot be killed, since it is blocked on storage, so sanlock sends many more TERM signals:

Oct 16 11:02:33 victor sanlock[904]: 2018-10-16 11:02:33 2945633 [904]: s1 kill 13823 sig 15 count 17

The VM finally dies after 17 retries:

Oct 16 11:02:33 victor sanlock[904]: 2018-10-16 11:02:33 2945633 [904]: dead 13823 ci 10 count 17

We can see the same flow for other processes (HA VMs?). This allows the system to start the HA VM on another host, which is what we see in the events log in the first message:

16 oct. 2018 11:02:34 Trying to restart VM npi2 on Host victor.local.systea.fr
16 oct. 2018 11:02:33 Highly Available VM npi2 failed. It will be restarted automatically.
16 oct. 2018 11:02:33 VM npi2 is down with error. Exit message: VM has been terminated on the host.

If the VMs were not started successfully on the other hosts, maybe the storage domain used for the VM lease is not accessible? It is recommended to choose the same storage domain used by the other VM disks for the VM lease. Also check that all storage domains are accessible - if they are not, you will have warnings in /var/log/vdsm/vdsm.log.

Nir
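If it helps to correlate on the host itself, the same picture can usually be pulled live from sanlock; this is a sketch assuming the default sanlock log location (the same file shown in the gist above):

  # lockspaces and the processes currently holding leases on this host
  sanlock client status
  # renewal errors and slow-renewal warnings around the time of the incident
  grep -E 'renewal error|delta_renew' /var/log/sanlock.log | tail -n 50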

Hi Nir,

Thank you for this detailed analysis.

As far as I can see, the first VM to shut down had its lease on the hosted-engine storage domain (probably not the best choice, maybe left over from a test) and its disk on DATA02. The 3 others (HA VMs) had their lease on the same domain as their disk (DATA02). So I suppose this looks like Gluster latency on DATA02. But what I don't understand at this point is:
- if this was a lease problem on DATA02, the VM npi2 should not have been impacted... Or DATA02 was inaccessible, and the messages should have reported a storage error (something like "IO error", I suppose?)
- if this was a problem on the hosted-engine storage domain too, the engine did not restart (if the domain had been off or blocked, it would have, wouldn't it?), nor was it marked as not responding, even temporarily
- Gluster saw absolutely nothing at the same time, on the engine domain or DATA02: the logs of the daemons and bricks show nothing relevant.

Unfortunately, I no longer have the vdsm log file from the time of the problem: it is rotated and compressed every 2 hours, and I discovered that if you uncompress "vdsm.log.1.xz", for example, at the next rotation the system overwrites it with the latest log :(

I'm afraid I need to wait for another occurrence of the problem to re-scan all the logs and try to understand what happened...

--
Regards,
Frank
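One way to read the rotated vdsm logs without disturbing the files logrotate manages, assuming the usual xz tools are installed on the host, is to grep the archives in place or decompress them to a copy elsewhere:

  # search the compressed rotations without unpacking them
  xzgrep -E 'WARN|ERROR' /var/log/vdsm/vdsm.log.*.xz
  # or decompress to a copy in another directory, leaving the original .xz untouched
  xzcat /var/log/vdsm/vdsm.log.1.xz > /tmp/vdsm.log.1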

Well guys, I can say now that I have a real problem, maybe between oVirt and the Gluster storage, but I can't be sure.

Yesterday I wanted to clone a VM (named "crij2") from a snapshot, but (this is another problem, I think) the UI never gave me the popup (blank window with the cursor, then a 400 message after a timeout). So I decided to export it, then import it. The export/import finally worked, but while it was running some VMs became randomly unresponsive, and one restarted on error. At this time the engine was on the "ginger" node. Copy of the event log:

27 oct. 2018 20:32:12 VM crij2 started on Host victor.local.systea.fr
27 oct. 2018 20:31:37 VM crij2 was started by admin@internal-authz (Host: victor.local.systea.fr).
27 oct. 2018 20:26:39 Vm crij2 was imported successfully to Data Center Default, Cluster Default
27 oct. 2018 20:22:53 VM logcollector is not responding.
27 oct. 2018 20:22:10 VM Sogov3 is not responding.
27 oct. 2018 20:17:53 VM cerbere4 is not responding.
27 oct. 2018 20:17:49 VM cerbere3 is not responding.
27 oct. 2018 20:17:48 VM logcollector is not responding.
27 oct. 2018 20:16:38 VM Sogov3 is not responding.
27 oct. 2018 20:16:38 VM cerbere4 is not responding.
27 oct. 2018 20:16:38 VM op2drugs1 is not responding.
27 oct. 2018 20:16:33 VM cerbere3 is not responding.
27 oct. 2018 20:07:30 VM op2drugs1 is not responding.
27 oct. 2018 20:06:14 VM cerbere3 is not responding.
27 oct. 2018 20:02:27 VM cerbere3 is not responding.
27 oct. 2018 20:01:11 VM logcollector is not responding.
27 oct. 2018 20:00:56 VM zabbix is not responding.
27 oct. 2018 19:57:42 VM zabbix is not responding.
27 oct. 2018 19:57:42 VM cerbere3 is not responding.
27 oct. 2018 19:57:42 VM logcollector is not responding.
27 oct. 2018 19:54:40 VM zabbix is not responding.
27 oct. 2018 19:53:25 VM cerbere3 is not responding.
27 oct. 2018 19:53:25 VM cerbere4 is not responding.
27 oct. 2018 19:48:29 Starting to import Vm crij2 to Data Center Default, Cluster Default
27 oct. 2018 19:47:41 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 19:46:46 VM crij2 was renamed from crij2 to crij2_ok by admin.
27 oct. 2018 19:46:46 VM crij2 configuration was updated by admin@internal-authz.
27 oct. 2018 19:46:12 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 19:42:36 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 19:37:22 Vm crij2 was exported successfully to EXPORT
27 oct. 2018 19:36:04 VM HostedEngine is not responding.
27 oct. 2018 19:33:03 VM op2drugs1 is not responding.
27 oct. 2018 19:32:48 VM altern8 is not responding.
27 oct. 2018 19:32:48 VM patjoub1 is not responding.
27 oct. 2018 19:31:03 VM op2drugs1 is not responding.
27 oct. 2018 19:30:48 VM altern8 is not responding.
27 oct. 2018 19:30:48 VM patjoub1 is not responding.
27 oct. 2018 19:28:37 VM Sogov3 is not responding.
27 oct. 2018 19:28:07 VM altern8 is not responding.
27 oct. 2018 19:28:07 VM op2drugs1 is not responding.
27 oct. 2018 19:28:07 VM patjoub1 is not responding.
27 oct. 2018 19:25:10 VM Mint19 is not responding.
27 oct. 2018 19:25:10 VM zabbix is not responding.
27 oct. 2018 19:24:55 VM HostedEngine is not responding.
27 oct. 2018 19:23:33 VM op2drugs1 is not responding.
27 oct. 2018 19:23:18 VM altern8 is not responding.
27 oct. 2018 19:23:18 VM patjoub1 is not responding.
27 oct. 2018 19:21:52 VM op2drugs1 is not responding.
27 oct. 2018 19:20:06 VM patjoub1 is not responding.
27 oct. 2018 19:19:51 VM Sogov3 is not responding.
27 oct. 2018 19:18:26 Host ginger.local.systea.fr power management was verified successfully.
27 oct. 2018 19:18:26 Status of host ginger.local.systea.fr was set to Up.
27 oct. 2018 19:18:25 Manually synced the storage devices from host ginger.local.systea.fr
27 oct. 2018 19:17:51 Executing power management status on Host ginger.local.systea.fr using Proxy Host victor.local.systea.fr and Fence Agent ipmilan:10.0.0.225.
27 oct. 2018 19:17:39 Host ginger.local.systea.fr is not responding. It will stay in Connecting state for a grace period of 82 seconds and after that an attempt to fence the host will be issued.
27 oct. 2018 19:17:21 VM altern8 is not responding.
27 oct. 2018 19:17:21 Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host ginger.local.systea.fr, Error: Network error during communication with the Host.).
27 oct. 2018 19:17:21 VM patjoub1 is not responding.
27 oct. 2018 19:17:20 VM HostedEngine is not responding.
27 oct. 2018 19:17:20 VM op2drugs1 is not responding.
27 oct. 2018 19:17:19 VDSM ginger.local.systea.fr command SpmStatusVDS failed: Connection timeout for host 'ginger.local.systea.fr', last response arrived 17279 ms ago.
27 oct. 2018 19:16:16 Failed to update VMs/Templates OVF data for Storage Domain DATA02 in Data Center Default.
27 oct. 2018 19:16:16 Failed to update OVF disks 85d67951-d610-49b3-aaab-a81850621e35, OVF data isn't updated on those OVF stores (Data Center Default, Storage Domain DATA02).
27 oct. 2018 19:16:16 VDSM command SetVolumeDescriptionVDS failed: Resource timeout: ()
27 oct. 2018 19:16:16 VM patjoub1 is not responding.
27 oct. 2018 19:16:16 VM op2drugs1 is not responding.
27 oct. 2018 19:14:46 VM patjoub1 is not responding.
27 oct. 2018 19:14:46 VM op2drugs1 is not responding.
27 oct. 2018 19:13:18 Host ginger.local.systea.fr power management was verified successfully.
27 oct. 2018 19:13:18 Status of host ginger.local.systea.fr was set to Up.
27 oct. 2018 19:13:03 Manually synced the storage devices from host ginger.local.systea.fr
27 oct. 2018 19:12:51 VM altern8 is not responding.
27 oct. 2018 19:12:51 VM HostedEngine is not responding.
27 oct. 2018 19:12:51 VM op2drugs1 is not responding.
27 oct. 2018 19:12:48 Executing power management status on Host ginger.local.systea.fr using Proxy Host victor.local.systea.fr and Fence Agent ipmilan:10.0.0.225.
27 oct. 2018 19:12:44 Host ginger.local.systea.fr does not enforce SELinux. Current status: DISABLED
27 oct. 2018 19:12:36 Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host ginger.local.systea.fr, Error: Network error during communication with the Host.).
27 oct. 2018 19:12:28 Host ginger.local.systea.fr is not responding. It will stay in Connecting state for a grace period of 82 seconds and after that an attempt to fence the host will be issued.
27 oct. 2018 19:12:28 VDSM ginger.local.systea.fr command SpmStatusVDS failed: Connection timeout for host 'ginger.local.systea.fr', last response arrived 25225 ms ago.
27 oct. 2018 19:10:06 VM altern8 is not responding.
27 oct. 2018 19:10:06 VM patjoub1 is not responding.
27 oct. 2018 19:10:06 VM op2drugs1 is not responding.
27 oct. 2018 19:08:49 VM op2drugs1 is not responding.
27 oct. 2018 19:08:45 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 19:08:34 VM altern8 is not responding.
27 oct. 2018 19:08:34 VM patjoub1 is not responding.
27 oct. 2018 19:08:34 VM HostedEngine is not responding.
27 oct. 2018 19:04:01 VM op2drugs1 is not responding.
27 oct. 2018 19:01:08 VM HostedEngine is not responding.
27 oct. 2018 19:00:53 VM zabbix is not responding.
27 oct. 2018 19:00:01 Trying to restart VM npi2 on Host victor.local.systea.fr
27 oct. 2018 18:59:14 Trying to restart VM npi2 on Host victor.local.systea.fr
27 oct. 2018 18:59:13 Highly Available VM np2 failed. It will be restarted automatically.
27 oct. 2018 18:59:13 VM npi2 is down with error. Exit message: VM has been terminated on the host.
27 oct. 2018 18:59:05 VM altern8 is not responding.
27 oct. 2018 18:58:44 Storage domain DATA02 experienced a high latency of 6.16279 seconds from host ginger.local.systea.fr. This may cause performance and functional issues. Please consult your Storage Administrator.
27 oct. 2018 18:57:19 VM altern8 is not responding.
27 oct. 2018 18:57:19 VM patjoub1 is not responding.
27 oct. 2018 18:57:19 VM HostedEngine is not responding.
27 oct. 2018 18:57:19 VM op2drugs1 is not responding.
27 oct. 2018 18:55:56 VM altern8 is not responding.
27 oct. 2018 18:55:41 VM op2drugs1 is not responding.
27 oct. 2018 18:55:00 VM altern8 is not responding.
27 oct. 2018 18:54:45 VM op2drugs1 is not responding.
27 oct. 2018 18:52:21 VM Sogov3 is not responding.
27 oct. 2018 18:52:21 VM npi2 is not responding.
27 oct. 2018 18:50:50 VM altern8 is not responding.
27 oct. 2018 18:50:47 VM zabbix is not responding.
27 oct. 2018 18:48:16 VM op2drugs1 is not responding.
27 oct. 2018 18:48:03 VM altern8 is not responding.
27 oct. 2018 18:48:03 VM HostedEngine is not responding.
27 oct. 2018 18:45:48 Starting export Vm crij2 to EXPORT
27 oct. 2018 18:42:57 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 18:40:44 Refresh image list succeeded for domain(s): ISO (ISO file type)
27 oct. 2018 18:40:04 VM crij2 is down. Exit message: User shut down from within the guest
27 oct. 2018 18:39:25 User admin@internal-authz got disconnected from VM crij2.

I checked the network and Gluster while this was running but saw absolutely nothing special. The storage network was not under heavy load (bwm-ng showed at most 50 MB/s on bond1). The Gluster logs show no errors either (even though the engine reported some timeouts).

This morning I started to search for the cause and wanted to submit some logs to you on this thread, but I found something that had not caught my attention before, so I am asking about it first.

I recall the configuration: 3 hosts with Gluster (replica 2 + arbiter). The volumes are on a separate network (bond1 is an aggregation of 2 GbE cards, while ovirtmgmt is on bond0, 2 NICs in backup mode). For now I have only declared the first 2 nodes in the engine GUI as oVirt nodes, because the arbiter is a small machine with a weaker CPU (and only 8 GB of memory), which would have required downgrading the cluster from SandyBridge to Nehalem. Maybe that was a mistake. The storage network on bond1 was also declared in the GUI, but not yet as a gluster storage network. The Gluster volumes themselves were declared on the storage network, using names defined in /etc/hosts on the bond1 network.
Here is a volume status:

# gluster volume status
Status of volume: DATA01
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data01/data01/brick     49152     0          Y       2489
Brick gingerstorage.local.systea.fr:/home/data01/data01/brick     49152     0          Y       2531
Brick eskarinastorage.local.systea.fr:/home/data01/data01/brick   49153     0          Y       28119
Self-heal Daemon on localhost                                     N/A       N/A        Y       24859
Self-heal Daemon on eskarinastorage.local.systea.fr               N/A       N/A        Y       30725
Self-heal Daemon on victorstorage.local.systea.fr                 N/A       N/A        Y       2810

Task Status of Volume DATA01
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: DATA02
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data02/data02/brick     49153     0          Y       2553
Brick gingerstorage.local.systea.fr:/home/data02/data02/brick     49153     0          Y       2561
Brick eskarinastorage.local.systea.fr:/home/data01/data02/brick   49154     0          Y       28204
Self-heal Daemon on localhost                                     N/A       N/A        Y       24859
Self-heal Daemon on eskarinastorage.local.systea.fr               N/A       N/A        Y       30725
Self-heal Daemon on victorstorage.local.systea.fr                 N/A       N/A        Y       2810

Task Status of Volume DATA02
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: ENGINE
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data02/engine/brick     49154     0          Y       2571
Brick gingerstorage.local.systea.fr:/home/data02/engine/brick     49154     0          Y       2610
Brick eskarinastorage.local.systea.fr:/home/data01/engine/brick   49152     0          Y       28013
Self-heal Daemon on localhost                                     N/A       N/A        Y       24859
Self-heal Daemon on eskarinastorage.local.systea.fr               N/A       N/A        Y       30725
Self-heal Daemon on victorstorage.local.systea.fr                 N/A       N/A        Y       2810

Task Status of Volume ENGINE
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: EXPORT
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data01/export/brick     49155     0          Y       2588
Brick gingerstorage.local.systea.fr:/home/data01/export/brick     49155     0          Y       2629
Brick eskarinastorage.local.systea.fr:/home/data01/export/brick   49156     0          Y       28384
Self-heal Daemon on localhost                                     N/A       N/A        Y       24859
Self-heal Daemon on eskarinastorage.local.systea.fr               N/A       N/A        Y       30725
Self-heal Daemon on victorstorage.local.systea.fr                 N/A       N/A        Y       2810

Task Status of Volume EXPORT
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: ISO
Gluster process                                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick victorstorage.local.systea.fr:/home/data01/iso/brick        49156     0          Y       2595
Brick gingerstorage.local.systea.fr:/home/data01/iso/brick        49156     0          Y       2636
Brick eskarinastorage.local.systea.fr:/home/data01/iso/brick      49155     0          Y       28292
Self-heal Daemon on localhost                                     N/A       N/A        Y       24859
Self-heal Daemon on eskarinastorage.local.systea.fr               N/A       N/A        Y       30725
Self-heal Daemon on victorstorage.local.systea.fr                 N/A       N/A        Y       2810

Task Status of Volume ISO
-------------------------------

But a df on the nodes shows that all volumes except ENGINE are mounted over the ovirtmgmt network (host names without "storage"):

gingerstorage.local.systea.fr:/ENGINE  5,0T  226G  4,7T   5%  /rhev/data-center/mnt/glusterSD/gingerstorage.local.systea.fr:_ENGINE
victor.local.systea.fr:/DATA01         1,3T  425G  862G  33%  /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA01
victor.local.systea.fr:/DATA02         5,0T  226G  4,7T   5%  /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02
victor.local.systea.fr:/ISO            1,3T  425G  862G  33%  /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_ISO
victor.local.systea.fr:/EXPORT         1,3T  425G  862G  33%  /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_EXPORT

I can't remember how it was declared at install time (maybe I just did not notice it), but if I try to add a new Gluster-managed storage domain now, it indeed only offers me the nodes by their ovirtmgmt names, not their storage names. The names are only known in /etc/hosts on all nodes plus the engine; there is no DNS for these local addresses.

So: in your opinion, can this configuration be a (or the) source of my problems? And do you have an idea how I could correct it now, without losing anything?

Thanks for all suggestions.

--
Regards,
Frank
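As far as I understand, the hostname used for the fuse mount is mainly used to fetch the volume file; the actual data traffic goes to the brick hostnames defined in the volume (the *storage names above). A quick way to double-check which network the clients really use, reusing the brick port range shown in the volume status, could be:

  # brick definitions: these hostnames carry the data traffic
  gluster volume info DATA02 | grep -i brick
  # established connections from the fuse clients to the brick ports (49152-49159 here)
  ss -tnp | grep -E ':4915[0-9]'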

On Sun, Oct 28, 2018 at 5:17 PM fsoyer <fsoyer@systea.fr> wrote:
So: in your opinion, can this configuration be a (or the) source of my problems? And do you have an idea how I could correct it now, without losing anything?
I don't think this is the cause of your issues. Are there errors in vdsm logs? Do you have issues with storage latency (can you check the gluster volume profile output?)
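For reference, profiling can be switched on per volume and read back later; this is only a sketch, with the volume name taken from the thread:

  gluster volume profile DATA02 start
  # ...let it run across normal load or the next incident...
  gluster volume profile DATA02 info
  gluster volume profile DATA02 stop

The per-brick latency statistics in the info output usually make a slow brick or network leg stand out.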
participants (4)
- fsoyer
- fsoyer@systea.fr
- Nir Soffer
- Sahina Bose