Forced restart when losing communication with the Storages

Good morning everyone! Is there a way to disable the forced reboot of the machines? This morning there was an event in our infrastructure where the hosts lost communication with the Storage, and this caused all the hosts to restart abruptly. Is this the correct behavior of oVirt? Is there any way to disable it?

Hi,
I would say that what you observed was ‘fencing’: not SSH soft fencing, but an actual reboot via IPMI.
https://www.ovirt.org/develop/developer-guide/engine/automatic-fencing.html
You can disable Power Management for the hosts.
Before doing that, you need to understand the following:
- What is the impact on the VMs when this happens? The working assumption is that your VMs keep working just fine, but you need to think about other cases where VMs lose their storage and/or network. To me it seems that the affected storage domain was not a VM storage domain, so the VMs’ disks were fine. Maybe it was the hosted_storage domain in your case…
- Are any of those VMs High-availability VMs? Once you disable Power Management, those VMs will no longer be restarted automatically on different hosts.
You need to understand that the idea of fencing is to recover the host automatically, possibly to restart its VMs, and to make sure that there are no duplicated VMs.
Take all the cases where fencing is used as 100%. In a subset of them, X%, you would consider the behavior suboptimal. The drawback of disabling fencing is that you might get suboptimal behavior in the remaining Y% of cases (100% minus X%).
BR,
Konstantin

In reply to Murilo Morais <murilo@evocorp.com.br>, Wednesday, 30 November 2022 at 12:13, sent to users <users@ovirt.org> with the subject "[ovirt-users] Forced restart when losing communication with the Storages".

Konstantin, thank you very much for the explanation, it was very enlightening.
I believe I left something out of the previous message.
I'm using Hosted Engine, all VMs have HA enabled, and Power Management is disabled on all hosts. No IPMI is configured (at least I didn't configure anything about iLO/IPMI in oVirt).
There was a loss of communication with the Storage for approximately 3 minutes, and this caused all Hosts to reboot.

In reply to Volenbovskyi, Konstantin <Konstantin.Volenbovskyi@haufe.com>, Wed, 30 Nov 2022 at 08:50.

Hello,
I've seen something similar to this too, although for me it occurred when a standalone engine attempted to allocate a new disk on a Gluster storage while the cluster's VMs were experiencing high virtual disk I/O. (Found out later they were doing updates at an odd time...)
The result was random VMs being forced off until enough of the bottleneck had cleared, and one host was rebooted after around 3 minutes of wait time. I'm assuming it used SSH, as the hosts in question have a configuration problem with their power management and currently cannot be reset by the PDU. But it was still an odd occurrence, given that the engine host itself was the cause of the storage "outage."
Is this the correct behavior of oVirt?
-Patrick Hibbs

In reply to Murilo Morais, 11/30/22 07:45.

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/JVNIMYBAXJE3YT...

Hi,
“I'm assuming it used ssh as the hosts in question have a configuration problem with their power management and cannot be reset currently by the PDU”
It would be interesting to find events in the logs showing that the oVirt engine really rebooted the oVirt hosts via SSH, in this case or in any case.
I guess that /var/log/messages and probably some engine logs should provide evidence of whether that is the case (and the engine logs may contain additional information about what functionality is behind ‘restart host that lost a connection to storage domain’).
BR,
Konstantin

In reply to Patrick Hibbs <hibbsncc1701@gmail.com>, Wednesday, 30 November 2022 at 16:31.
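One way to start the log search suggested above is a small keyword scan. This is only a sketch: the keyword list is a guess at what a fencing or reboot event might be logged as, not a list of real oVirt log messages, and the function names are made up for illustration.

```python
import re
from typing import Iterable, List, Sequence, Tuple

# Hypothetical keywords (assumption): adjust them to what your logs really contain.
# "fenc" is a substring meant to match both "fence" and "fencing".
KEYWORDS: Sequence[str] = ("fenc", "reboot", "restart", "watchdog")


def find_suspect_lines(
    lines: Iterable[str], keywords: Sequence[str] = KEYWORDS
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing any keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_file(path: str, keywords: Sequence[str] = KEYWORDS) -> List[Tuple[int, str]]:
    """Scan one log file; undecodable bytes are replaced instead of raising."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle, keywords)
```

For example, `scan_file("/var/log/messages")` on a host returns the matching lines with their line numbers, which can then be compared against the time of the outage. The same function can be pointed at the engine logs.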

Unfortunately, I haven't found anything like this. At least this type of event hasn't happened again, but I just can't find anything in the logs that justifies these reboots. I've read everything line by line; if anything was recorded there, it slipped past me.

In reply to Volenbovskyi, Konstantin <Konstantin.Volenbovskyi@haufe.com>, Wed, 30 Nov 2022 at 14:20.

List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/YDEO4442DANZ6W...

On 2022-12-13 12:29, Murilo Morais wrote:
Unfortunately, I haven't found anything like this. At least this type of event didn't happen again, but I just can't find anything in the logs that justifies these reboots, I've read everything line by line, if anything was recorded there it passed my eyes.
Reboots may also be triggered by the VDSM watchdog functionality:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/XRJXOF3CSDKBKN...
AFAICT you should be able to override the watchdog timeout. By using an extremely high value, it should in theory be possible to stop automatic reboots.
Ciao
- Frank
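The reasoning about the timeout can be illustrated with a toy model. It is not how VDSM or its watchdog is implemented: the function name and the timeout values are hypothetical, and the only figure taken from this thread is the outage of roughly 3 minutes.

```python
def watchdog_fires(outage_seconds: float, timeout_seconds: float) -> bool:
    """Toy model: the host is reset once storage has been unreachable
    for longer than the watchdog timeout."""
    return outage_seconds > timeout_seconds


OUTAGE_SECONDS = 3 * 60  # about 3 minutes, as reported in this thread

# Hypothetical timeout values (assumption), from short to extremely high.
for timeout in (60, 600, 86_400):
    action = "reboot" if watchdog_fires(OUTAGE_SECONDS, timeout) else "no reboot"
    print(f"timeout={timeout}s outage={OUTAGE_SECONDS}s -> {action}")
```

In this model any timeout longer than the outage avoids the reboot. The likely trade-off is that a host that is genuinely stuck would then also wait that long before being reset.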
participants (4): Frank Wall; Murilo Morais; Patrick Hibbs; Volenbovskyi, Konstantin