Thanks for the help! I will definitely stay tuned for updates on this
matter.
Michael
On 09/24/2015 03:13 PM, Martin Perina wrote:
I created a bug covering this:
https://bugzilla.redhat.com/show_bug.cgi?id=1266099
----- Original Message -----
> From: "Martin Sivak" <msivak(a)redhat.com>
> To: "Michael Hölzl" <mh(a)ins.jku.at>
> Cc: "Martin Perina" <mperina(a)redhat.com>, users(a)ovirt.org
> Sent: Thursday, September 24, 2015 2:59:52 PM
> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>
> Hi Michael,
>
> Martin summed up the situation neatly; I would just add that this issue
> is not limited to the size of your setup. The same would happen to HA
> VMs running on the same host as the hosted engine even if the cluster
> had 50 hosts...
>
> About the recommended way of engine deployment: It really comes down to
> whether you can tolerate your engine being down for a longer time
> (starting another host using a backup db).
>
> Hosted engine restores your management in an automated way and without
> any data loss. However I agree that the fact that you have to tend to
> your HA VMs manually after an engine restart is not nice. Fortunately
> that should only happen when your host (or vdsm) dies and does not
> come up for an extended period of time.
>
> The summary would be: there will be no HA handling if the host
> running the engine is down, independently of whether the deployment is
> hosted engine or standalone engine. If the issue is related to the
> software only, then there is no real difference.
>
> - When a host with the standalone engine dies, the VMs are fine, but
> if anything happens while the engine is down (and reinstalling a
> standalone engine takes time + you need a very fresh db backup) you
> might again face issues with HA VMs being down or not starting when
> the engine comes up.
>
> - When a hosted engine dies because of a host failure, some VMs
> generally disappear with it. The engine will come up automatically and
> HA VMs from the original hosts have to be manually pushed to work.
> This requires some manual action, but I see it as less demanding than
> the first case.
>
> - When a hosted engine VM is stopped properly by the tooling it will
> be restarted elsewhere and it will be able to connect to the original
> host just fine. The engine will then make sure that all HA VMs are up
> even if the VMs died while the engine was down.
>
> So I would recommend a hosted engine based deployment. And I ask for a bit
> of patience, as we have a plan for how to mitigate the second case to some
> extent without compromising the fencing storm prevention.
>
> Best regards
>
> --
> Martin Sivak
> msivak(a)redhat.com
> SLA RHEV-M
>
>
> On Thu, Sep 24, 2015 at 2:31 PM, Michael Hölzl <mh(a)ins.jku.at> wrote:
>> Ok, thanks!
>>
>> So, I would still like to know whether you would recommend not using a
>> hosted engine but rather a separate machine for the engine?
>>
>> On 09/24/2015 01:24 PM, Martin Perina wrote:
>>> ----- Original Message -----
>>>> From: "Michael Hölzl" <mh(a)ins.jku.at>
>>>> To: "Martin Perina" <mperina(a)redhat.com>, "Eli Mesika"
>>>> <emesika(a)redhat.com>
>>>> Cc: "Doron Fediuck" <dfediuck(a)redhat.com>, users(a)ovirt.org
>>>> Sent: Thursday, September 24, 2015 12:35:13 PM
>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>
>>>> Hi,
>>>>
>>>> thanks for the detailed answer! In principle, I understand the issue
>>>> now. However, I cannot fully follow the argument that this is a corner
>>>> case. In a small or medium-sized company, I would assume that such a
>>>> setup, consisting of two machines with a hosted engine, is not uncommon.
>>>> Especially as there is documentation online which describes how to
>>>> deploy this setup. Does that mean that hosted engines are in general not
>>>> recommended?
>>>>
>>>> I am also wondering why the fencing could not be triggered by the hosted
>>>> engine after the DisableFenceAtStartupInSec timeout? In the events log
>>>> of the engine I keep on getting the message "Host hosted_engine_2 is not
>>>> responding. It will stay in Connecting state for a grace period of 120
>>>> seconds and after that an attempt to fence the host will be issued.",
>>>> which would indicate that the engine is actually trying to fence the non
>>>> responsive host.
>>> Unfortunately this message is a bit misleading: it's shown every time
>>> we start handling a network exception for the host, and it's fired before
>>> the logic that decides whether to start or skip the fencing process (this
>>> misleading message is fixed in 3.6). But in the current logic we really
>>> execute fencing only when the host status is about to change from
>>> Connecting to Non Responsive, and this happens only the first time, while
>>> we are still in the DisableFenceAtStartupInSec interval. During all other
>>> attempts the host is already in status Non Responsive, so fencing is
>>> skipped.
>>>
>>>> On 09/24/2015 11:50 AM, Martin Perina wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Eli Mesika" <emesika(a)redhat.com>
>>>>>> To: "Martin Perina" <mperina(a)redhat.com>, "Doron Fediuck"
>>>>>> <dfediuck(a)redhat.com>
>>>>>> Cc: "Michael Hölzl" <mh(a)ins.jku.at>, users(a)ovirt.org
>>>>>> Sent: Thursday, September 24, 2015 11:38:39 AM
>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Martin Perina" <mperina(a)redhat.com>
>>>>>>> To: "Michael Hölzl" <mh(a)ins.jku.at>
>>>>>>> Cc: users(a)ovirt.org
>>>>>>> Sent: Thursday, September 24, 2015 11:02:21 AM
>>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> sorry for the late response, but you hit a "corner case" :-(
>>>>>>>
>>>>>>> Let me start by explaining a few things first:
>>>>>>>
>>>>>>> After startup of the engine there's an interval during which fencing
>>>>>>> is disabled. It's called DisableFenceAtStartupInSec and by default it's
>>>>>>> set to 5 minutes. It can be changed using
>>>>>>>
>>>>>>> engine-config -s DisableFenceAtStartupInSec=<seconds>
>>>>>>>
>>>>>>> but please do that with caution.
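[Editorial note: for illustration, changing it could look roughly like this on the engine machine. This is a hedged sketch, not from the original mail: 600 seconds is just an example value, and engine-config changes only take effect after an engine restart.]

```shell
# Show the current startup fencing grace period (default is 5 minutes).
engine-config -g DisableFenceAtStartupInSec

# Example only: raise the grace period to 10 minutes.
engine-config -s DisableFenceAtStartupInSec=600

# engine-config changes only apply after the engine service is restarted
# (the engine VM in this thread runs CentOS 6, hence "service", not systemctl).
service ovirt-engine restart
```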
>>>>>>>
>>>>>>> Why do we have such a timeout? It prevents a fencing storm, which
>>>>>>> could happen during power issues in the whole DC: when both engine
>>>>>>> and hosts are started, it may take a lot of time for huge hosts to
>>>>>>> come up and for VDSM to start communicating with the engine. So
>>>>>>> usually the engine is started first, and without this interval the
>>>>>>> engine would start fencing hosts which are just starting ...
>>>>>>>
>>>>>>> Another thing: if we cannot properly fence the host, we cannot
>>>>>>> determine whether there is just a communication issue between engine
>>>>>>> and host, so we cannot restart HA VMs on another host. The only
>>>>>>> thing we can do is to offer the manual "Mark host as rebooted"
>>>>>>> option to the administrator. If the administrator executes this
>>>>>>> option, we try to restart HA VMs on a different host ASAP, because
>>>>>>> the admin took the responsibility of validating that the VMs are
>>>>>>> really not running.
>>>>>>>
>>>>>>>
>>>>>>> When the engine is started, the following actions related to fencing
>>>>>>> are taken:
>>>>>>>
>>>>>>> 1. Get the status of all hosts from the DB and schedule Non Responding
>>>>>>> Treatment after the DisableFenceAtStartupInSec timeout has passed
>>>>>>>
>>>>>>> 2. Try to communicate with all hosts and refresh their status
>>>>>>>
>>>>>>>
>>>>>>> If some host becomes Non Responsive during the
>>>>>>> DisableFenceAtStartupInSec interval, we skip fencing and the
>>>>>>> administrator will see a message in the Events tab that the host is
>>>>>>> Non Responsive, but fencing is disabled due to the startup interval.
>>>>>>> So the administrator has to take care of such a host manually.
>>>>>>>
>>>>>>>
>>>>>>> Now what happened in your case:
>>>>>>>
>>>>>>> 1. Hosted engine VM is running on host1 with other VMs
>>>>>>> 2. Status of host1 and host2 is Up
>>>>>>> 3. You kill/shutdown host1 -> hosted engine VM is also shut down ->
>>>>>>> no engine is running to detect the issue with host1 and change its
>>>>>>> status to Non Responsive
>>>>>>> 4. In the meantime the hosted engine VM is started on host2 -> it
>>>>>>> will read host status from the DB, but all hosts are up -> it will
>>>>>>> try to communicate with host1, but it's unreachable -> so it changes
>>>>>>> host1 status to Non Responsive and starts Non Responsive Treatment
>>>>>>> for host1 -> Non Responsive Treatment is aborted because the engine
>>>>>>> is still in the DisableFenceAtStartupInSec interval
>>>>>>>
>>>>>>>
>>>>>>> So in a normal deployment (without hosted engine) the admin is
>>>>>>> notified that the host where the engine is running crashed and was
>>>>>>> rebooted, so he has to take a look and do manual steps if needed.
>>>>>>>
>>>>>>> In a hosted engine deployment it's an issue because the hosted
>>>>>>> engine VM can be restarted on a different host also in cases other
>>>>>>> than crashes (for example if the host is overloaded, hosted engine
>>>>>>> can stop the hosted engine VM and restart it on a different host,
>>>>>>> but this shouldn't happen too often).
>>>>>>>
>>>>>>> At the moment the only solution for this is manual: let the
>>>>>>> administrator be notified that the hosted engine VM was restarted on
>>>>>>> a different host, so the administrator can check manually what the
>>>>>>> cause of this restart was and execute manual steps if needed.
>>>>>>>
>>>>>>> So to summarize: at the moment I don't see any reliable automatic
>>>>>>> solution for this :-( and fencing storm prevention is more
>>>>>>> important. But feel free to create a bug for this issue, maybe we
>>>>>>> can think of at least some improvement for this use case.
>>>>>> Thanks for the detailed explanation Martin.
>>>>>> Really a corner case, let's see if we get more input on that from
>>>>>> other users.
>>>>>> Maybe when the hosted engine VM is restarted on another node we can
>>>>>> ask for the reason and act accordingly.
>>>>>> Doron, with the current implementation, is the reason for the hosted
>>>>>> engine VM restart stored anywhere?
>>>>> I have already discussed this with Martin Sivak and hosted engine
>>>>> doesn't touch the engine db at all. We discussed with Martin this
>>>>> possible solution, which we could do in master and maybe in 3.6 if
>>>>> agreed:
>>>>>
>>>>> 1. Just after the start of the engine we can read from the db the name
>>>>>    of the host which the hosted engine VM is running on and store it
>>>>>    somewhere in memory for Non Responding Treatment
>>>>>
>>>>> 2. As a part of Non Responding Treatment we can add some hosted engine
>>>>>    specific logic:
>>>>>      IF we are running as hosted engine AND
>>>>>         we are inside the DisableFenceAtStartupInSec interval AND
>>>>>         the non responsive host is the host stored above in step 1. AND
>>>>>         the hosted engine VM is running on a different host
>>>>>      THEN
>>>>>         execute fencing for the non responsive host even when we are
>>>>>         inside the DisableFenceAtStartupInSec interval
>>>>>
>>>>> But it can cause an unnecessary fence in the case that the whole
>>>>> datacenter recovers from a power failure.
>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Martin Perina
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Michael Hölzl" <mh(a)ins.jku.at>
>>>>>>>> To: "Martin Perina" <mperina(a)redhat.com>
>>>>>>>> Cc: users(a)ovirt.org
>>>>>>>> Sent: Monday, September 21, 2015 4:47:06 PM
>>>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when
host with
>>>>>>>> engine
>>>>>>>> gets shutdown
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> The whole engine.log including the shutdown time (was performed
>>>>>>>> around 9:19):
>>>>>>>> http://pastebin.com/cdY9uTkJ
>>>>>>>>
>>>>>>>> vdsm.log of host01 (the host which kept on running and took over
>>>>>>>> the engine) split into 3 uploads (limit of 512 kB of pastebin):
>>>>>>>> 1 : http://pastebin.com/dr9jNTek
>>>>>>>> 2 : http://pastebin.com/cuyHL6ne
>>>>>>>> 3 : http://pastebin.com/7x2ZQy1y
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> On 09/21/2015 03:00 PM, Martin Perina wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> could you please post the whole engine.log (from the time when you
>>>>>>>>> turned off the host with the engine VM) and also vdsm.log from
>>>>>>>>> both hosts?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Martin Perina
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Michael Hölzl" <mh(a)ins.jku.at>
>>>>>>>>>> To: users(a)ovirt.org
>>>>>>>>>> Sent: Monday, September 21, 2015 10:27:08 AM
>>>>>>>>>> Subject: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> we are trying to set up an oVirt environment with two hosts, both
>>>>>>>>>> connected to an iSCSI storage device, a hosted engine and power
>>>>>>>>>> management configured over iLO. So far it seems to work fine in
>>>>>>>>>> our testing setup, and starting/stopping VMs works smoothly with
>>>>>>>>>> proper scheduling between those hosts. So we wanted to test HA
>>>>>>>>>> for the VMs now and started to manually shut down a host while
>>>>>>>>>> there are still VMs running on that machine (to simulate a power
>>>>>>>>>> failure or a kernel panic). The expected outcome was that all
>>>>>>>>>> machines where HA is enabled are booted again. This works if the
>>>>>>>>>> machine with the failure does not have the engine running. If the
>>>>>>>>>> machine with the hosted engine VM gets shut down, the host gets
>>>>>>>>>> into the "Not Responsive" state and all VMs end up in an unknown
>>>>>>>>>> state. However, the engine itself starts correctly on the second
>>>>>>>>>> host and it seems like it tries to fence the other host (as
>>>>>>>>>> expected) - events which we get in the Open Virtualization
>>>>>>>>>> Manager:
>>>>>>>>>> 1. Host hosted_engine_2 is non responsive
>>>>>>>>>> 2. Host hosted_engine_1 from cluster Default was chosen as a
>>>>>>>>>> proxy to execute Status command on Host hosted_engine_2.
>>>>>>>>>> 3. Host hosted_engine_2 became non responsive. It has no power
>>>>>>>>>> management configured. Please check the host status, manually
>>>>>>>>>> reboot it, and click "Confirm Host Has Been Rebooted"
>>>>>>>>>> 4. Host hosted_engine_2 is not responding. It will stay in
>>>>>>>>>> Connecting state for a grace period of 124 seconds and after that
>>>>>>>>>> an attempt to fence the host will be issued.
>>>>>>>>>>
>>>>>>>>>> Event 4 keeps coming continuously every 3 minutes. Complete
>>>>>>>>>> engine.log file during engine boot up:
>>>>>>>>>> http://pastebin.com/D6xS3Wfy
>>>>>>>>>> So the host detects that the machine is not responding and wants
>>>>>>>>>> to fence it. But although the host has power management
>>>>>>>>>> configured over iLO, the engine thinks that it does not. As a
>>>>>>>>>> result the second host does not get fenced and VMs are not
>>>>>>>>>> migrated to the running machine.
>>>>>>>>>> In the log files there are also a lot of timeout exceptions. But
>>>>>>>>>> I guess that this is because the host cannot connect to the other
>>>>>>>>>> machine.
>>>>>>>>>>
>>>>>>>>>> Did anybody face similar problems with HA? Or any clue what the
>>>>>>>>>> problem might be?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Michael
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ----
>>>>>>>>>> ovirt version: 3.5.4
>>>>>>>>>> Hosted engine VM OS: CentOS 6.5
>>>>>>>>>> Host Machines OS: CentOS 7
>>>>>>>>>>
>>>>>>>>>> P.S. We also have to note that we had problems with the command
>>>>>>>>>> fence_ipmilan at the beginning. We were receiving the message
>>>>>>>>>> "Unable to obtain correct plug status or plug is not available,"
>>>>>>>>>> whenever the command fence_ipmilan was called. However, the
>>>>>>>>>> command fence_ilo4 worked. So we use a simple script for
>>>>>>>>>> fence_ipmilan now that calls fence_ilo4 and passes the arguments.
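[Editorial note: a minimal sketch of such a wrapper, not from the original mail. The paths are assumptions: it writes to /tmp so it is harmless to run, whereas a real wrapper would have to sit wherever the fence proxy expects fence_ipmilan, e.g. /usr/sbin/fence_ipmilan.]

```shell
# Create a one-line stand-in for fence_ipmilan that simply delegates to
# fence_ilo4, forwarding stdin and all command-line arguments unchanged.
cat > /tmp/fence_ipmilan <<'EOF'
#!/bin/sh
exec /usr/sbin/fence_ilo4 "$@"
EOF
chmod +x /tmp/fence_ipmilan
```

Because fence agents read their options from stdin as well as argv, the `exec ... "$@"` form forwards both without modification.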
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Users mailing list
>>>>>>>>>> Users(a)ovirt.org
>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users