[Users] two node ovirt cluster with HA


Jaison peter

27 Jan 2014

7:12 a.m.

Hi all,

I was setting up a two-node oVirt cluster with the oVirt engine on a separate node. I completed the configuration and tested VM live migration without any issues. Then, to check cluster HA, I powered down one host and expected the VMs running on that host to be migrated to the other one. But nothing happened: the engine detected the host as unreachable and marked it as non-operational, and the VMs that had been running on that host went to an 'unknown' state.

Is it not possible to set up a fully HA oVirt cluster with two nodes, or is this a problem with my configuration? Please advise.

Thanks & Regards
Alex


Andrew Lau

27 Jan 2014

11:11 a.m.

Hi,

Have you got power management enabled? That's the fencing feature the engine needs in order to be sure that the host is actually offline. Without it, the engine won't restart the VMs anywhere else, to prevent potential VM corruption (e.g. the same VM running on multiple hosts).

Andrew.

Dafna Ron

12:11 p.m.

Powering off the host will never trigger VM migration. As far as the engine is concerned, it has just lost its connection to the host, and it has no way of telling whether the host is down or a router is down. Since VMs can continue running on the host even if the engine has no access to it, starting the VMs on the second host can cause split brain and data corruption.

The way the engine knows what's going on is by sending health-check queries to vdsm. Power management will try to reboot a host when the health checks to vdsm go unanswered. So if the engine gets no reply and has no way of rebooting the host, the host status is changed to Non-Responsive and the VMs are shown as unknown, because the engine has no way of knowing what is happening with them.

Since rebooting the host kills the VMs running on it, this will never cause any VM migration. However, together with the High-Availability VM feature, you will be able to have some of the VMs restarted on the second host after the host reboot (and only if Power Management confirmed that the reboot was successful).

VM migration is only triggered when:

1. The cluster configuration states that the VM should be migrated in case of failure.
2. The engine has access to the host, so the failure is on the storage side and not the host side.
3. The VMs are not actively writing (although there might be a new RFE for it).

Hope this clears things up.

Dafna
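The flow Dafna describes can be sketched roughly as follows. This is only an illustration of the reasoning in the message above, not engine code: the `Host` fields, the `handle_host` function, the `REBOOTED` status and the outcome strings are all hypothetical names, and treating non-HA VMs as simply "down" after a confirmed reboot is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class HostStatus(Enum):
    UP = "Up"
    NON_RESPONSIVE = "Non-Responsive"
    REBOOTED = "Rebooted"  # hypothetical label for "fenced successfully"


@dataclass
class Host:
    name: str
    answers_health_checks: bool     # does vdsm reply to the engine?
    power_management_enabled: bool  # is a fencing device configured?
    fence_succeeds: bool = True     # assumption: was the reboot confirmed?


def handle_host(host: Host, ha_vms: list[str], other_vms: list[str]):
    """Return (host status, per-VM outcome) for one health-check round."""
    vms = ha_vms + other_vms
    if host.answers_health_checks:
        return HostStatus.UP, {vm: "running" for vm in vms}
    if not (host.power_management_enabled and host.fence_succeeds):
        # No confirmed reboot: the VMs may still be running on the host,
        # so nothing is restarted elsewhere.
        return HostStatus.NON_RESPONSIVE, {vm: "unknown" for vm in vms}
    # Confirmed reboot: the VMs on that host are gone, so only the
    # highly available ones are restarted on another host.
    outcome = {vm: "restarted on another host" for vm in ha_vms}
    outcome.update({vm: "down" for vm in other_vms})
    return HostStatus.REBOOTED, outcome


# The situation from the original post: host powered off, no power management.
status, vms = handle_host(Host("host1", False, False), ["db"], ["test"])
print(status.value, vms)
```

Note that no branch migrates anything: a powered-off host either leaves its VMs in the unknown state or, after a confirmed fence, gets its HA VMs restarted elsewhere.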

Karli Sjöberg

12:48 p.m.

On Mon, 2014-01-27 at 11:11 +0000, Dafna Ron wrote:

Powering off the host will never trigger vm migration. As far as engine is concerned it just lost connection to the host, but has no way of telling if the host is down or if a router is down.

Can't it at least check with power management first whether the host is down? I mean, if the network is down there will be no response from either PM or the host. But if PM is up and can tell you that the host is down, that sounds rather clear-cut to me...

It seems to me the VMs would be restarted sooner if the flow were altered to first check with PM whether it's a network issue or a host issue and, if it is a host issue, to immediately restart the VMs on another host, instead of waiting for a potentially problematic host to boot up eventually.

/K


--
Best regards,
Karli Sjöberg
Swedish University of Agricultural Sciences
Box 7079 (Visiting Address Kronåsvägen 8)
S-750 07 Uppsala, Sweden
Phone: +46-(0)18-67 15 66
karli.sjoberg@slu.se
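Karli's suggested flow, as opposed to what the engine is described as doing in this thread, could be sketched like this. It is a minimal illustration of the proposal only; `PowerState`, `Action` and `proposed_action` are hypothetical names, and `None` stands for a power management device that did not answer.

```python
from enum import Enum
from typing import Optional


class PowerState(Enum):
    ON = "on"
    OFF = "off"


class Action(Enum):
    RESTART_VMS_NOW = "restart VMs on another host immediately"
    FENCE_THEN_RESTART = "reboot the host via PM, then restart VMs"
    CANNOT_DECIDE = "PM unreachable: host state unknown"


def proposed_action(pm_status: Optional[PowerState]) -> Action:
    """Decide what to do about a host that stopped answering the engine.

    pm_status is None when the power management device did not answer.
    """
    if pm_status is None:
        # Neither the host nor PM answers, so the host may still be running.
        return Action.CANNOT_DECIDE
    if pm_status is PowerState.OFF:
        # PM confirms the host is off, so its VMs cannot be running.
        return Action.RESTART_VMS_NOW
    # The host has power but does not answer: fence it first.
    return Action.FENCE_THEN_RESTART


for status in (PowerState.OFF, PowerState.ON, None):
    print(status, "->", proposed_action(status).value)
```

The only branch that differs from the described behaviour is the first check: a host that PM reports as off would skip the reboot and go straight to restarting VMs.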

Dafna Ron

1:05 p.m.

I am adding Tareq for the Power Management implementation.

Dafna

Tareq Alayan

1:43 p.m.

Hi,

Power management makes use of special *dedicated* hardware in order to restart hosts independently of the host OS. The engine connects to a power management device using a *dedicated* network IP address. The engine is capable of rebooting hosts that have entered a non-operational or non-responsive state. The abilities provided by all power management devices are: check status, start, stop and recycle (restart).

In the case of a non-responsive host, all of the VMs currently running on that host can also become non-responsive. However, the non-responsive host keeps its lock on the hard disk image of every VM it is running. Attempting to start a VM on a different host, and giving that second host write privileges on the VM's hard disk image, can cause data corruption.

Rebooting allows the engine to assume that the lock on a VM hard disk image has been released. Once the engine knows for sure, via the power management device, that the problematic host has been rebooted, it can start a VM from the problematic host on another host without risking data corruption.

Important note: a virtual machine that has been marked highly available cannot be safely started on a different host without the certainty that doing so will not cause data corruption.

N-joy,
--Tareq
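To make the four device abilities and the "reboot first, then start elsewhere" rule concrete, here is a small sketch. Everything in it is an assumption for illustration: `PowerManagementDevice`, `FakeDevice` and `fence_then_start` are hypothetical names, and the fake device is an in-memory stand-in rather than a real fencing agent.

```python
from typing import Callable, Protocol


class PowerManagementDevice(Protocol):
    """The four abilities listed above: status, start, stop, restart."""

    def status(self) -> str: ...
    def start(self) -> None: ...
    def stop(self) -> None: ...
    def restart(self) -> None: ...


class FakeDevice:
    """In-memory stand-in for a real device, for illustration only."""

    def __init__(self, reachable: bool = True) -> None:
        self.reachable = reachable
        self.power = "on"

    def _check(self) -> None:
        if not self.reachable:
            raise ConnectionError("power management device not reachable")

    def status(self) -> str:
        self._check()
        return self.power

    def start(self) -> None:
        self._check()
        self.power = "on"

    def stop(self) -> None:
        self._check()
        self.power = "off"

    def restart(self) -> None:
        self.stop()
        self.start()


def fence_then_start(device: PowerManagementDevice, vm: str,
                     start_vm: Callable[[str], None]) -> bool:
    """Start vm elsewhere only after the old host's reboot went through."""
    try:
        device.restart()  # a completed reboot means the disk lock is released
    except ConnectionError:
        return False      # no confirmation, so starting the VM is not safe
    start_vm(vm)
    return True


started: list[str] = []
print(fence_then_start(FakeDevice(), "vm1", started.append))
print(fence_then_start(FakeDevice(reachable=False), "vm2", started.append))
print(started)
```

The second call shows the case where the device cannot be reached: the function refuses to start the VM, which corresponds to the important note about highly available VMs.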

Andrew Lau

1:50 p.m.

Hi,

I think he was asking: what if the power management device reports that the host is powered off? Shouldn't the VMs then be brought back up, since being off is essentially the same as having gone through a power cycle/reboot?

Another case I can see is when the whole host loses power and its power management device becomes unavailable too (i.e. not reachable). Then you're stuck in a situation that requires manual intervention.

I would be interested to see something like a timeout on those problematic VMs (e.g. if nothing has been read or written after x amount of time), after which you could consider the host offline. I guess that adds a lot of risk, though.

Tareq Alayan

1:59 p.m.

Adding Eli.

Dafna Ron

2:02 p.m.

Andrew,

Once this discussion is finished, if what you would like done is not in the current implementation, can you please open a bug/feature request for it?

Thanks, Dafna

On 01/27/2014 12:59 PM, Tareq Alayan wrote:

...

Adding Eli.

On 01/27/2014 02:50 PM, Andrew Lau wrote:

...
Hi,

I think he was asking what if the power management device reported that the host was powered off. Then VMs should be brought back up as being off would essentially be the same as running a power cycle/reboot?

Another example I'm seeing is what happens if the whole host loses power and its power management device then becomes unavailable (i.e. not reachable): you're stuck in the case where it requires manual intervention.

I would be interested to potentially see something like a timeout on those problematic VMs (e.g. if nothing was read or written after x amount of time), after which you could consider the host as offline? I guess that adds a lot of risk, though.

On Mon, Jan 27, 2014 at 11:43 PM, Tareq Alayan <talayan@redhat.com> wrote:

Hi,

Power management makes use of special *dedicated* hardware in order to restart hosts independently of the host OS. The engine connects to a power management device using a *dedicated* network IP address. The engine is capable of rebooting hosts that have entered a non-operational or non-responsive state. The abilities provided by all power management devices are: check status, start, stop and recycle (restart)...

In the case of non-responsive host: all of the VMs that are currently running on that host can also become non-responsive. However, the non-responsive host keeps locking the VM hard disk for all VMs it is running. Attempting to start a VM on a different host and assign the second host write privileges for the virtual machine hard disk image can cause data corruption. Rebooting allows the engine to assume that the lock on a VM hard disk image has been released. The engine can know for sure that the problematic host has been rebooted via the power management device and then it can start a VM from the problematic host on another host without risking data corruption. Important note: A virtual machine that has been marked highly-available can not be safely started on a different host without the certainty that doing so will not cause data corruption.
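As an illustration only (this is not the engine's actual code), the rule described above can be sketched in Python. The names `FenceResult` and `can_restart_ha_vms` are made up for this sketch; the only point is that HA VMs are restarted elsewhere solely after the power management device confirms that the problematic host was restarted.

```python
from enum import Enum
from typing import Optional


class FenceResult(Enum):
    """Hypothetical outcome of asking the power management device to restart a host."""
    CONFIRMED = "confirmed"      # device reports that the restart succeeded
    FAILED = "failed"            # device answered, but the restart was not confirmed
    UNREACHABLE = "unreachable"  # device did not answer at all


def can_restart_ha_vms(host_responsive: bool,
                       fence_result: Optional[FenceResult]) -> bool:
    """Return True only when it is safe to start the host's HA VMs elsewhere.

    Assumption for this sketch: a confirmed restart is the only evidence
    that the old host has released its locks on the VM disk images.
    """
    if host_responsive:
        # The host still answers, so its VMs may still be running there.
        return False
    # No power management configured (None), a failed fence, or an
    # unreachable device all leave the disk locks in an unknown state.
    return fence_result is FenceResult.CONFIRMED
```

With this rule, a host that is simply powered off and has no power management configured (`fence_result=None`) never qualifies, which matches the behaviour reported at the start of the thread.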

N-joy,

--Tareq

On 01/27/2014 02:05 PM, Dafna Ron wrote:

I am adding Tareq for the Power Management implementation.

Dafna

On 01/27/2014 11:48 AM, Karli Sjöberg wrote:

On Mon, 2014-01-27 at 11:11 +0000, Dafna Ron wrote:

Powering off the host will never trigger vm migration. As far as engine is concerned it just lost connection to the host, but has no way of telling if the host is down or if a router is down.

Can't it at least check with power management if the Host status is down first?

I mean, if the network is down there will be no response from either PM or Host. But if PM is up and can tell you that the Host is down, sounds rather clear cut to me...

Seems to me the VMs would be restarted sooner if the flow was altered to first check with PM if it's a network or Host issue, and if Host issue, immediately restart VMs on another Host, instead of waiting for a potentially problematic Host to boot up eventually.

/K

since vm's can continue running on the host even if engine has no access to it, starting the vm's on the second host can cause split brain and data corruption.

The way that the engine knows what's going on is by sending health check queries to vdsm. Power management will try to reboot a host when the health checks to vdsm are not answered. So... if the engine gets no reply and has no way of rebooting the host, the host status will be changed to Non-Responsive and the VMs will be unknown, because the engine has no way of knowing what's happening with the VMs. Since a reboot of the host will kill the VMs running on it, this will never cause any VM migration, but... along with the High-Availability VM feature, you will be able to have some of the VMs restarted on the second host after the host reboot (and that is only if Power Management was confirmed as successful).

VM migration is only triggered when:

1. Cluster configuration states that the VM should be migrated in case of failure.
2. Engine has access to the host, so the failure is on the storage side and not the host side.
3. The VMs are not actively writing (although there might be a new RFE for it).
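The three conditions above can be read as a single predicate. A minimal sketch, assuming hypothetical field names (`cluster_migrate_on_failure`, `engine_can_reach_host`, `vm_actively_writing`) that are not taken from the oVirt code base:

```python
from dataclasses import dataclass


@dataclass
class MigrationContext:
    """Hypothetical snapshot of the facts the three conditions refer to."""
    cluster_migrate_on_failure: bool  # 1. cluster policy asks for migration on failure
    engine_can_reach_host: bool       # 2. engine still has access to the host
    vm_actively_writing: bool         # 3. the VM is currently writing to its disk


def migration_allowed(ctx: MigrationContext) -> bool:
    """All three conditions must hold for a migration to be triggered."""
    return (ctx.cluster_migrate_on_failure
            and ctx.engine_can_reach_host
            and not ctx.vm_actively_writing)
```

A powered-off host fails the second condition, so under this reading the VMs can only be restarted after fencing, never migrated.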

hope this clears things up

Dafna

On 01/27/2014 10:11 AM, Andrew Lau wrote:

Hi,

Have you got power management enabled?

That's the fencing feature required for the engine to ensure that the host is actually offline. It won't resume any other VMs to prevent potential VM corruption (eg. VM running on multiple hosts).

Andrew.

On Jan 27, 2014 5:12 PM, "Jaison peter" <urotrip2@gmail.com> wrote:

Hi all ,

I was setting up a two-node oVirt cluster with the oVirt engine on a separate node. I completed the configuration and tested VM live migrations without any issues. Then, to check cluster HA, I powered down one host and expected the VMs running on that host to be migrated to the other one. But nothing happened: the engine detected the host as unreachable and marked it as non-operational, and the VMs that ran on that host went to 'unknown state'. Is it not possible to set up a fully HA oVirt cluster with two nodes, or is it a problem with my configuration? Please advise.

Thanks & Regards

Alex


-- Dafna Ron

Andrew Lau

28 Jan

2:12 p.m.

On Tue, Jan 28, 2014 at 12:02 AM, Dafna Ron <dron@redhat.com> wrote:

...

Andrew, once this discussion is finished, if what you would like done is not in the current implementation, can you please open a bug/feature request for it?

Sure - I've opened a RFE here based on the current discussions https://bugzilla.redhat.com/show_bug.cgi?id=1058737 but I'm not sure which category it should be under. Cheers, Andrew.

Dafna Ron

6:09 p.m.

On 01/28/2014 01:12 PM, Andrew Lau wrote:

...

On Tue, Jan 28, 2014 at 12:02 AM, Dafna Ron <dron@redhat.com> wrote:

Andrew, once this discussion is finished, if what you would like done is not in the current implementation, can you please open a bug/feature request for it?

Sure - I've opened a RFE here based on the current discussions https://bugzilla.redhat.com/show_bug.cgi?id=1058737 but I'm not sure which category it should be under.

Cheers, Andrew.

Thanks Andrew! I really appreciate it :) -- Dafna Ron

Eli Mesika

29 Jan

9:04 a.m.

----- Original Message -----

...

From: "Andrew Lau" <andrew@andrewklau.com>
To: dron@redhat.com
Cc: "Tareq Alayan" <talayan@redhat.com>, "Eli Mesika" <emesika@redhat.com>, "Karli Sjöberg" <Karli.Sjoberg@slu.se>, users@ovirt.org
Sent: Tuesday, January 28, 2014 3:12:46 PM
Subject: Re: [Users] two node ovirt cluster with HA

On Tue, Jan 28, 2014 at 12:02 AM, Dafna Ron <dron@redhat.com> wrote:

...
Andrew, once this discussion is finished, if what you would like done is not in the current implementation, can you please open a bug/feature request for it?

Sure - I've opened a RFE here based on the current discussions https://bugzilla.redhat.com/show_bug.cgi?id=1058737 but I'm not sure which category it should be under.

I have assigned it to infra, thanks. IMHO we should handle only the first scenario reported in this BZ.

...

Cheers, Andrew.

Eli Mesika

27 Jan

4:40 p.m.

----- Original Message -----

...

From: "Tareq Alayan" <talayan@redhat.com>
To: "Andrew Lau" <andrew@andrewklau.com>, "Eli Mesika" <emesika@redhat.com>
Cc: dron@redhat.com, "Karli Sjöberg" <Karli.Sjoberg@slu.se>, users@ovirt.org
Sent: Monday, January 27, 2014 2:59:02 PM
Subject: Re: [Users] two node ovirt cluster with HA

Adding Eli.

I just want to summarize the requirement as I understand it:

In the case that a Host that is running HA VMs and has PM configured is turned off manually:

1) The non-responsive treatment should be modified to check the Host status via the PM agent
2) If the Host is off, HA VMs will attempt to run on another host ASAP
3) The host status should be set to DOWN
4) No attempt to restart vdsm (soft fencing) or restart the host (hard fencing) will be done

Is the above correct? If so, an RFE for that can be opened.
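To make the four points of that summary concrete, here is a rough Python sketch of the proposed flow. It is only a reading of the requirement, not engine code; `PowerStatus`, `handle_non_responsive` and the keys of the returned dictionary are invented for the example.

```python
from enum import Enum
from typing import Dict, List


class PowerStatus(Enum):
    """Hypothetical answer from the PM agent's 'check status' operation."""
    ON = "on"
    OFF = "off"
    UNKNOWN = "unknown"  # agent unreachable or gave no usable answer


def handle_non_responsive(power_status: PowerStatus,
                          ha_vms: List[str]) -> Dict[str, object]:
    """Decide what to do with a host that stopped answering health checks."""
    if power_status is PowerStatus.OFF:
        # Points 2-4: the host is known to be off, so mark it DOWN, run the
        # HA VMs elsewhere right away and skip both soft and hard fencing.
        return {
            "host_status": "DOWN",
            "restart_on_other_host": list(ha_vms),
            "attempt_fencing": False,
        }
    # Host is on, or the PM agent cannot tell: keep the existing treatment
    # (assumed here to mean: mark Non-Responsive and try fencing first).
    return {
        "host_status": "NON_RESPONSIVE",
        "restart_on_other_host": [],
        "attempt_fencing": True,
    }
```

Note that the `UNKNOWN` branch deliberately changes nothing: if the PM device itself is unreachable, the sketch falls back to today's behaviour rather than guessing.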

...


Jaison peter

28 Jan

6:33 a.m.

Thank you all for your valuable feedback. Can you please list some of the fencing devices supported in oVirt?

On Mon, Jan 27, 2014 at 9:10 PM, Eli Mesika <emesika@redhat.com> wrote:

...


Eli Mesika

9:34 a.m.

----- Original Message -----

...

From: "Jaison peter" <urotrip2@gmail.com>
To: "Eli Mesika" <emesika@redhat.com>
Cc: users@ovirt.org, "Tareq Alayan" <talayan@redhat.com>
Sent: Tuesday, January 28, 2014 7:33:35 AM
Subject: Re: [Users] two node ovirt cluster with HA

Thank you all for your valuable feedback .

Can you please specify some of the supported fencing devices in ovirt ?

For oVirt 3.4: apc, apc_snmp, bladecenter, cisco_ucs, drac5, drac7, eps, hpblade, ilo, ilo2, ilo3, ilo4, ipmilan, rsa, rsb, wti
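If you want to check a planned power management type against that list before configuring a host, something like the following would do. The agent names are copied from the list above; the constant and function names are just for this example, and the list only reflects what was stated for oVirt 3.4.

```python
# Fence agent types quoted in this thread for oVirt 3.4.
OVIRT_34_FENCE_AGENTS = frozenset(
    "apc,apc_snmp,bladecenter,cisco_ucs,drac5,drac7,eps,hpblade,"
    "ilo,ilo2,ilo3,ilo4,ipmilan,rsa,rsb,wti".split(",")
)


def is_supported_fence_agent(name: str) -> bool:
    """Return True if `name` is one of the agent types listed for oVirt 3.4."""
    return name.strip().lower() in OVIRT_34_FENCE_AGENTS
```

For instance, `is_supported_fence_agent("ipmilan")` is true, while a type that is not in the quoted list is rejected.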

...

On Mon, Jan 27, 2014 at 9:10 PM, Eli Mesika <emesika@redhat.com> wrote:

...

Jaison peter

9:40 a.m.

Thanks!

On Tue, Jan 28, 2014 at 2:04 PM, Eli Mesika <emesika@redhat.com> wrote:

----- Original Message -----

From: "Jaison peter" <urotrip2@gmail.com>
To: "Eli Mesika" <emesika@redhat.com>
Cc: users@ovirt.org, "Tareq Alayan" <talayan@redhat.com>
Sent: Tuesday, January 28, 2014 7:33:35 AM
Subject: Re: [Users] two node ovirt cluster with HA

Thank you all for your valuable feedback.

Can you please specify some of the supported fencing devices in oVirt?

For oVirt 3.4:

apc,apc_snmp,bladecenter,cisco_ucs,drac5,drac7,eps,hpblade,ilo,ilo2,ilo3,ilo4,ipmilan,rsa,rsb,wti


Karli Sjöberg

9:30 a.m.

Sent from my iPhone

On 27 Jan 2014, at 16:40, "Eli Mesika" <emesika@redhat.com> wrote:

----- Original Message -----

From: "Tareq Alayan" <talayan@redhat.com>
To: "Andrew Lau" <andrew@andrewklau.com>, "Eli Mesika" <emesika@redhat.com>
Cc: dron@redhat.com, "Karli Sjöberg" <Karli.Sjoberg@slu.se>, users@ovirt.org
Sent: Monday, January 27, 2014 2:59:02 PM
Subject: Re: [Users] two node ovirt cluster with HA

Adding Eli.

I just want to summarize the requirement as I understand it:

In the case that a host that is running HA VMs and has PM configured is turned off manually:

1) The non-responsive treatment should be modified to check the host status via the PM agent.
2) If the host is off, HA VMs will attempt to run on another host ASAP.
3) The host status should be set to DOWN.
4) No attempt to restart vdsm (soft fencing) or restart the host (hard fencing) will be done.

Is the above correct? If so, an RFE on that can be opened.

Spot on, that's exactly what I was trying to say! I'd very much like to see an RFE for that.

/K
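To make the proposed flow concrete, here is a minimal Python sketch of the four points above. It describes the requested behaviour only, not what the engine does today. `PmStatus` and `handle_non_responsive` are made-up names. The branches for a host that PM reports as on, and for an unreachable PM device, are assumptions added for completeness and are not part of the proposal.

```python
from enum import Enum
from typing import Dict, List, Optional


class PmStatus(Enum):
    ON = "on"
    OFF = "off"


def handle_non_responsive(pm_status: Optional[PmStatus],
                          ha_vms: List[str]) -> Dict[str, object]:
    """Sketch of the proposed non-responsive treatment.

    pm_status is None when the power management device cannot be reached.
    """
    if pm_status is PmStatus.OFF:
        # Points 2-4: host is known to be off, so HA VMs can start elsewhere
        # right away, the host is set to Down, and no fencing is attempted.
        return {"host_status": "Down",
                "start_elsewhere": list(ha_vms),
                "fence": False}
    if pm_status is PmStatus.ON:
        # Assumption: fall back to the existing treatment (soft, then hard fencing).
        return {"host_status": "Non-Responsive",
                "start_elsewhere": [],
                "fence": True}
    # Assumption: PM unreachable, so nothing can be concluded about the host.
    return {"host_status": "Non-Responsive",
            "start_elsewhere": [],
            "fence": False}
```

For example, `handle_non_responsive(PmStatus.OFF, ["vm1", "vm2"])` marks the host Down and lists both VMs for restart on another host, while an unreachable PM device leaves the host Non-Responsive until someone intervenes.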

On 01/27/2014 02:50 PM, Andrew Lau wrote:

Hi,

I think he was asking what happens if the power management device reports that the host is powered off. Then the VMs should be brought back up, as being off would essentially be the same as running a power cycle/reboot?

Another example I'm seeing is what happens if the whole host loses power and its power management device then becomes unavailable (i.e. not reachable); then you're stuck in the case where it requires manual intervention.

I would be interested to potentially see something like a timeout on those problematic VMs (e.g. if nothing was read or written after x amount of time), after which you could consider the host as offline? I guess that adds a lot of risk, though.


Karli Sjöberg

27 Jan

1:54 p.m.

On Mon, 2014-01-27 at 14:43 +0200, Tareq Alayan wrote:

Hi,

Power management makes use of special *dedicated* hardware in order to restart hosts independently of the host OS. The engine connects to a power management device using a *dedicated* network IP address. The engine is capable of rebooting hosts that have entered a non-operational or non-responsive state. The abilities provided by all power management devices are: check status, start, stop and recycle (restart)...

In the case of a non-responsive host, all of the VMs that are currently running on that host can also become non-responsive. However, the non-responsive host keeps its lock on the VM hard disk for every VM it is running. Attempting to start a VM on a different host and assigning the second host write privileges for the virtual machine hard disk image can cause data corruption.

Exactly! If the engine were to first check with power management that the problematic/non-responsive host is indeed down, there would be no risk of data corruption. That's why I suggested a change in the HA flow: first check whether the host is indeed down and, if so, just start the VMs on another host.

/K

Rebooting allows the engine to assume that the lock on a VM hard disk image has been released. The engine can know for sure that the problematic host has been rebooted via the power management device, and then it can start a VM from the problematic host on another host without risking data corruption. Important note: a virtual machine that has been marked highly available cannot be safely started on a different host without the certainty that doing so will not cause data corruption.

N-joy,

--Tareq

On 01/27/2014 02:05 PM, Dafna Ron wrote:

I am adding Tareq for the Power Management implementation.

Dafna

On 01/27/2014 11:48 AM, Karli Sjöberg wrote:

On Mon, 2014-01-27 at 11:11 +0000, Dafna Ron wrote:

Powering off the host will never trigger VM migration. As far as the engine is concerned, it just lost connection to the host, but it has no way of telling whether the host is down or a router is down.

Can't it at least check with power management if the host status is down first?

I mean, if the network is down there will be no response from either PM or the host. But if PM is up and can tell you that the host is down, that sounds rather clear-cut to me...

It seems to me the VMs would be restarted sooner if the flow were altered to first check with PM whether it's a network or host issue and, if it's a host issue, immediately restart the VMs on another host, instead of waiting for a potentially problematic host to eventually boot up.

/K


Eli Mesika

30 Jan

12:01 p.m.

----- Original Message -----

From: "Tareq Alayan" <talayan@redhat.com>
To: dron@redhat.com, "Karli Sjöberg" <Karli.Sjoberg@slu.se>
Cc: users@ovirt.org
Sent: Monday, January 27, 2014 2:43:29 PM
Subject: Re: [Users] two node ovirt cluster with HA

Hi,

Power management makes use of special *dedicated* hardware in order to restart hosts independently of the host OS. The engine connects to a power management device using a *dedicated* network IP address. The engine is capable of rebooting hosts that have entered a non-operational or non-responsive state,

Non-operational is related to storage issues, so the host will not be restarted by PM in this case.

The abilities provided by all power management devices are: check status, start, stop and recycle (restart)...

Only status, start and stop are provided; restart is implemented as stop -> wait for off status -> start -> wait for on status.
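The restart sequence can be sketched roughly as follows. This is an illustrative Python sketch, not engine or fence-agent code: `FenceAgent`, `wait_for`, `restart` and `FakeAgent` are hypothetical names, and the retry count and delay are arbitrary assumptions.

```python
import time
from typing import List, Protocol


class FenceAgent(Protocol):
    """The three primitive operations a power management device offers."""
    def status(self) -> str: ...   # returns "on" or "off"
    def start(self) -> None: ...
    def stop(self) -> None: ...


def wait_for(agent: FenceAgent, wanted: str,
             retries: int = 5, delay: float = 0.0) -> bool:
    """Poll the device until it reports the wanted status, or give up."""
    for _ in range(retries):
        if agent.status() == wanted:
            return True
        time.sleep(delay)
    return False


def restart(agent: FenceAgent, retries: int = 5, delay: float = 0.0) -> bool:
    """Restart built from primitives: stop, wait for off, start, wait for on."""
    agent.stop()
    if not wait_for(agent, "off", retries, delay):
        return False  # never confirmed off, so do not power on again
    agent.start()
    return wait_for(agent, "on", retries, delay)


class FakeAgent:
    """In-memory stand-in for a real device, used only to exercise the sketch."""
    def __init__(self) -> None:
        self.power = "on"
        self.calls: List[str] = []

    def status(self) -> str:
        return self.power

    def start(self) -> None:
        self.calls.append("start")
        self.power = "on"

    def stop(self) -> None:
        self.calls.append("stop")
        self.power = "off"
```

Calling `restart(FakeAgent())` goes through stop and then start and returns `True` once the device reports "on" again. The confirmed "off" step in the middle is what lets the engine treat the fence as successful.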

In the case of a non-responsive host, all of the VMs that are currently running on that host can also become non-responsive. However, the non-responsive host keeps its lock on the VM hard disk for every VM it is running. Attempting to start a VM on a different host and assigning the second host write privileges for the virtual machine hard disk image can cause data corruption. Rebooting allows the engine to assume that the lock on a VM hard disk image has been released. The engine can know for sure that the problematic host has been rebooted via the power management device, and then it can start a VM from the problematic host on another host without risking data corruption. Important note: a virtual machine that has been marked highly available cannot be safely started on a different host without the certainty that doing so will not cause data corruption.

N-joy,

--Tareq



participants (6)

Andrew Lau
Dafna Ron
Eli Mesika
Jaison peter
Karli Sjöberg
Tareq Alayan