[Users] two node ovirt cluster with HA
Karli Sjöberg
Karli.Sjoberg at slu.se
Tue Jan 28 08:30:30 UTC 2014
Sent from my iPhone
> On 27 Jan 2014, at 16:40, "Eli Mesika" <emesika at redhat.com> wrote:
>
>
>
> ----- Original Message -----
>> From: "Tareq Alayan" <talayan at redhat.com>
>> To: "Andrew Lau" <andrew at andrewklau.com>, "Eli Mesika" <emesika at redhat.com>
>> Cc: dron at redhat.com, "Karli Sjöberg" <Karli.Sjoberg at slu.se>, users at ovirt.org
>> Sent: Monday, January 27, 2014 2:59:02 PM
>> Subject: Re: [Users] two node ovirt cluster with HA
>>
>> Adding Eli.
>
> I just want to summarize the requirement as I understand it:
>
> In the case that a Host that is running HA VMs and has PM configured is turned off manually:
>
> 1) The non-responsive treatment should be modified to check the Host status via the PM agent
> 2) If the Host is off, HA VMs will attempt to run on another host ASAP
> 3) The host status should be set to DOWN
> 4) No attempt to restart vdsm (soft fencing) or restart the host (hard fencing) will be made
>
> Is the above correct? If so, an RFE for that can be opened.
Spot on, that's exactly what I was trying to say! I'd very much like to see an RFE for that.
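Roughly what I have in mind, matching the four points above, in pseudo-Python
(every name below is made up purely for illustration; this is not real engine
code):

    # Proposed treatment of a non-responsive host that has PM configured
    # (all names here are hypothetical).
    def handle_non_responsive_host(host):
        # 1) Ask the power management agent for the host's power status.
        if host.power_management.get_status() == "off":
            # Host is confirmed off, so its VM disk image locks are released.
            host.set_status("DOWN")              # 3) mark the host DOWN
            for vm in host.vms:                  # 2) restart HA VMs ASAP
                if vm.highly_available:
                    restart_on_another_host(vm)
            # 4) No soft fencing (vdsm restart) or hard fencing (reboot).
        else:
            # PM reports "on" or is unreachable: keep today's behaviour,
            # i.e. the soft-fence / hard-fence escalation.
            run_existing_fencing_flow(host)
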
/K
>
>>
>>
>>> On 01/27/2014 02:50 PM, Andrew Lau wrote:
>>> Hi,
>>>
>>> I think he was asking what happens if the power management device reports
>>> that the host is powered off. Shouldn't the VMs then be brought back up,
>>> since being off is essentially the same as having run a power cycle/reboot?
>>>
>>> Another example I'm seeing is when the whole host loses power and its
>>> power management device then becomes unavailable (i.e. not reachable);
>>> then you're stuck in a situation that requires manual intervention.
>>>
>>> I would be interested to see something like a timeout on those
>>> problematic VMs (e.g. if nothing has been read or written after x amount
>>> of time, you could consider the host offline)? I guess that adds a lot
>>> of risk, though...
>>>
>>>
>>> On Mon, Jan 27, 2014 at 11:43 PM, Tareq Alayan <talayan at redhat.com> wrote:
>>>
>>> Hi,
>>>
>>>     Power management makes use of special *dedicated* hardware in
>>>     order to restart hosts independently of the host OS. The engine
>>>     connects to a power management device using a *dedicated* network
>>>     IP address.
>>>     The engine is capable of rebooting hosts that have entered a
>>>     non-operational or non-responsive state.
>>>     The abilities provided by all power management devices are: check
>>>     status, start, stop and recycle (restart)...
>>>
>>>     In the case of a non-responsive host: all of the VMs that are
>>>     currently running on that host can also become non-responsive.
>>>     However, the non-responsive host keeps holding the lock on the hard
>>>     disk image of every VM it is running. Attempting to start a VM on a
>>>     different host and assigning the second host write privileges for the
>>>     virtual machine hard disk image can cause data corruption.
>>>     Rebooting allows the engine to assume that the lock on a VM hard
>>>     disk image has been released.
>>>     The engine can know for sure, via the power management device, that
>>>     the problematic host has been rebooted, and then it can start a VM
>>>     from the problematic host on another host without risking data
>>>     corruption.
>>>     Important note: a virtual machine that has been marked
>>>     highly-available cannot be safely started on a different host
>>>     without the certainty that doing so will not cause data corruption.
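>>>
>>>     As a very rough sketch of that rule (the attribute names below are
>>>     made up for illustration; this is not the actual engine code):
>>>
>>>         def safe_to_start_elsewhere(failed_host):
>>>             # Only a power-off or a reboot confirmed by the PM device
>>>             # guarantees the disk image locks have been released.
>>>             pm = failed_host.power_management
>>>             return pm.confirmed_off or pm.confirmed_reboot
>>>
>>>         # A highly-available VM may be started on another host only when
>>>         # safe_to_start_elsewhere(failed_host) is True; otherwise manual
>>>         # confirmation that the host was rebooted is needed first.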
>>>
>>> N-joy,
>>>
>>> --Tareq
>>>
>>>
>>>
>>>
>>> On 01/27/2014 02:05 PM, Dafna Ron wrote:
>>>
>>> I am adding Tareq for the Power Management implementation.
>>>
>>> Dafna
>>>
>>>
>>> On 01/27/2014 11:48 AM, Karli Sjöberg wrote:
>>>
>>> On Mon, 2014-01-27 at 11:11 +0000, Dafna Ron wrote:
>>>
>>>                     Powering off the host will never trigger VM migration.
>>>                     As far as the engine is concerned, it has just lost
>>>                     connection to the host, and it has no way of telling
>>>                     whether the host is down or a router is down.
>>>
>>>                     Can't it at least check with power management whether
>>>                     the Host status is down first?
>>>
>>>                     I mean, if the network is down there will be no response
>>>                     from either PM or Host. But if PM is up and can tell you
>>>                     that the Host is down, that sounds rather clear-cut to
>>>                     me...
>>>
>>>                     It seems to me the VMs would be restarted sooner if the
>>>                     flow were altered to first check with PM whether it's a
>>>                     network or a Host issue, and if it is a Host issue,
>>>                     immediately restart the VMs on another Host instead of
>>>                     waiting for a potentially problematic Host to boot up
>>>                     eventually.
>>>
>>> /K
>>>
>>>                     Since VMs can continue running on the host even if the
>>>                     engine has no access to it, starting the VMs on the
>>>                     second host can cause split brain and data corruption.
>>>
>>>                     The way the engine knows what's going on is by sending
>>>                     health check queries to vdsm.
>>>                     Power management will try to reboot a host when the
>>>                     health checks to vdsm are not answered.
>>>                     So... if the engine gets no reply and has no way of
>>>                     rebooting the host, the host status will be changed to
>>>                     Non-Responsive and the VMs will be Unknown, because the
>>>                     engine has no way of knowing what is happening with
>>>                     them.
>>>                     Since a reboot of the host kills the VMs running on it,
>>>                     this will never cause any VM migration, but with the
>>>                     High-Availability VM feature you will be able to have
>>>                     some of the VMs restarted on the second host after the
>>>                     host reboot (and only if Power Management was confirmed
>>>                     as successful).
>>>
>>>                     VM migration is only triggered when (roughly the check
>>>                     sketched below):
>>>                     1. The cluster configuration states that the VM should
>>>                     be migrated in case of failure.
>>>                     2. The engine has access to the host, so the failure is
>>>                     on the storage side and not the host side.
>>>                     3. The VMs are not actively writing (although there
>>>                     might be a new RFE for that).
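>>>
>>>                     As a toy illustration only (made-up names, not real
>>>                     engine code):
>>>
>>>                         def should_migrate(vm, cluster, host_reachable):
>>>                             # all three conditions above must hold
>>>                             return (cluster.migrate_on_failure
>>>                                     and host_reachable
>>>                                     and not vm.actively_writing)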
>>>
>>> hope this clears things up
>>>
>>> Dafna
>>>
>>>
>>>
>>> On 01/27/2014 10:11 AM, Andrew Lau wrote:
>>>
>>> Hi,
>>>
>>> Have you got power management enabled?
>>>
>>>                         That's the fencing feature the engine requires to
>>>                         ensure that the host is actually offline. It won't
>>>                         resume any of the VMs, to prevent potential VM
>>>                         corruption (e.g. a VM running on multiple hosts).
>>>
>>> Andrew.
>>>
>>>                         On Jan 27, 2014 5:12 PM, "Jaison peter"
>>>                         <urotrip2 at gmail.com> wrote:
>>>
>>>                             Hi all,
>>>
>>>                             I was setting up a two-node oVirt cluster with
>>>                             the oVirt engine on a separate node. I completed
>>>                             the configuration and tested VM live migration
>>>                             without any issues. Then, to check cluster HA, I
>>>                             powered down one host and expected the VMs
>>>                             running on that host to be migrated to the other
>>>                             one. But nothing happened: the engine detected
>>>                             the host as unreachable, marked it as
>>>                             non-operational, and the VMs that ran on that
>>>                             host went to 'unknown' state. Is it not possible
>>>                             to set up a fully HA oVirt cluster with two
>>>                             nodes, or is this a problem with my
>>>                             configuration? Please advise.
>>>
>>> Thanks & Regards
>>>
>>> Alex
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Dafna Ron
>>
>>