[Users] two node ovirt cluster with HA

Karli Sjöberg Karli.Sjoberg at slu.se
Tue Jan 28 08:30:30 UTC 2014



Sent from my iPhone

> On 27 Jan 2014, at 16:40, "Eli Mesika" <emesika at redhat.com> wrote:
> 
> 
> 
> ----- Original Message -----
>> From: "Tareq Alayan" <talayan at redhat.com>
>> To: "Andrew Lau" <andrew at andrewklau.com>, "Eli Mesika" <emesika at redhat.com>
>> Cc: dron at redhat.com, "Karli Sjöberg" <Karli.Sjoberg at slu.se>, users at ovirt.org
>> Sent: Monday, January 27, 2014 2:59:02 PM
>> Subject: Re: [Users] two node ovirt cluster with HA
>> 
>> Adding Eli.
> 
> I just want to summarize the requirement as I understand it:
> 
> In the case that a Host that is running HA VMs and has PM configured is turned off manually:
> 
> 1) The non-responsive treatment should be modified to check Host status via the PM agent
> 2) If the Host is off, HA VMs will attempt to run on another host ASAP
> 3) The host status should be set to DOWN
> 4) No attempt to restart vdsm (soft fencing) or restart the host (hard fencing) will be made
> 
> Is the above correct? If so, an RFE for that can be opened.

Spot on, that's exactly what I was trying to say! I'd very much like to see an RFE for that.
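
To make the four points above concrete, here is a minimal sketch of the decision as I read it. It is only an illustration: `PowerStatus`, `handle_non_responsive_host` and the action names are made up for this mail and are not engine code or API. The "PM unreachable" branch is my assumption, based on Andrew's remark that this case needs manual intervention.

```python
from enum import Enum


class PowerStatus(Enum):
    """What the power management (PM) agent reports. Hypothetical type."""
    ON = "on"
    OFF = "off"
    UNREACHABLE = "unreachable"  # the PM agent itself did not answer


def handle_non_responsive_host(pm_status, has_ha_vms):
    """Return the actions to take for a host that stopped answering.

    Action names are illustrative strings, not real engine commands.
    """
    if pm_status is PowerStatus.OFF:
        # PM confirms the host is off: skip soft and hard fencing,
        # restart HA VMs elsewhere and mark the host DOWN.
        actions = []
        if has_ha_vms:
            actions.append("restart_ha_vms_on_other_host")
        actions.append("set_host_down")
        return actions
    if pm_status is PowerStatus.ON:
        # Host has power but vdsm is silent: fall back to fencing
        # (soft first, then hard, as an assumed order).
        return ["soft_fence_vdsm", "hard_fence_host"]
    # PM unreachable: cannot tell a network failure from a host failure.
    return ["set_host_non_responsive", "wait_for_manual_intervention"]


if __name__ == "__main__":
    print(handle_non_responsive_host(PowerStatus.OFF, has_ha_vms=True))
```

The point of the sketch is that only a positive "off" answer from PM skips fencing. Anything else keeps the cautious behaviour, so the data corruption concern raised earlier in the thread still holds.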

/K

> 
>> 
>> 
>>> On 01/27/2014 02:50 PM, Andrew Lau wrote:
>>> Hi,
>>> 
>>> I think he was asking what if the power management device reported
>>> that the host was powered off. Then VMs should be brought back up as
>>> being off would essentially be the same as running a power cycle/reboot?
>>> 
>>> Another example I'm seeing is what happens if the whole host loses
>>> power and its power management device then becomes unavailable (i.e.
>>> not reachable): then you're stuck in a case that requires manual
>>> intervention.
>>> 
>>> I would be interested to potentially see something like a timeout on
>>> those problematic VMs (e.g. if nothing was read or written after x amount
>>> of time), after which you could consider the host as offline? I guess
>>> that adds a lot of risk, though...
>>> 
>>> 
>>> On Mon, Jan 27, 2014 at 11:43 PM, Tareq Alayan <talayan at redhat.com> wrote:
>>> 
>>>    Hi,
>>> 
>>>    Power management makes use of special *dedicated* hardware in
>>>    order to restart hosts independently of the host OS. The engine
>>>    connects to a power management device using a *dedicated* network
>>>    IP address.
>>>    The engine is capable of rebooting hosts that have entered a
>>>    non-operational or non-responsive state.
>>>    The abilities provided by all power management devices are: check
>>>    status, start, stop and recycle (restart)...
>>> 
>>>    In the case of a non-responsive host: all of the VMs that are
>>>    currently running on that host can also become non-responsive.
>>>    However, the non-responsive host keeps its lock on the hard disk
>>>    image of every VM it is running. Attempting to start a VM on a
>>>    different host and assigning the second host write privileges for the
>>>    virtual machine hard disk image can cause data corruption.
>>>    Rebooting allows the engine to assume that the lock on a VM hard
>>>    disk image has been released.
>>>    The engine can know for sure that the problematic host has been
>>>    rebooted via the power management device and then it can start a
>>>    VM from the problematic host on another host without risking data
>>>    corruption.
>>>    Important note: A virtual machine that has been marked
>>>    highly-available cannot be safely started on a different host
>>>    without the certainty that doing so will not cause data corruption.
>>> 
>>>    N-joy,
>>> 
>>>    --Tareq
>>> 
>>> 
>>> 
>>> 
>>>    On 01/27/2014 02:05 PM, Dafna Ron wrote:
>>> 
>>>        I am adding Tareq for the Power Management implementation.
>>> 
>>>        Dafna
>>> 
>>> 
>>>        On 01/27/2014 11:48 AM, Karli Sjöberg wrote:
>>> 
>>>            On Mon, 2014-01-27 at 11:11 +0000, Dafna Ron wrote:
>>> 
>>>                Powering off the host will never trigger vm migration.
>>>                As far as engine is concerned it just lost connection
>>>                to the host, but
>>>                has no way of telling if the host is down or if a
>>>                router is down.
>>> 
>>>            Can't it at least check with power management if the Host
>>>            status is down
>>>            first?
>>> 
>>>            I mean, if the network is down there will be no response
>>>            from either PM
>>>            or Host. But if PM is up and can tell you that the Host is
>>>            down, sounds
>>>            rather clear cut to me...
>>> 
>>>            Seems to me the VM's would be restarted sooner if the flow
>>>            was altered
>>>            to first check with PM if it's a network or Host issue,
>>>            and if Host
>>>            issue, immediately restart VM's on another Host, instead
>>>            of waiting for
>>>            a potentially problematic Host to boot up eventually.
>>> 
>>>            /K
>>> 
>>>                since vm's can continue running on the host even if
>>>                engine has no access
>>>                to it, starting the vm's on the second host can cause
>>>                split brain and
>>>                data corruption.
>>> 
>>>                The way that the engine knows what's going on is by
>>>                sending health check
>>>                queries to the vdsm.
>>>                Power management will try to reboot a host when the
>>>                health checks to
>>>                vdsm are not answered.
>>>                So... if engine gets no reply and has no way of
>>>                rebooting the host, the
>>>                host status will be changed to Non-Responsive and the
>>>                vm's will be
>>>                unknown because engine has no way of knowing what's
>>>                happening with the
>>>                vm's.
>>>                Since reboot of the host will kill the vm's running on
>>>                it - this will
>>>                never cause any vm migration but... along with the
>>>                High-Availability vm
>>>                feature, you will be able to have some of the vm's
>>>                re-started on the
>>>                second host after the host reboot (and that is only if
>>>                Power Management
>>>                was confirmed as successful).
>>> 
>>>                VM migration is only triggered when:
>>>                1. Cluster configuration states that the vm should be
>>>                migrated in case
>>>                of failure
>>>                2. Engine has access to the host - so the failure is
>>>                on the storage side
>>>                and not the host side.
>>>                3. the vms are not actively writing (although there
>>>                might be a new RFE
>>>                for it).
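
The three conditions listed above can be read as a single predicate. A minimal sketch, assuming hypothetical names (`VmContext`, `migration_triggered` and its fields) that are not part of the engine:

```python
from dataclasses import dataclass


@dataclass
class VmContext:
    """Inputs to the migration decision. All field names are hypothetical."""
    cluster_migrates_on_failure: bool  # condition 1: cluster configuration
    engine_can_reach_host: bool        # condition 2: failure is storage-side
    vm_actively_writing: bool          # condition 3: VM must not be writing


def migration_triggered(ctx: VmContext) -> bool:
    """True only when all three conditions from the list hold."""
    return (
        ctx.cluster_migrates_on_failure
        and ctx.engine_can_reach_host
        and not ctx.vm_actively_writing
    )


if __name__ == "__main__":
    # A powered-off host: the engine cannot reach it, so no migration.
    print(migration_triggered(VmContext(True, False, False)))
```

With a powered-off host the engine has no access to it, so the second condition fails and no migration is triggered, which matches what was observed in the original report.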
>>> 
>>>                hope this clears things up
>>> 
>>>                Dafna
>>> 
>>> 
>>> 
>>>                On 01/27/2014 10:11 AM, Andrew Lau wrote:
>>> 
>>>                    Hi,
>>> 
>>>                    Have you got power management enabled?
>>> 
>>>                    That's the fencing feature required for the engine
>>>                    to ensure that the
>>>                    host is actually offline. It won't resume any
>>>                    other VMs to prevent
>>>                    potential VM corruption (eg. VM running on
>>>                    multiple hosts).
>>> 
>>>                    Andrew.
>>> 
>>>                    On Jan 27, 2014 5:12 PM, "Jaison peter"
>>>                    <urotrip2 at gmail.com> wrote:
>>> 
>>>                         Hi all,
>>> 
>>>                         I was setting up a two-node oVirt cluster with
>>>                         the oVirt engine on a separate node. I completed
>>>                         the configuration and tested VM live migrations
>>>                         without any issues. Then, to check cluster HA, I
>>>                         powered down one host and expected the VMs
>>>                         running on that host to be migrated to the other
>>>                         one. But nothing happened: the engine detected
>>>                         the host as unreachable and marked it as
>>>                         non-operational, and the VMs that ran on that
>>>                         host went to 'unknown state'. Is it not possible
>>>                         to set up a fully HA oVirt cluster with two
>>>                         nodes? Or is it a problem with my configuration?
>>>                         Please advise.
>>> 
>>>                         Thanks & Regards
>>> 
>>>                         Alex
>>> 
>>>                         _______________________________________________
>>>                         Users mailing list
>>>                         Users at ovirt.org
>>>                    http://lists.ovirt.org/mailman/listinfo/users
>>> 
>>> 
>>> 
>>> 
>>> 
>>>                --
>>>                Dafna Ron
>> 
>> 


