Re: [Users] High Availability

16 Apr 2013

      -----Original message-----
...
From: suporte@logicworks.pt <suporte@logicworks.pt>
Sent: Tuesday 16th April 2013 14:03
To: Gianluca Cecchi <gianluca.cecchi@gmail.com>
Cc: René Koch <r.koch@ovido.at>; users <Users@ovirt.org>
Subject: Re: [Users] High Availability
Well, we also disconnected the iLO NIC cable. We did another test and disconnected all the NIC cables except the iLO one, and voilà, HA took about 3 minutes to migrate the VM to the other host. We also noticed that the manager rebooted the failed host. For a more realistic scenario, we disconnected the power cable from the host; after about 2 or 3 minutes the manager put the host in the non-responsive state and the VM in the unknown state. Is this the correct behavior?
Fencing means that the non-responsive host gets reset (powered off and on).
If fencing isn't working (because you disconnected the power cable, so the iLO can't send back a success message), the VMs won't get started on another host.
In your example this may seem strange, but let's have a look at the following scenario:
- You have 2 datacenters with 1 hypervisor in DC 1 and 1 hypervisor in DC 2; ovirt-engine is running in DC 1
- The connection between the DCs is lost
- Fencing isn't working
- The VM is running on the host in DC 2
- If the VM were started on the host in DC 1 without successful fencing, your VM disk would be corrupted (the hosts in DC 2 and DC 1 would both be writing to the same storage file)

Maybe there are better examples than this one (it would be interesting to know what your storage metro-cluster does in this split-brain situation), but I hope it's clear to you why fencing works the way it does and what could happen if it were less restrictive...
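
To make the rule above concrete, here is a minimal sketch in Python of the decision as described in this thread. It is not ovirt-engine code; the names FenceResult and may_restart_ha_vm are hypothetical and only illustrate that a VM is restarted elsewhere after a confirmed fence or a manual confirmation, never when the fence result is unknown.

```python
from enum import Enum


class FenceResult(Enum):
    """Hypothetical outcomes of a fencing attempt (illustration only)."""
    CONFIRMED = "confirmed"      # power-management device reported success
    UNREACHABLE = "unreachable"  # no answer, e.g. power cable pulled
    MANUAL = "manual"            # admin confirmed the host has been rebooted


def may_restart_ha_vm(host_responsive: bool, fence: FenceResult) -> bool:
    """Return True only when starting the VM on another host is safe."""
    if host_responsive:
        # The host still answers, so there is nothing to recover.
        return False
    # Without proof that the old host is down, it may still be writing
    # to the VM disk, so an unknown fence result blocks the restart.
    return fence in (FenceResult.CONFIRMED, FenceResult.MANUAL)


if __name__ == "__main__":
    # Power cable pulled: the device cannot report success.
    print(may_restart_ha_vm(False, FenceResult.UNREACHABLE))
    # Only data NICs pulled: the device confirms the reboot.
    print(may_restart_ha_vm(False, FenceResult.CONFIRMED))
```

In the pulled-power-cable test the result stays "unreachable", which matches the host going non-responsive and the VM staying in unknown state until someone confirms the reboot manually.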

Regards,
René
...
Regards
Jose
----- Original message -----
From: "Gianluca Cecchi" <gianluca.cecchi@gmail.com>
To: suporte@logicworks.pt
Cc: "René Koch (ovido)" <r.koch@ovido.at>, "users" <Users@ovirt.org>
Sent: Tuesday, 16 April 2013 12:12:43
Subject: Re: [Users] High Availability
On Tue, Apr 16, 2013 at 12:56 PM, suporte wrote:
...
Hi,
We have 2 Fujitsu servers and one iSCSI storage domain. The servers have power management configured with iLO3.
We can live migrate a VM, and when we reboot the host running that VM, it migrates to the other host.
To test high availability we disconnected all NIC cables of the VM's host. The VM did not migrate to the other host; we had to manually confirm that the host had been rebooted, and then the migration happened.
Is this the correct behavior? Do we have to manually confirm that the host has been rebooted for HA to happen?
Regards
Jose
Hello,
when you say "we disconnected all NIC cables" you mean "we
disconnected all NIC cables but the ones connected to the iLO
interface", correct?
Because to know that it has successfully fenced the problematic
host, the fencing host has to send a get-status message and see that
it is off or that it has been successfully rebooted.
For example, in RHCS, if you configure iLO as a fencing device, it
remains indefinitely in a state similar to
wait for fence to complete
if the "fencer" is not able to get an acknowledgement of the operation
or to reach the other node's iLO.
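
As a rough illustration of that status check, here is a small Python sketch of a "fencer" polling for confirmation. It is not RHCS or oVirt code: wait_for_fence and the get_power_status callback are hypothetical, and the callback is assumed to return "off", "on", or None when the management interface cannot be reached. A timeout is used here so the example terminates, unlike the indefinite wait described above.

```python
import time
from typing import Callable, Optional


def wait_for_fence(get_power_status: Callable[[], Optional[str]],
                   timeout_s: float,
                   interval_s: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the power status until the host is reported off.

    Returns True once "off" is seen, False if the timeout expires
    without a confirmation (including an unreachable device).
    """
    polls = max(1, int(timeout_s // interval_s))
    for _ in range(polls):
        status = get_power_status()  # None means no answer from the device
        if status == "off":
            return True
        sleep(interval_s)
    return False


if __name__ == "__main__":
    # Management interface unplugged: every query goes unanswered.
    print(wait_for_fence(lambda: None, timeout_s=3, sleep=lambda s: None))
```

With the management cable or the power cable unplugged, every query goes unanswered, so the fence is never confirmed.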
Probably you can find something in your logs...
Gianluca