Re: [Users] High Availability

16 Apr 2013

      ...
From: suporte@logicworks.pt <suporte@logicworks.pt>
Sent: Tuesday 16th April 2013 14:03
To: Gianluca Cecchi <gianluca.cecchi@gmail.com>
Cc: René Koch <r.koch@ovido.at>; users <Users@ovirt.org>
Subject: Re: [Users] High Availability

Well, we also disconnected the iLO NIC cable. We did another test in which we disconnected all the NIC cables except the iLO NIC cable, and voilà, HA took about 3 minutes to migrate the VM to the other host. We also noticed that the manager rebooted the failed host. For a more realistic scenario we disconnected the power cable from the host and after about 2 or 3 minutes
...

Regards
Jose

----- Original message -----
From: "Gianluca Cecchi" <gianluca.cecchi@gmail.com>
To: suporte@logicworks.pt
Cc: "René Koch (ovido)" <r.koch@ovido.at>, "users" <Users@ovirt.org>
Sent: Tuesday, 16 April 2013 12:12:43
Subject: Re: [Users] High Availability

On Tue, Apr 16, 2013 at 12:56 PM, suporte wrote:
...
Hi,

We have 2 Fujitsu servers and one iSCSI storage domain. The servers have the power management configured with ilo3.
We can live migrate a VM and when rebooting the host of that VM it does
...
...

For testing high availability we disconnected all NIC cables of the VM host, the VM does not migrate to the other host, we had to manually confirm

Yes, fencing must be working, otherwise HA does not work. So to cover the case of a power supply failure, do we need a server with a redundant power supply?

----- Original Message -----

From: "René Koch" <r.koch@ovido.at>
To: suporte@logicworks.pt, "Gianluca Cecchi" <gianluca.cecchi@gmail.com>
Cc: "users" <Users@ovirt.org>
Sent: Tuesday, 16 April 2013 13:31:48
Subject: RE: [Users] High Availability

-----Original message-----
the manager put the host in non-responsive and the VM in unknown state. Is this the correct behavior?

Fencing means that the non-responsive host gets reset (powered off and on).
If fencing isn't working (as you disconnected the power cable and so iLO can't send a success message), the VMs won't get started on another host.
In your example this seems strange, but let's have a look at the following scenario:
- You have 2 datacenters with 1 hypervisor in DC 1 and 1 hypervisor in DC 2, ovirt-engine is running in DC 1
- Connection between the DCs is lost
- Fencing isn't working
- VM is running on host in DC 2
- If the VM were started on the host in DC 1 without successful fencing, your VM disk would be corrupted (the hosts in DC 2 and DC 1 would both be writing to the same storage file)

Maybe there are better examples than this one (it would be interesting to know what your storage metro-cluster does in this split-brain situation), but I hope it's clear to you why fencing works the way it does and what could happen if it were less restrictive...
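
To make the rule above concrete, here is a minimal sketch of the decision, not oVirt's actual code. The names FenceResult and may_restart_ha_vms are made up for illustration; the only thing taken from this thread is the behavior: HA VMs are restarted elsewhere only after a confirmed fence or a manual confirmation that the host has been rebooted.

```python
from enum import Enum


class FenceResult(Enum):
    """Hypothetical outcomes of a fence attempt (names are illustrative)."""
    SUCCESS = "success"          # power management confirmed the reset
    FAILED = "failed"            # fence device answered, but the reset failed
    UNREACHABLE = "unreachable"  # no answer at all, e.g. power cable pulled


def may_restart_ha_vms(fence_result: FenceResult,
                       manually_confirmed: bool = False) -> bool:
    """Return True only when starting the HA VMs elsewhere cannot cause
    two hosts to write to the same disk image."""
    if fence_result is FenceResult.SUCCESS:
        return True
    # Without a confirmed fence the old host might still be running the VM,
    # so only an explicit confirmation by an administrator unblocks HA.
    return manually_confirmed
```

With the power cable pulled, the fence device is unreachable, so this returns False until somebody confirms the reboot by hand, which matches what was observed in the test.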

Regards,
René

 the migration to the other host.
 the host has been rebooted, and then migration happens.
...
...

Is this the correct behavior? Do we have to manually confirm that the host has been rebooted for HA to happen?

Regards
Jose

Hello,
when you say "we disconnected all NIC cables" you mean "we
disconnected all NIC cables but the ones connected to the iLO
interface", correct?
Because to know that one host has successfully fenced the problematic
one, it has to send a get status message and see that it is off or
that it has been successfully rebooted.....

For example in RHCS if you configure iLO as a fencing device it
remains indefinitely in a state similar to

wait for fence to complete

if the "fencer" is not able to get an acknowledgement about the operation
or to reach the other node's iLO.
Probably you can find something in your logs...

Gianluca

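
As a rough illustration of the status check described above, the sketch below polls a fence device until it reports the node as off. It is only a sketch under assumptions: get_power_status is a hypothetical callable (not a real iLO or RHCS API) that returns "on", "off", or None when the device cannot be reached, and only the "off" case is modelled. Unlike the RHCS behavior mentioned above, which waits indefinitely, this version gives up after a fixed number of attempts.

```python
import time
from typing import Callable, Optional


def wait_for_fence(get_power_status: Callable[[], Optional[str]],
                   attempts: int = 5,
                   delay_seconds: float = 1.0) -> bool:
    """Poll the fence device and return True once it reports 'off'.

    get_power_status is a hypothetical callable returning 'on', 'off',
    or None when the device cannot be reached.
    """
    for attempt in range(attempts):
        status = get_power_status()
        if status == "off":
            return True  # fence confirmed, safe to recover the VMs
        # 'on' or None: no confirmation yet, wait before asking again
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False  # never confirmed, so the node must not be treated as fenced
```

If the management interface is unplugged as well, every poll returns None and the function returns False, which is the situation where a manual confirmation is needed.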

suporte＠logicworks.pt