
Yes, fencing must be working, otherwise HA does not work. So, to cover a
power supply failure, do we need a server with a redundant power supply?

----- Original Message -----
From: "René Koch" <r.koch@ovido.at>
To: suporte@logicworks.pt, "Gianluca Cecchi" <gianluca.cecchi@gmail.com>
Cc: "users" <Users@ovirt.org>
Sent: Tuesday, 16 April 2013 13:31:48
Subject: RE: [Users] High Availability

-----Original message-----
> From: suporte@logicworks.pt <suporte@logicworks.pt>
> Sent: Tuesday 16th April 2013 14:03
> To: Gianluca Cecchi <gianluca.cecchi@gmail.com>
> Cc: René Koch <r.koch@ovido.at>; users <Users@ovirt.org>
> Subject: Re: [Users] High Availability
>
> Well, we also disconnected the iLO NIC cable. We did another test and
> disconnected all the NIC cables except the iLO NIC cable, and voilà, HA
> took about 3 minutes to migrate the VM to the other host. We also
> noticed that the manager rebooted the failed host. For a more realistic
> scenario we disconnected the power cable from the host, and after about
> 2 or 3 minutes the manager put the host in non-responsive state and the
> VM in unknown state. Is this the correct behavior?

Fencing means that the non-responsive host gets reset (powered off and on).
If fencing isn't working (as you disconnected the power cable, so iLO can't
send a success message), the VMs won't get started on another host.
This may seem strange in your example, but let's have a look at the
following scenario:
- You have 2 datacenters with 1 hypervisor in DC 1 and 1 hypervisor in DC 2;
  ovirt-engine is running in DC 1
- The connection between the DCs is lost
- Fencing isn't working
- The VM is running on the host in DC 2
- If the VM were started on the host in DC 1 without successful fencing,
  your VM disk would be corrupted (the hosts in DC 2 and DC 1 would both be
  writing to the same storage file)

Maybe there are better examples than this one (it would be interesting to
know what your storage metro-cluster does in this split-brain situation),
but I hope it's clear why fencing works the way it does and what could
happen if it were less restrictive...

Regards,
René

> Regards
> Jose
>
> ----- Original Message -----
> From: "Gianluca Cecchi" <gianluca.cecchi@gmail.com>
> To: suporte@logicworks.pt
> Cc: "René Koch (ovido)" <r.koch@ovido.at>, "users" <Users@ovirt.org>
> Sent: Tuesday, 16 April 2013 12:12:43
> Subject: Re: [Users] High Availability
>
> On Tue, Apr 16, 2013 at 12:56 PM, suporte wrote:
> > Hi,
> >
> > We have 2 Fujitsu servers and one iSCSI storage domain. The servers
> > have power management configured with ilo3.
> > We can live migrate a VM, and when rebooting the host of that VM it
> > does the migration to the other host.
> >
> > For testing high availability we disconnected all NIC cables of the VM
> > host. The VM does not migrate to the other host; we had to manually
> > confirm that the host has been rebooted, and then the migration happens.
> >
> > Is this the correct behavior? Do we have to manually confirm that the
> > host has been rebooted for HA to happen?
> >
> > Regards
> > Jose
>
> Hello,
> when you say "we disconnected all NIC cables" you mean "we
> disconnected all NIC cables but the ones connected to the iLO
> interface", correct?
> Because to know that one host has successfully fenced the problematic
> one, it has to send a get status message and see that it is off or
> that it has been successfully rebooted.
>
> For example, in RHCS, if you configure iLO as a fencing device it
> remains indefinitely in a state similar to
>
> wait for fence to complete
>
> if the "fencer" is not able to get an acknowledgement of the operation
> or to reach the other node's iLO.
> Probably you can find something in your logs...
>
> Gianluca
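
The rule described in this thread (HA VMs are restarted on another host only once the failed host is known to be down, either through a successful fence or through a manual "host has been rebooted" confirmation) can be sketched roughly as below. This is only an illustration of the decision logic, not oVirt engine code: `Host`, `FenceResult` and `may_restart_ha_vms` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FenceResult(Enum):
    """Outcome of asking the power-management device to reset a host (hypothetical)."""
    SUCCESS = "success"          # device confirmed the host was powered off / rebooted
    UNREACHABLE = "unreachable"  # device did not answer, e.g. power or iLO cable pulled


@dataclass
class Host:
    name: str
    responsive: bool
    fence_result: Optional[FenceResult] = None  # None: fencing not attempted yet
    manually_confirmed_rebooted: bool = False   # admin confirmed the reboot by hand


def may_restart_ha_vms(host: Host) -> bool:
    """Return True only when it is safe to start the host's HA VMs elsewhere."""
    if host.responsive:
        # Host is healthy: its VMs are still running there, nothing to restart.
        return False
    if host.fence_result is FenceResult.SUCCESS:
        # Fence confirmed: the old host cannot be writing to the VM disks any more.
        return True
    # Fence failed or was never confirmed: restarting could cause split-brain,
    # so wait until an administrator confirms the host has been rebooted.
    return host.manually_confirmed_rebooted


# The three tests from the thread, expressed with the hypothetical model above.
nics_pulled = Host("host1", responsive=False, fence_result=FenceResult.SUCCESS)
power_pulled = Host("host1", responsive=False, fence_result=FenceResult.UNREACHABLE)
power_pulled_confirmed = Host(
    "host1",
    responsive=False,
    fence_result=FenceResult.UNREACHABLE,
    manually_confirmed_rebooted=True,
)
```

With this model, the test with the iLO cable still connected allows a restart, the pulled power cable blocks it, and the manual confirmation unblocks it again, which matches the behavior reported above.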