Re: [Users] Testing High Availability and Power outages

13 Jan 2013

      ------=_Part_3933742_65602238.1358067259763
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

----- Original Message -----
...
From: "Alexandru Vladulescu" <avladulescu@bfproject.ro>
To: "Doron Fediuck" <dfediuck@redhat.com>
Cc: "users" <users@ovirt.org>
Sent: Sunday, January 13, 2013 10:46:41 AM
Subject: Re: [Users] Testing High Availability and Power outages
...
Dear Doron,
...
I haven't collected the logs from the tests, but I would gladly re-do
the case and get back to you asap.
...
This feature is the main reason I chose oVirt in the first place,
over other virt environments.
...
Could you please tell me which logs I should focus on besides the
engine log: vdsm, maybe, or other relevant logs?
...
Regards,
Alex
...
--
Sent from phone.
...
On 13.01.2013, at 09:56, Doron Fediuck < dfediuck@redhat.com > wrote:
...
...
----- Original Message -----

...
...
...
From: "Alexandru Vladulescu" < avladulescu@bfproject.ro >
...
...
To: "users" < users@ovirt.org >
...
...
Sent: Friday, January 11, 2013 2:47:38 PM
...
...
Subject: [Users] Testing High Availability and Power outages

...
...
...
Hi,

...
...
...
Today I started testing the High Availability features and the fence
mechanism on my oVirt 3.1 installation (from dreyou repos), running on
3 x CentOS 6.3 hypervisors.

...
...
...
Since, as I reported yesterday in a previous email thread, the
migration priority queue cannot be increased (bug) in this version,
I decided to test what the official documentation says about the
High Availability cases.

...
...
...
It would be a disaster scenario if one hypervisor had a power outage
or hardware problem and the VMs running on it did not move to other
spare resources.

...
...
...
The official documentation on ovirt.org states the following:
...
...
High availability

...
...
...
Allows critical VMs to be restarted on another host in the event of
hardware failure with three levels of priority, taking into account
resiliency policy.

...
...
...
* Resiliency policy to control high availability VMs at the cluster
  level.
...
...
* Supports application-level high availability with supported fencing
  agents.

...
...
...
As well as in the Architecture description:

...
...
...
High Availability - restart guest VMs from failed hosts automatically
on other hosts

...
...
...
The test went like this: one VM running Linux, with the "High
Available" check box ticked and "Priority for Run/Migration queue:"
set to Low. Under Host, "Any Host in Cluster" is selected, and "Allow
VM migration only upon Admin specific request" is left unchecked.

...
...
...
My environment:

...
...
...
Configuration: 2 x Hypervisors (same cluster/hardware configuration);
1 x Hypervisor also acting as a NAS (NFS) server (different
cluster/hardware configuration)

...
...
...
Actions: I cut the power to one of the hypervisors in the 2-node
cluster while the VM was running on it. This simulates a power outage.

...
...
...
Results: The hypervisor node that suffered the outage shows as Non
Responsive in the Status column of the Hosts tab, and the VM shows a
question mark and cannot be powered off or otherwise acted on (so it
is stuck).

...
...
...
In the Log console in GUI, I get:

...
...
...
Host Hyper01 is non-responsive.
...
...
VM Web-Frontend01 was set to the Unknown status.

...
...
...
There was nothing I could do besides clicking "Confirm Host has been
rebooted" on Hyper01; afterwards the VM started on Hyper02 with a
cold reboot.

...
...
...
The Log console changes to:

...
...
...
Vm Web-Frontend01 was shut down due to Hyper01 host reboot or manual fence
...
...
All VMs' status on Non-Responsive Host Hyper01 were changed to 'Down' by admin@internal
...
...
Manual fencing for host Hyper01 was started.
...
...
VM Web-Frontend01 was restarted on Host Hyper02

...
...
...
I would like your take on this problem. From reading the documentation
and features pages on the official website, I assumed this would be an
automatic mechanism driven by some sort of vdsm and engine fencing
action. Am I missing something?

...
...
...
Thank you for your patience reading this.

...
...
...
Regards,
...
...
Alex.

...
...
...
_______________________________________________
...
...
Users mailing list
...
...
Users@ovirt.org
...
...
http://lists.ovirt.org/mailman/listinfo/users

...
...
Hi Alex,
...
Can you share with us the engine's log from the relevant time period?

...
...
Doron
...
Hi Alex, 
The engine log is the important one, as it shows the decision-making process.
VDSM logs should be kept in case something is unclear, but I suggest we begin
with engine.log.
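
To cut engine.log down to the relevant time period before sharing it, something like the following could work. This is only a minimal sketch: it assumes each log entry starts with a timestamp such as "2013-01-11 14:30:05,123" and that the file is at the usual /var/log/ovirt-engine/engine.log, so please check both on your installation. The function name lines_in_window is made up for this example.

```python
import sys
from datetime import datetime, timedelta

# Assumed layout of the timestamp at the start of each engine.log entry,
# e.g. "2013-01-11 14:30:05,123 INFO ..." (adjust if your log differs).
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LENGTH = 19  # length of "YYYY-MM-DD HH:MM:SS"


def lines_in_window(lines, center, minutes=15):
    """Yield lines whose leading timestamp is within +/- `minutes` of `center`.

    Lines without a parsable timestamp (for example stack-trace
    continuations) are kept only if the previous timestamped line
    was inside the window.
    """
    start = center - timedelta(minutes=minutes)
    end = center + timedelta(minutes=minutes)
    inside = False
    for line in lines:
        try:
            ts = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            # Continuation line: follow the entry it belongs to.
            if inside:
                yield line
            continue
        inside = start <= ts <= end
        if inside:
            yield line


# Example invocation (script name is hypothetical):
#   python engine_window.py /var/log/ovirt-engine/engine.log "2013-01-11 14:30:00"
if __name__ == "__main__" and len(sys.argv) == 3:
    path, when = sys.argv[1], sys.argv[2]
    center = datetime.strptime(when, TS_FORMAT)
    with open(path, errors="replace") as fh:
        sys.stdout.writelines(lines_in_window(fh, center))
```

Pass the approximate time at which the host was powered off as the second argument; the 15-minute default window is an arbitrary choice and can be widened if the fencing decision shows up later in the log.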

------=_Part_3933742_65602238.1358067259763--