<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 01/14/2013 10:13 AM, Doron Fediuck

      wrote:<br>

    </div>

    <blockquote

      cite="mid:1145936324.4069952.1358151211713.JavaMail.root@redhat.com"

      type="cite">

      <style type="text/css">p { margin: 0; }</style>

      <div style="font-family: times new roman,new york,times,serif;

        font-size: 12pt; color: #000000"><br>

        <br>

        <hr id="zwchr">

        <blockquote style="border-left:2px solid rgb(16, 16,

255);margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From:

          </b>"Alexandru Vladulescu" <a class="moz-txt-link-rfc2396E" href="mailto:avladulescu@bfproject.ro">&lt;avladulescu@bfproject.ro&gt;</a><br>

          <b>To: </b>"Doron Fediuck" <a class="moz-txt-link-rfc2396E" href="mailto:dfediuck@redhat.com">&lt;dfediuck@redhat.com&gt;</a><br>

          <b>Cc: </b>"users" <a class="moz-txt-link-rfc2396E" href="mailto:users@ovirt.org">&lt;users@ovirt.org&gt;</a><br>

          <b>Sent: </b>Sunday, January 13, 2013 9:49:25 PM<br>

          <b>Subject: </b>Re: [Users] Testing High Availability and

          Power outages<br>

          <br>

          <div class="moz-cite-prefix"><br>

            Dear Doron,<br>

            <br>

            <br>

            I have now retested the case and am writing to share the
            results.<br>

            <br>

            In case this information is useful to you, my network setup
            is the following: 2 Layer 2 switches (Zyxel es2108-g
            &amp; ES2200-8) configured with 2 VLANs (1 inside
            backbone network, added as br0 to Ovirt; 1 outside network,
            running on the ovirtmgmt interface for Internet traffic to
            VMs). The backbone switch is gigabit capable, and each
            host runs with jumbo frames. There is also a firewall
            server that routes between the subnets through a trunk port
            and VLAN configuration. The Ovirt software has been set up
            on the backbone network subnet.<br>

            <br>

            As you can guess, the network infrastructure is not the
            problem here.<br>

            <br>

            The test case was the same as described before:<br>

            <br>

            1. VM running on Hyper01, none on Hyper02. The High
            Available check box was configured.<br>
            2. Manually cut the power to Hyper01 at the power network (no
            soft/manual shutdown).<br>
            3. After a while, Ovirt marks Hyper01 as Non Responsive.<br>
            4. Manually clicked on Confirm host reboot, and after Ovirt's
            manual fence of Hyper01 the VM starts on the Hyper02 host.<br>

            <br>

            I have attached the engine log. The Confirm Host reboot
            was done at precisely 21:31:45. In the cluster section
            in Ovirt, I did try changing the "Resilience Policy"
            attribute from "Migrate Virtual Machines" to "Migrate only
            High Available Virtual Machines", but with the same results.<br>

            <br>

            <br>

            As far as I can tell from the engine log, the Node Controller
            sees the Hyper01 node as having a "network fault" (no route
            to host), although it was actually shut down. <br>
            <br>
            Is this supposed to be the default behavior in this case,
            given that the scenario might overlap with a real case of
            network outage?<br>

            <br>

            <br>

            My Regards,<br>

            Alex.<br>

            <br>

            <br>

            <br>

            On 01/13/2013 10:54 AM, Doron Fediuck wrote:<br>

          </div>

          <blockquote

            cite="mid:917703155.3933743.1358067259764.JavaMail.root@redhat.com">

            <style>p { margin: 0; }</style>

            <div style="font-family: times new roman,new

              york,times,serif; font-size: 12pt; color: #000000"><br>

              <br>

              <hr id="zwchr">

              <blockquote style="border-left:2px solid rgb(16, 16,

255);margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From:

                </b>"Alexandru Vladulescu" <a moz-do-not-send="true"

                  class="moz-txt-link-rfc2396E"

                  href="mailto:avladulescu@bfproject.ro" target="_blank">&lt;avladulescu@bfproject.ro&gt;</a><br>

                <b>To: </b>"Doron Fediuck" <a moz-do-not-send="true"

                  class="moz-txt-link-rfc2396E"

                  href="mailto:dfediuck@redhat.com" target="_blank">&lt;dfediuck@redhat.com&gt;</a><br>

                <b>Cc: </b>"users" <a moz-do-not-send="true"

                  class="moz-txt-link-rfc2396E"

                  href="mailto:users@ovirt.org" target="_blank">&lt;users@ovirt.org&gt;</a><br>

                <b>Sent: </b>Sunday, January 13, 2013 10:46:41 AM<br>

                <b>Subject: </b>Re: [Users] Testing High Availability

                and Power outages<br>

                <br>

                <div>Dear Doron,</div>

                <div><br>

                </div>

                <div>I haven't collected the logs from the tests, but I

                  would gladly re-do the case and get back to you asap.&nbsp;</div>

                <div><br>

                </div>

                <div>This feature is the main reason I chose to go
                  with Ovirt in the first place, over
                  other virt environments.</div>

                <div><br>

                </div>

                <div>Could you please tell me which logs I should be
                  focusing on,&nbsp;<span class="Apple-style-span"

                    style="-webkit-tap-highlight-color: rgba(26, 26, 26,

                    0.296875); -webkit-composition-fill-color: rgba(175,

                    192, 227, 0.230469);

                    -webkit-composition-frame-color: rgba(77, 128, 180,

                    0.230469); ">besides the engine log; vdsm maybe or

                    other relevant logs?</span></div>

                <div><br>

                  <div>

                    <div>Regards,</div>

                    <div>Alex</div>

                  </div>

                  <div><br>

                  </div>

                  <div><br>

                  </div>

                  <div><span class="Apple-style-span"

                      style="-webkit-tap-highlight-color: rgba(26, 26,

                      26, 0.292969); -webkit-composition-fill-color:

                      rgba(175, 192, 227, 0.230469);

                      -webkit-composition-frame-color: rgba(77, 128,

                      180, 0.230469);">--</span></div>

                  <div><span class="Apple-style-span"

                      style="-webkit-tap-highlight-color: rgba(26, 26,

                      26, 0.296875); -webkit-composition-fill-color:

                      rgba(175, 192, 227, 0.230469);

                      -webkit-composition-frame-color: rgba(77, 128,

                      180, 0.230469); ">Sent from phone.</span></div>

                </div>

                <div><br>

                  On 13.01.2013, at 09:56, Doron Fediuck &lt;<a

                    moz-do-not-send="true"

                    href="mailto:dfediuck@redhat.com" target="_blank">dfediuck@redhat.com</a>&gt;

                  wrote:<br>

                  <br>

                </div>

                <blockquote>

                  <div>

                    <div style="font-family: times new roman,new

                      york,times,serif; font-size: 12pt; color: #000000"><br>

                      <br>

                      <hr id="zwchr">

                      <blockquote style="border-left:2px solid rgb(16,

                        16,

255);margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From:

                        </b>"Alexandru Vladulescu" &lt;<a

                          moz-do-not-send="true"

                          href="mailto:avladulescu@bfproject.ro"

                          target="_blank">avladulescu@bfproject.ro</a>&gt;<br>

                        <b>To: </b>"users" &lt;<a

                          moz-do-not-send="true"

                          href="mailto:users@ovirt.org" target="_blank">users@ovirt.org</a>&gt;<br>

                        <b>Sent: </b>Friday, January 11, 2013 2:47:38

                        PM<br>

                        <b>Subject: </b>[Users] Testing High

                        Availability and Power outages<br>

                        <br>

                        <br>

                        Hi,<br>

                        <br>

                        <br>

                        Today, I started testing on my Ovirt 3.1

                        installation (from dreyou repos) running on 3 x

                        Centos 6.3 hypervisors the High Availability

                        features and the fence mechanism.<br>

                        <br>

                        As I reported yesterday in a previous
                        email thread, the migration priority queue
                        cannot be increased (bug) in this current
                        version, so I decided to test what the official
                        documentation says about the High Availability
                        cases. <br>
                        <br>
                        It would be a disaster scenario if one
                        hypervisor had a power outage/hardware problem
                        and the VMs running on it did not migrate to
                        other spare resources.<br>

                        <br>

                        <br>

                        The official documentation from <a
                          moz-do-not-send="true" href="http://ovirt.org"
                          target="_blank">ovirt.org</a> states the
                        following:<br>

                        <h3> <span class="mw-headline"

                            id="High_availability"> <font

                              color="#333399"><i><small>High

                                  availability </small></i></font></span></h3>

                        <font color="#333399"><i><small> </small></i></font>

                        <p><font color="#333399"><i><small>Allows

                                critical VMs to be restarted on another

                                host in the event of hardware failure

                                with three levels of priority, taking

                                into account resiliency policy. </small></i></font></p>

                        <font color="#333399"><i><small> </small></i></font>

                        <ul>

                          <li><font color="#333399"><i><small>

                                  Resiliency policy to control high

                                  availability VMs at the cluster level.

                                </small></i></font></li>

                          <li><font color="#333399"><i><small> Supports

                                  application-level high availability

                                  with supported fencing agents. </small></i></font></li>

                        </ul>

                        <br>

                        As well as in the Architecture description:<br>

                        <font color="#333399"><br>

                          <small><i>High Availability - restart guest

                              VMs from failed hosts automatically on

                              other hosts</i></small></font><br>

                        <br>

                        <br>

                        <br>

                        So the testing went like this: one VM running
                        a Linux box, with the "High Available" check box
                        ticked and "Priority for Run/Migration
                        queue:" set to Low. Under Host, "Any Host in
                        Cluster" is selected, and "Allow VM
                        migration only upon Admin specific request"
                        is not checked.<br>

                        <br>

                        <br>

                        <br>

                        My environment:<br>

                        <br>

                        <br>

                        Configuration :&nbsp; 2 x Hypervisors (same

                        cluster/hardware configuration) ; 1 x Hypervisor

                        + acting as a NAS (NFS) server (different

                        cluster/hardware configuration)<br>

                        <br>

                        Actions: Cut off the power to one of
                        the hypervisors in the 2-node cluster while
                        the VM was running on it. This simulates a
                        power outage.<br>
                        <br>
                        Results: The hypervisor node that suffered
                        the outage shows in the Hosts tab with status Non
                        Responsive, and the VM has a question
                        mark and cannot be powered off or anything else
                        (therefore it's stuck).<br>

                        <br>

                        In the Log console in GUI, I get: <br>

                        <br>

                        <span style="color: rgb(255, 255, 255);

                          font-family: 'Arial Unicode MS', Arial,

                          sans-serif; font-size: small; font-style:

                          normal; font-variant: normal; font-weight:

                          normal; letter-spacing: normal; line-height:

                          26px; orphans: 2; text-align: start;

                          text-indent: 0px; text-transform: none;

                          white-space: nowrap; widows: 2; word-spacing:

                          0px; -webkit-text-size-adjust: auto;

                          -webkit-text-stroke-width: 0px;

                          background-color: rgb(102, 102, 102); display:

                          inline !important; float: none; ">Host Hyper01

                          is non-responsive.</span><br>

                        <span style="color: rgb(255, 255, 255);

                          font-family: 'Arial Unicode MS', Arial,

                          sans-serif; font-size: small; font-style:

                          normal; font-variant: normal; font-weight:

                          normal; letter-spacing: normal; line-height:

                          26px; orphans: 2; text-align: start;

                          text-indent: 0px; text-transform: none;

                          white-space: nowrap; widows: 2; word-spacing:

                          0px; -webkit-text-size-adjust: auto;

                          -webkit-text-stroke-width: 0px;

                          background-color: rgb(102, 102, 102); display:

                          inline !important; float: none; ">VM

                          Web-Frontend01 was set to the Unknown status.</span><br>

                        <br>

                        There is nothing I could do besides
                        clicking on the Hyper01 "Confirm Host has been
                        rebooted"; afterwards the VM starts on
                        Hyper02 with a cold reboot of the VM.<br>

                        <br>

                        The Log console changes to:<br>

                        <br>

                        <span style="color: rgb(255, 255, 255);

                          font-family: 'Arial Unicode MS', Arial,

                          sans-serif; font-size: small; font-style:

                          normal; font-variant: normal; font-weight:

                          normal; letter-spacing: normal; line-height:

                          26px; orphans: 2; text-align: start;

                          text-indent: 0px; text-transform: none;

                          white-space: nowrap; widows: 2; word-spacing:

                          0px; -webkit-text-size-adjust: auto;

                          -webkit-text-stroke-width: 0px;

                          background-color: rgb(102, 102, 102); display:

                          inline !important; float: none; ">Vm

                          Web-Frontend01 was shut down due to Hyper01

                          host reboot or manual fence</span><br>

                        <span style="color: rgb(255, 255, 255);

                          font-family: 'Arial Unicode MS', Arial,

                          sans-serif; font-size: small; font-style:

                          normal; font-variant: normal; font-weight:

                          normal; letter-spacing: normal; line-height:

                          26px; orphans: 2; text-align: start;

                          text-indent: 0px; text-transform: none;

                          white-space: nowrap; widows: 2; word-spacing:

                          0px; -webkit-text-size-adjust: auto;

                          -webkit-text-stroke-width: 0px;

                          background-color: rgb(102, 102, 102); display:

                          inline !important; float: none; ">All VMs'

                          status on Non-Responsive Host Hyper01 were

                          changed to 'Down' by admin@internal</span><br>

                        <span style="color: rgb(255, 255, 255);

                          font-family: 'Arial Unicode MS', Arial,

                          sans-serif; font-size: small; font-style:

                          normal; font-variant: normal; font-weight:

                          normal; letter-spacing: normal; line-height:

                          26px; orphans: 2; text-align: start;

                          text-indent: 0px; text-transform: none;

                          white-space: nowrap; widows: 2; word-spacing:

                          0px; -webkit-text-size-adjust: auto;

                          -webkit-text-stroke-width: 0px;

                          background-color: rgb(102, 102, 102); display:

                          inline !important; float: none; ">Manual

                          fencing for host Hyper01 was started.</span><br>

                        <span style="color: rgb(255, 255, 255);

                          font-family: 'Arial Unicode MS', Arial,

                          sans-serif; font-size: small; font-style:

                          normal; font-variant: normal; font-weight:

                          normal; letter-spacing: normal; line-height:

                          26px; orphans: 2; text-align: start;

                          text-indent: 0px; text-transform: none;

                          white-space: nowrap; widows: 2; word-spacing:

                          0px; -webkit-text-size-adjust: auto;

                          -webkit-text-stroke-width: 0px;

                          background-color: rgb(102, 102, 102); display:

                          inline !important; float: none; ">VM

                          Web-Frontend01 was restarted on Host Hyper02</span><br>

                        <br>

                        <br>

                        I would like your view on this problem.
                        Reading the documentation &amp; features pages
                        on the official website, I assumed that this
                        would be an automatic mechanism
                        relying on some sort of vdsm &amp; engine
                        fencing action. Am I missing something regarding
                        it?<br>

                        <br>

                        <br>

                        Thank you for your patience reading this.<br>

                        <br>

                        <br>

                        Regards,<br>

                        Alex.<br>

                        <br>

                        <br>

                        <br>

                        <br>

                        _______________________________________________<br>

                        Users mailing list<br>

                        <a moz-do-not-send="true"

                          href="mailto:Users@ovirt.org" target="_blank">Users@ovirt.org</a><br>

                        <a moz-do-not-send="true"

                          class="moz-txt-link-freetext"

                          href="http://lists.ovirt.org/mailman/listinfo/users"

                          target="_blank">http://lists.ovirt.org/mailman/listinfo/users</a><br>

                      </blockquote>

                      Hi Alex,<br>

                      Can you share with us the engine's log from the

                      relevant time period?<br>

                      <br>

                      Doron<br>

                    </div>

                  </div>

                </blockquote>

              </blockquote>

              Hi Alex,<br>

              engine log is the important one, as it will show
              the decision making process.<br>

              VDSM logs should be kept in case something is unclear, but

              I suggest we begin with<br>

              engine.log.<br>

              <br>

            </div>

          </blockquote>

          <br>

        </blockquote>

        Hi Alex,<br>
        In order to have HA working at the host level (which is what you're
        testing now) you need to<br>
        configure power management for each of the relevant hosts (go to the
        Hosts tab, right click a host<br>
        and choose edit. Now select the Power management tab and you'll
        see it). In the details you<br>
        gave us it's not clear how you defined Power management for your
        hosts, so I can only assume<br>
        it's not defined properly.<br>

        <br>

        The reason for this necessity is that we cannot resume a VM on a
        different host before we<br>
        have verified the original host's status. If, for example, the VM is
        still running on the original<br>
        host and we lost network connectivity to it, we risk
        running the same VM on 2 different<br>
        hosts at the same time, which will corrupt its disk(s). So the
        only way to prevent it is<br>
        rebooting the original host, which will ensure the VM is not
        running there. We call the reboot<br>
        procedure fencing, and if you check your logs you'll be able
        to see:<br>

        <br>

        2013-01-13 21:29:42,380 ERROR

        [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand]

        (pool-3-thread-44) [a1803d1] Failed to run Fence script on

        vds:Hyper01, VMs moved to UnKnown instead.<br>
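        <br>
        A minimal sketch of how entries like the one above could be
        picked out of engine.log. The regular expression is an assumption
        inferred only from the single line quoted here, and the function
        name find_fence_failures is hypothetical; other versions may format
        the message differently.<br>
        <pre>
```python
import re

# Pattern inferred from the one log line quoted in this thread; it is an
# assumption, not an official description of the engine.log format.
FENCE_FAILURE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ERROR\s+"
    r"\[[^\]]+VdsNotRespondingTreatmentCommand\]"
    r".*Failed to run Fence script on vds:([^,\s]+)"
)


def find_fence_failures(lines):
    """Return (timestamp, host) pairs for every fence failure found."""
    failures = []
    for line in lines:
        match = FENCE_FAILURE.search(line)
        if match:
            failures.append((match.group(1), match.group(2)))
    return failures


# The log entry from this thread, joined back into a single line.
sample = (
    "2013-01-13 21:29:42,380 ERROR "
    "[org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand] "
    "(pool-3-thread-44) [a1803d1] Failed to run Fence script on "
    "vds:Hyper01, VMs moved to UnKnown instead."
)

print(find_fence_failures([sample]))
```
</pre>
        To scan a real file, the same function can be given an open file
        object, since it only iterates over lines.<br>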

        <br>

        So the only way for you to handle it is to confirm the host was
        rebooted (as you did), which will<br>
        allow resuming the VM on a different host.<br>
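        <br>
        The reasoning above can be summarised as a small decision
        function. This is an illustration only: FenceResult and
        may_restart_vm are hypothetical names and are not taken from the
        engine code.<br>
        <pre>
```python
from enum import Enum


class FenceResult(Enum):
    """Outcome of an automatic fence attempt (hypothetical names)."""
    SUCCEEDED = "succeeded"            # power management rebooted the host
    FAILED = "failed"                  # fence script could not be run
    NOT_CONFIGURED = "not_configured"  # no power management defined


def may_restart_vm(host_responsive, fence_result, admin_confirmed_reboot):
    """Return True only when the VM cannot still be running on the original host."""
    if host_responsive:
        # Host answers, so the VM may still be running there: do not start a copy.
        return False
    if fence_result is FenceResult.SUCCEEDED:
        # The host was power-cycled, so the VM is certainly gone from it.
        return True
    # Fencing failed or is not configured: only a manual confirmation is safe.
    return admin_confirmed_reboot


# The situation from this thread: host unplugged, no working power management.
print(may_restart_vm(False, FenceResult.NOT_CONFIGURED, False))  # False
print(may_restart_vm(False, FenceResult.NOT_CONFIGURED, True))   # True
```
</pre>
        The second call corresponds to clicking the manual confirmation,
        which is the only path left when fencing cannot run.<br>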

        <br>

        Doron<br>

      </div>

    </blockquote>

    <br>

    Hi Doron,<br>

    <br>

    Regarding your reply, I don't have such a fence mechanism through an IMM
    or iLO interface, as the hardware that I am using doesn't support
    IPMI technology. Your response makes me seriously consider
    getting an add-on card that can perform the
    basic reboot, restart and reset functions for our hardware.<br>

    <br>

    Thank you very much for your advice on this.<br>

    <br>

    Alex<br>

    <br>

    <br>

  </body>

</html>