Re: [ovirt-users] Decrease downtime for HA

23 Apr 2018

Hi Michal,

in your last mail you wrote that the values can be tuned down. How can
this be done?

Best
Daniel

On 12.04.2018 20:29, Michal Skrivanek wrote:
...
...
On 12 Apr 2018, at 13:13, Daniel Menzel 
<daniel.menzel@hhi.fraunhofer.de 
<mailto:daniel.menzel@hhi.fraunhofer.de>> wrote:
Hi there,
does anyone have an idea how to decrease a virtual machine's downtime?
Best
Daniel
On 06.04.2018 13:34, Daniel Menzel wrote:
...
Hi Michal,
Hi Daniel,
adding Martin to review fencing behavior
...
...
(sorry for misspelling your name in my first mail).
that’s not the reason I’m replying late!:-))
...
...
The settings for the VMs are the following (oVirt 4.2):
1. HA checkbox enabled of course
 2. "Target Storage Domain for VM Lease" -> left empty
if you need faster reactions then try to use VM Leases as well, it 
won’t make a difference in this case but will help in case of network 
issues. E.g. if you use iSCSI and the storage connection breaks while 
host connection still works it would restart the VM in about 80s; 
otherwise it would take >5 mins.
...
...
3. "Resume Behavior" -> AUTO_RESUME
 4. Priority for Migration -> High
 5. "Watchdog Model" -> No-Watchdog
For testing we did not kill any VM but the host. So basically we 
simulated an instantaneous crash by manually turning the machine off 
via the IPMI interface (not via the operating system!) and pinged the 
guest(s). What happens then?
1. 2-3 seconds after we press the host's shutdown button we
    lose ping contact to the VM(s).
 2. After another 20s oVirt changes the host's status to
    "connecting", the VM's status is set to a question mark.
 3. After ~1:30 the host is flagged as "non responsive".
that sounds about right. Now fencing action should have been 
initiated, if you can share the engine logs we can confirm that. IIRC 
we first try soft fencing - try to ssh to that host, that might take 
some time to time out I guess. Martin?
...
...
4. After ~2:10 the host's reboot is initiated by oVirt, 5-10s later
    the guest is back online.
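
The timeline in the list above can be tabulated to see which phase costs the most time. This is only a rough sketch: the offsets are approximations read off the reported observations (upper bounds where a range was given), and the event labels and function names are made up for illustration.

```python
# Approximate offsets in seconds from the moment the host is powered off.
# Values are taken from the timeline reported in this thread; where a range
# was given (2-3s, 5-10s) the upper bound is used. They are not measurements.
EVENTS = [
    ("host powered off", 0),
    ("ping to guest lost", 3),
    ("host 'connecting', VM status unknown", 23),   # "after another 20s"
    ("host 'non responsive'", 90),                   # ~1:30
    ("host reboot initiated by oVirt", 130),         # ~2:10
    ("guest back online", 140),                      # 5-10s later
]


def phase_durations(events):
    """Return (from_event, to_event, seconds) for each consecutive pair."""
    return [
        (a_name, b_name, b_t - a_t)
        for (a_name, a_t), (b_name, b_t) in zip(events, events[1:])
    ]


def guest_downtime(events):
    """Seconds between losing ping and the guest answering again."""
    times = dict(events)
    return times["guest back online"] - times["ping to guest lost"]


if __name__ == "__main__":
    for start, end, secs in phase_durations(EVENTS):
        print(f"{secs:4d}s  {start} -> {end}")
    print(f"guest downtime: ~{guest_downtime(EVENTS)}s")
```

With these assumed offsets, most of the time passes between "connecting" and "non responsive" and between "non responsive" and the reboot, so those are the phases any tuning would have to shorten.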
So, there seems to be one mistake I made in the first mail: The 
downtime is "only" 2.5min. But still I think this time can be 
decreased as for some services it is still quite a long time.
these values can be tuned down, but then you may be more susceptible 
to fencing power cycling a host in case of shorter network outages. It 
may be ok…depending on your requirements.
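
To compare the downtime before and after any tuning, the ping test described earlier in the thread could be scripted so the numbers are repeatable. The following is a minimal sketch, not an oVirt tool: the function names are hypothetical, and `ping -c 1 -W <seconds>` assumes the Linux iputils version of `ping`.

```python
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo request gets a reply.

    Assumes Linux iputils `ping` (-c count, -W reply timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def longest_outage(samples):
    """Given (timestamp, reachable) samples in time order, return the longest
    span in seconds from the first failed probe to the next successful one.
    An outage that has not ended by the last sample is not counted."""
    longest = 0.0
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts            # outage starts
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None          # outage ends
    return longest


def watch(host, duration_s=600, interval_s=1.0):
    """Probe `host` roughly once per interval and collect the samples."""
    samples = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append((time.monotonic(), ping_once(host)))
        time.sleep(interval_s)
    return samples

# Example use (guest address is a placeholder):
#   samples = watch("guest.example.org")
#   print(f"longest outage: {longest_outage(samples):.0f}s")
```

Start the watcher against the guest, power off the host via IPMI, and read the longest outage once the guest answers again. The resolution is limited by the probe interval and timeout.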
...
...
Best
Daniel
On 06.04.2018 12:49, Michal Skrivanek wrote:
...
...
On 6 Apr 2018, at 12:45, Daniel Menzel<daniel.menzel@hhi.fraunhofer.de>  wrote:
Hi Michael,
thanks for your mail. Sorry, I forgot to write that. Yes, we have power management and fencing enabled on all hosts. We also tested this and found out that it works perfectly. So this cannot be the reason I guess.
Hi Daniel,
ok, then it’s worth looking into details. Can you describe in more detail what happens? What exact settings are you using for such a VM? Are you killing the HE VM or other VMs or both? It would be good to narrow it down a bit and then review the exact flow.
Thanks,
michal
...
Daniel
On 06.04.2018 11:11, Michal Skrivanek wrote:
...
> On 4 Apr 2018, at 15:36, Daniel Menzel<daniel.menzel@hhi.fraunhofer.de>  wrote:
>
> Hello,
>
> we're successfully using a setup with 4 Nodes and a replicated Gluster for storage. The engine is self hosted. What we're dealing with at the moment is the high availability: If a node fails (for example simulated by a forced power loss) the engine comes back up online within ~2min. But guests (having the HA option enabled) come back online only after a very long grace time of ~5min. As we have a reliable network (40 GbE) and reliable servers I think that the default grace times are way too high for us - is there any possibility to change those values?
And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? Otherwise we have to resort to relatively long timeouts to make sure the host is really dead.
Thanks,
michal
> Thanks in advance!
> Daniel
>
> _______________________________________________
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>
>