Re: [ovirt-users] Decrease downtime for HA

12 Apr 2018


...
On 12 Apr 2018, at 13:13, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:

Hi there,

does anyone have an idea how to decrease a virtual machine's downtime?

Best
Daniel

On 06.04.2018 13:34, Daniel Menzel wrote:
...
Hi Michal,

Hi Daniel,
adding Martin to review fencing behavior
...
...
(sorry for misspelling your name in my first mail).

that’s not the reason I’m replying late! :-))
...
...
The settings for the VMs are the following (oVirt 4.2):

1. HA checkbox enabled of course
2. "Target Storage Domain for VM Lease" -> left empty

if you need faster reactions then try to use VM Leases as well, it
won’t make a difference in this case but will help in case of network
issues. E.g. if you use iSCSI and the storage connection breaks while
host connection still works it would restart the VM in about 80s;
otherwise it would take >5 mins.

3. "Resume Behavior" -> AUTO_RESUME
4. Priority for Migration -> High
5. "Watchdog Model" -> No-Watchdog

For testing we did not kill any VM but the host. So basically we
simulated an instantaneous crash by manually turning the machine off via
IPMI-Interface (not via operating system!) and pinging the guest(s).
What happens then?

1. 2-3 seconds after we press the host's shutdown button we lose ping
   contact to the VM(s).
2. After another 20s oVirt changes the host's status to "connecting",
   the VM's status is set to a question mark.
3. After ~1:30 the host is flagged to "non responsive"

that sounds about right. Now fencing action should have been initiated,
if you can share the engine logs we can confirm that. IIRC we first try
soft fencing - try to ssh to that host, that might take some time to
time out I guess. Martin?

4. After ~2:10 the host's reboot is initiated by oVirt, 5-10s later the
   guest is back online.

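A rough breakdown of the timeline above, as a minimal Python sketch. It assumes that ~1:30 and ~2:10 are measured from the moment the host is powered off, and it uses the upper bound of each range (3 s, 10 s). The names `events`, `phase_durations` and `total_downtime` are made up for illustration.

```python
# Timeline of the host-crash test, in seconds after power-off.
# Values come from the steps reported in this thread; measuring ~1:30 and
# ~2:10 from power-off is an assumption, and upper bounds are used
# for the ranges (2-3 s, 5-10 s).
events = [
    ("ping to VM lost", 3),
    ("host status 'connecting'", 3 + 20),
    ("host flagged 'non responsive'", 90),
    ("host reboot initiated by oVirt", 130),
    ("guest back online", 130 + 10),
]


def phase_durations(events):
    """Return (label, seconds waited since the previous event) pairs."""
    out = []
    previous = 0
    for label, at in events:
        out.append((label, at - previous))
        previous = at
    return out


def total_downtime(events):
    """Seconds between losing ping and the guest answering again."""
    return events[-1][1] - events[0][1]


for label, waited in phase_durations(events):
    print(f"{waited:4d} s  -> {label}")
print(f"guest unreachable for ~{total_downtime(events)} s")
```

Under these assumptions the two longest waits are the roughly 67 s before the host is flagged non responsive and the roughly 40 s before the reboot is initiated, so those look like the phases where tuning would matter most.
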
So, there seems to be one mistake I made in the first mail: The
downtime is "only" 2.5min. But still I think this time can be decreased
as for some services it is still quite a long time.

these values can be tuned down, but then you may be more susceptible to
fencing power cycling a host in case of shorter network outages. It may
be ok… depending on your requirements.

Best
Daniel

On 06.04.2018 12:49, Michal Skrivanek wrote:
On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:

Hi Michael,
thanks for your mail. Sorry, I forgot to write that. Yes, we have power
management and fencing enabled on all hosts. We also tested this and
found out that it works perfectly. So this cannot be the reason I
guess.
...
...
...
Hi Daniel,
ok, then it’s worth looking into details. Can you describe in more
detail what happens? What exact settings you’re using for such VM? Are
you killing the HE VM or other VMs or both? Would be good to narrow it
down a bit and then review the exact flow

Thanks,
michal

...
Daniel

On 06.04.2018 11:11, Michal Skrivanek wrote:
...
...
On 4 Apr 2018, at 15:36, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:

Hello,

we're successfully using a setup with 4 Nodes and a replicated Gluster
for storage. The engine is self hosted. What we're dealing with at the
moment is the high availability: If a node fails (for example simulated
by a forced power loss) the engine comes back online within ~2min. But
guests (having the HA option enabled) come back online only after a very
long grace time of ~5min. As we have a reliable network (40 GbE) and
reliable servers I think that the default grace times are way too high
for us - is there any possibility to change those values?

And do you have Power Management (iLO, iDRAC, etc.) configured for your
hosts? Otherwise we have to resort to relatively long timeouts to make
sure the host is really dead
Thanks,
michal

Thanks in advance!
Daniel

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users