On 12 Apr 2018, at 13:13, Daniel Menzel <daniel.menzel(a)hhi.fraunhofer.de> wrote:

Hi there,

does anyone have an idea how to decrease a virtual machine's downtime?

Best
Daniel

On 06.04.2018 13:34, Daniel Menzel wrote:
> Hi Michal,
>
Hi Daniel,
adding Martin to review fencing behavior
> (sorry for misspelling your name in my first mail).
>
that’s not the reason I’m replying late! :-))
> The settings for the VMs are the following (oVirt 4.2):
>
> HA checkbox enabled of course
> "Target Storage Domain for VM Lease" -> left empty
if you need faster reactions then try to use VM Leases as well. It won’t make a difference in this case, but it will help in case of network issues. E.g. if you use iSCSI and the storage connection breaks while the host connection still works, the VM would be restarted in about 80s; otherwise it would take >5 mins.
> "Resume Behavior" -> AUTO_RESUME
> Priority for Migration -> High
> "Watchdog Model" -> No-Watchdog
> For testing we did not kill any VM but the host. So basically we simulated an instantaneous crash by manually turning the machine off via the IPMI interface (not via the operating system!) and pinging the guest(s). What happens then?
>
> 2-3 seconds after we press the host's shutdown button we lose ping contact to the VM(s).
> After another 20s oVirt changes the host's status to "connecting", the VM's status is set to a question mark.
> After ~1:30 the host is flagged as "non responsive"
that sounds about right. Now the fencing action should have been initiated; if you can share the engine logs we can confirm that. IIRC we first try soft fencing - try to ssh to that host - and that might take some time to time out I guess. Martin?
> After ~2:10 the host's reboot is initiated by oVirt, 5-10s later the guest is back online.
> So, there seems to be one mistake I made in the first mail: The downtime is "only" 2.5min. But still I think this time can be decreased as for some services it is still quite a long time.
>
these values can be tuned down, but then you may be more susceptible to fencing power cycling a host in case of shorter network outages. It may be ok… depending on your requirements.
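To make the trade-off concrete, here is a minimal sketch in Python. It is not oVirt code and does not use real engine settings: the timeline is taken from the test reported above, read as seconds elapsed since power-off (an assumption, with the upper bound of each range used), and `grace_s` is a hypothetical stand-in for whatever timeouts end up being tuned.

```python
# Illustrative model only; not oVirt code and not real engine settings.
# Event times come from the test described in this thread, read as seconds
# elapsed since the host was powered off (an assumption; upper bounds used).
OBSERVED = [
    ("ping to guest lost", 3),
    ("host status 'connecting'", 23),
    ("host flagged 'non responsive'", 90),
    ("host reboot initiated by oVirt", 130),
    ("guest back online", 140),
]


def phase_durations(events):
    """Return (label, seconds elapsed between the previous event and this one)."""
    durations = []
    previous = 0
    for label, at in events:
        durations.append((label, at - previous))
        previous = at
    return durations


def fenced_by_outage(outage_s, grace_s):
    """True if a network outage lasts at least as long as the grace period,
    i.e. a healthy host would be power cycled.
    grace_s is a hypothetical tunable, not an actual oVirt option name."""
    return outage_s >= grace_s


if __name__ == "__main__":
    for label, seconds in phase_durations(OBSERVED):
        print(f"{seconds:>4d}s  {label}")
    # Compare a grace period close to the observed one with a much shorter one.
    for grace in (90, 30):
        print(f"grace={grace}s, 45s outage fenced: {fenced_by_outage(45, grace)}")
```

In this reading most of the downtime is spent waiting before the host is declared non responsive, so that is where tuning would help most, and also where a short network outage is most likely to trigger an unnecessary power cycle.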
> Best
> Daniel
>
> On 06.04.2018 12:49, Michal Skrivanek wrote:
>>> On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.menzel(a)hhi.fraunhofer.de> wrote:
>>>
>>> Hi Michael,
>>> thanks for your mail. Sorry, I forgot to write that. Yes, we have power management and fencing enabled on all hosts. We also tested this and found out that it works perfectly. So this cannot be the reason I guess.
>> Hi Daniel,
>> ok, then it’s worth looking into the details. Can you describe in more detail what happens? What exact settings are you using for such a VM? Are you killing the HE VM or other VMs or both? It would be good to narrow it down a bit and then review the exact flow.
>>
>> Thanks,
>> michal
>>
>>> Daniel
>>>
>>> On 06.04.2018 11:11, Michal Skrivanek wrote:
>>>>> On 4 Apr 2018, at 15:36, Daniel Menzel <daniel.menzel(a)hhi.fraunhofer.de> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> we're successfully using a setup with 4 nodes and a replicated Gluster for storage. The engine is self-hosted. What we're dealing with at the moment is the high availability: if a node fails (for example simulated by a forced power loss), the engine comes back online within ~2min. But guests (having the HA option enabled) come back online only after a very long grace time of ~5min. As we have a reliable network (40 GbE) and reliable servers, I think that the default grace times are way too high for us - is there any possibility to change those values?
>>>> And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? Otherwise we have to resort to relatively long timeouts to make sure the host is really dead.
>>>> Thanks,
>>>> michal
>>>>> Thanks in advance!
>>>>> Daniel
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list
>>>>> Users(a)ovirt.org
>>>>> http://lists.ovirt.org/mailman/listinfo/users