--Apple-Mail=_55FDCE7C-F8D0-4565-BEE6-931F355AA1E1
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset=utf-8
On 23 Apr 2018, at 10:52, Daniel Menzel =
<daniel.menzel(a)hhi.fraunhofer.de> wrote:
=20
Hi Michal,
=20
in your last mail you wrote that the values can be tuned down - how =
can this be done?
=20
=20
this is not something we change very often, as lowering the values =
decreases the system=E2=80=99s tolerance to short network glitches.
You=E2=80=99d have to take a look at vdc_options and play with some of =
those parameters=E2=80=A6 Martin/Eli may have some suggestions; otherwise =
you=E2=80=99d have to read the source code and experiment.
Best
Daniel
=20
On 12.04.2018 20:29, Michal Skrivanek wrote:
>=20
>=20
>> On 12 Apr 2018, at 13:13, Daniel Menzel =
<daniel.menzel(a)hhi.fraunhofer.de =
<mailto:daniel.menzel@hhi.fraunhofer.de>> wrote:
>>=20
>> Hi there,
>>=20
>> does anyone have an idea how to decrease a virtual machine's =
downtime?
>>=20
>> Best
>> Daniel
>>=20
>> On 06.04.2018 13:34, Daniel Menzel wrote:
>>> Hi Michal,
>>>=20
>
>=20
=20
> Hi Daniel,
> adding Martin to review fencing behavior
>>> (sorry for misspelling your name in my first mail).
>>>=20
>
>=20
=20
> that=E2=80=99s not the reason I=E2=80=99m replying late!:-))
>=20
>>> The settings for the VMs are the following (oVirt 4.2):
>>>=20
>>> HA checkbox enabled of course
>>> "Target Storage Domain for VM Lease" -> left empty
>=20
> if you need faster reactions then try to use VM Leases as well, it =
won=E2=80=99t make a difference in this case but will help in case of =
network issues. E.g. if you use iSCSI and the storage connection breaks =
while host connection still works it would restart the VM in about 80s; =
otherwise it would take >5 mins.=20
>>> "Resume Behavior" -> AUTO_RESUME
>>> Priority for Migration -> High
>>> "Watchdog Model" -> No-Watchdog
>>> For testing we did not kill any VM but the host. So basically we =
simulated an instantaneous crash by manually turning the machine off via =
the IPMI interface (not via the operating system!) and pinged the =
guest(s). What happens then?
>>>=20
>>> 2-3 seconds after we press the host's shutdown button we lose =
ping contact to the VM(s).
>>> After another 20s oVirt changes the host's status to "connecting", =
the VM's status is set to a question mark.
>>> After ~1:30 the host is flagged as "non responsive"
>=20
> that sounds about right. Now fencing action should have been =
initiated; if you can share the engine logs we can confirm that. IIRC we =
first try soft fencing - try to ssh to that host, that might take some =
time to time out I guess. Martin?
>>>=20
>>> After ~2:10 the host's reboot is initiated by oVirt, 5-10s later =
the guest is back online.
>>> So, there seems to be one mistake I made in the first mail: The =
downtime is "only" 2.5min. But still I think this time can be decreased =
as for some services it is still quite a long time.
>>>=20
>
>=20
=20
> these values can be tuned down, but then you may be more susceptible =
to fencing power cycling a host in case of shorter network outages. It =
may be ok=E2=80=A6depending on your requirements.
>>> Best
>>> Daniel
>>>=20
>>> On 06.04.2018 12:49, Michal Skrivanek wrote:
>>>>> On 6 Apr 2018, at 12:45, Daniel Menzel =
<daniel.menzel(a)hhi.fraunhofer.de> =
<mailto:daniel.menzel@hhi.fraunhofer.de> wrote:
>>>>>=20
>>>>> Hi Michael,
>>>>> thanks for your mail. Sorry, I forgot to write that. Yes, we have =
power management and fencing enabled on all hosts. We also tested this =
and found out that it works perfectly. So this cannot be the reason I =
guess.
>>>> Hi Daniel,
>>>> ok, then it=E2=80=99s worth looking into details. Can you describe =
in more detail what happens? What exact settings you=E2=80=99re using =
for such VM? Are you killing the HE VM or other VMs or both? Would be =
good to narrow it down a bit and then review the exact flow
>>>>=20
>>>> Thanks,
>>>> michal
>>>>=20
>>>>> Daniel
>>>>>=20
>>>>>=20
>>>>>=20
>>>>> On 06.04.2018 11:11, Michal Skrivanek wrote:
>>>>>>> On 4 Apr 2018, at 15:36, Daniel Menzel =
<daniel.menzel(a)hhi.fraunhofer.de> =
<mailto:daniel.menzel@hhi.fraunhofer.de> wrote:
>>>>>>>=20
>>>>>>> Hello,
>>>>>>>=20
>>>>>>> we're successfully using a setup with 4 Nodes and a replicated =
Gluster for storage. The engine is self hosted. What we're dealing with =
at the moment is the high availability: If a node fails (for example =
simulated by a forced power loss) the engine comes back up online =
within ~2min. But guests (having the HA option enabled) come back =
online only after a very long grace time of ~5min. As we have a reliable =
network (40 GbE) and reliable servers I think that the default grace =
times are way too high for us - is there any possibility to change those =
values?
>>>>>> And do you have Power Management (iLO, iDRAC, etc.) configured for =
your hosts? Otherwise we have to resort to relatively long timeouts to =
make sure the host is really dead
>>>>>> Thanks,
>>>>>> michal
>>>>>>> Thanks in advance!
>>>>>>> Daniel
>>>>>>>=20
>>>>>>> _______________________________________________
>>>>>>> Users mailing list
>>>>>>> Users(a)ovirt.org <mailto:Users@ovirt.org>
>>>>>>> http://lists.ovirt.org/mailman/listinfo/users =
<http://lists.ovirt.org/mailman/listinfo/users>
>>>>>>>=20
>>>>>>>=20
>>>=20
>>>=20
>>>=20
>>=20
>
=20
=20
--Apple-Mail=_55FDCE7C-F8D0-4565-BEE6-931F355AA1E1--