<div dir="ltr"><div class="gmail_default" style="font-size:large"><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Apr 23, 2018 at 8:06 PM, Michal Skrivanek <span dir="ltr"><<a href="mailto:michal.skrivanek@redhat.com" target="_blank">michal.skrivanek@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space"><br><div><span class=""><br><blockquote type="cite"><div>On 23 Apr 2018, at 10:52, Daniel Menzel <<a href="mailto:daniel.menzel@hhi.fraunhofer.de" target="_blank">daniel.menzel@hhi.fraunhofer.<wbr>de</a>> wrote:</div><br class="m_3511486639080399807Apple-interchange-newline"><div>
<div text="#000000" bgcolor="#FFFFFF"><p>Hi Michal,</p><p>in your last mail you wrote that the values can be tuned down.
How can this be done?</p><div><div>H</div></div></div></div></blockquote></span></div></div></blockquote><div><div style="font-size:large;display:inline" class="gmail_default">AFAIK, there is no point in changing the fencing vdc_options values in that case (assuming no kdump is configured here ...)<br><br></div><div style="font-size:large;display:inline" class="gmail_default">The fencing mechanism </div> <div style="font-size:large;display:inline" class="gmail_default">puts the host in the "connecting" state for a grace period that depends on the number of VMs running on it and on whether it serves as SPM or not<br></div><div style="font-size:large;display:inline" class="gmail_default">Once the host becomes non-responsive, we first try a soft fence (restarting VDSM via ssh), which also takes time<br></div><div style="font-size:large;display:inline" class="gmail_default">After that point, if the soft fence fails, the host is rebooted via the fencing script, and the time that takes depends entirely on the host <br></div><div style="font-size:large;display:inline" class="gmail_default">If there is something to look at, it is your host's reboot time: if the host reboots faster, the whole process will take less time ...<br><br></div><div style="font-size:large;display:inline" class="gmail_default">Regards<br><br></div><div style="font-size:large;display:inline" class="gmail_default">Eli <br><br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space"><div><span class=""><blockquote type="cite"><div><div text="#000000" bgcolor="#FFFFFF"><div><div><div style="font-size:large;display:inline" class="gmail_default"></div> <br></div></div></div></div></blockquote></span></div></div></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div 
style="word-wrap:break-word;line-break:after-white-space"><div><div><div class="h5"><br><blockquote type="cite"><div><div text="#000000" bgcolor="#FFFFFF"><p>Best<br>
Daniel<br>
</p>
<br>
<div class="m_3511486639080399807moz-cite-prefix">On 12.04.2018 20:29, Michal Skrivanek
wrote:<br>
</div>
<blockquote type="cite">
<br>
<div><br>
<blockquote type="cite">
<div>On 12 Apr 2018, at 13:13, Daniel Menzel <<a href="mailto:daniel.menzel@hhi.fraunhofer.de" target="_blank">daniel.menzel@hhi.fraunhofer.<wbr>de</a>>
wrote:</div>
<br class="m_3511486639080399807Apple-interchange-newline">
<div>
<div text="#000000" bgcolor="#FFFFFF"><p>Hi there,</p><p>does anyone have an idea how to decrease a
virtual machine's downtime?</p><p>Best<br>
Daniel<br>
</p>
<br>
<div class="m_3511486639080399807moz-cite-prefix">On 06.04.2018 13:34, Daniel
Menzel wrote:<br>
</div>
<blockquote type="cite"><p>Hi Michal,</p>
<div><br>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div><br>
</div>
Hi Daniel,</div>
<div>adding Martin to review fencing behavior<br>
<blockquote type="cite">
<div>
<div text="#000000" bgcolor="#FFFFFF">
<blockquote type="cite"><p>(sorry for misspelling your name in my first
mail).</p>
<div><br>
</div>
</blockquote>
</div>
</div>
</blockquote>
<br>
that’s not the reason I’m replying late!:-))</div>
<div><br>
<blockquote type="cite">
<div>
<div text="#000000" bgcolor="#FFFFFF">
<blockquote type="cite"><p>The settings for the VMs are the following
(oVirt 4.2):</p>
<ol>
<li>HA checkbox enabled of course</li>
<li>"Target Storage Domain for VM Lease"
-> left empty</li>
</ol>
</blockquote>
</div>
</div>
</blockquote>
<div><br>
</div>
          If you need faster reactions, try using VM leases as well.
          They won’t make a difference in this case, but they will help with
          network issues. For example, if you use iSCSI and the storage connection
          breaks while the host connection still works, the VM would be restarted
          in about 80s; otherwise it would take >5 mins. <br>
<blockquote type="cite">
<div>
<div text="#000000" bgcolor="#FFFFFF">
<blockquote type="cite">
<ol start="3">
<li>"Resume Behavior" -> AUTO_RESUME</li>
<li>Priority for Migration -> High<br>
</li>
<li>"Watchdog Model" -> No-Watchdog</li>
                      </ol><p>For testing we did not kill any VM but the
                        host. So basically we simulated an instantaneous crash
                        by manually turning the machine off via the IPMI interface
                        (not via the operating system!) and pinged the guest(s).
                        What happens then?</p>
<ol>
                        <li>2-3 seconds after we press the host's
                          shutdown button we lose ping contact to the VM(s).</li>
                        <li>After another 20s oVirt changes the
                          host's status to "connecting", the VM's status is
                          set to a question mark.</li>
                        <li>After ~1:30 the host is flagged as "non
                          responsive"</li>
</ol>
</blockquote>
</div>
</div>
</blockquote>
<div><br>
</div>
          That sounds about right. At this point a fencing action should have been
          initiated; if you can share the engine logs we can confirm that.
          IIRC we first try soft fencing, i.e. we try to ssh to that host, and that
          might take some time to time out, I guess. Martin?<br>
<blockquote type="cite">
<div>
<div text="#000000" bgcolor="#FFFFFF">
<blockquote type="cite">
<ol start="3">
<li> <br>
</li>
<li>After ~2:10 the host's reboot is
initiated by oVirt, 5-10s later the guest is back
online.</li>
</ol><p>So, there seems to be one mistake I made in
the first mail: The downtime is "only" 2.5min. But
still I think this time can be decreased as for some
services it is still quite a long time.</p>
<div><br>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div><br>
</div>
these values can be tuned down, but then you may be more
susceptible to fencing power cycling a host in case of shorter
network outages. It may be ok…depending on your requirements.<br>
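<div>A rough sketch of how the grace period before fencing might add up, assuming it is a base timeout plus a per-VM delay plus an extra delay when the host is SPM. The option names TimeoutToResetVdsInSeconds, DelayResetPerVmInSeconds and DelayResetForSpmInSeconds and the default values used below (60, 0.5 and 20 seconds) are recalled from the oVirt fencing documentation and are assumptions here; please check them on your own engine with engine-config before relying on them. The function name is hypothetical.</div>
<pre>
```python
def connecting_grace_seconds(running_vms, is_spm,
                             base_timeout=60.0,   # assumed TimeoutToResetVdsInSeconds
                             per_vm_delay=0.5,    # assumed DelayResetPerVmInSeconds
                             spm_delay=20.0):     # assumed DelayResetForSpmInSeconds
    """Estimate how long a host stays in "connecting" before fencing starts.

    This is only an illustration of the assumed formula:
    base timeout + per-VM delay * running VMs + SPM delay (if the host is SPM).
    """
    extra_for_spm = spm_delay if is_spm else 0.0
    return base_timeout + per_vm_delay * running_vms + extra_for_spm


if __name__ == "__main__":
    # Example: a host running 10 VMs that also serves as SPM.
    print(connecting_grace_seconds(10, True))
```
</pre>
<div>Under these assumed defaults, lowering the base timeout shortens the wait the most, which is also what makes a short network outage more likely to end in a power cycle.<br></div>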
<blockquote type="cite">
<div>
<div text="#000000" bgcolor="#FFFFFF">
<blockquote type="cite"><p>Best<br>
Daniel<br>
</p>
<br>
<div class="m_3511486639080399807moz-cite-prefix">On 06.04.2018 12:49, Michal
Skrivanek wrote:<br>
</div>
<blockquote type="cite">
<blockquote type="cite">
<pre>On 6 Apr 2018, at 12:45, Daniel Menzel <a class="m_3511486639080399807moz-txt-link-rfc2396E" href="mailto:daniel.menzel@hhi.fraunhofer.de" target="_blank"><daniel.menzel@hhi.fraunhofer.<wbr>de></a> wrote:
Hi Michael,
thanks for your mail. Sorry, I forgot to write that. Yes, we have power management and fencing enabled on all hosts. We also tested this and found out that it works perfectly. So this cannot be the reason I guess.
</pre>
</blockquote>
<pre>Hi Daniel,
ok, then it’s worth looking into details. Can you describe in more detail what happens? What exact settings are you using for such a VM? Are you killing the HE VM or other VMs or both? It would be good to narrow it down a bit and then review the exact flow.
Thanks,
michal
</pre>
<blockquote type="cite">
<pre>Daniel
On 06.04.2018 11:11, Michal Skrivanek wrote:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre>On 4 Apr 2018, at 15:36, Daniel Menzel <a class="m_3511486639080399807moz-txt-link-rfc2396E" href="mailto:daniel.menzel@hhi.fraunhofer.de" target="_blank"><daniel.menzel@hhi.fraunhofer.<wbr>de></a> wrote:
Hello,
we're successfully using a setup with 4 Nodes and a replicated Gluster for storage. The engine is self hosted. What we're dealing with at the moment is the high availability: If a node fails (for example simulated by a forced power loss) the engine comes back up online within ~2min. But guests (having the HA option enabled) come back online only after a very long grace time of ~5min. As we have a reliable network (40 GbE) and reliable servers I think that the default grace times are way too high for us - is there any possibility to change those values?
</pre>
</blockquote>
                          <pre>And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? Otherwise we have to resort to relatively long timeouts to make sure the host is really dead.
Thanks,
michal
</pre>
<blockquote type="cite">
<pre>Thanks in advance!
Daniel
______________________________<wbr>_________________
Users mailing list
<a class="m_3511486639080399807moz-txt-link-abbreviated" href="mailto:Users@ovirt.org" target="_blank">Users@ovirt.org</a>
<a class="m_3511486639080399807moz-txt-link-freetext" href="http://lists.ovirt.org/mailman/listinfo/users" target="_blank">http://lists.ovirt.org/<wbr>mailman/listinfo/users</a>
</pre>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br>
<br>
</blockquote>
<br>
</div>
</div>
</blockquote>
</div>
<br>
</blockquote>
<br>
</div>
</div></blockquote></div></div></div><br></div></blockquote></div><br></div></div>