This is a multi-part message in MIME format.
--------------8D37A5EBA80E36EA0FB88279
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Hi Michal,
(sorry for misspelling your name in my first mail).
The settings for the VMs are the following (oVirt 4.2):
1. HA checkbox enabled of course
2. "Target Storage Domain for VM Lease" -> left empty
3. "Resume Behavior" -> AUTO_RESUME
4. Priority for Migration -> High
5. "Watchdog Model" -> No-Watchdog
For testing we did not kill any VM, only the host. Basically, we
simulated an instantaneous crash by manually powering the machine off via
the IPMI interface (not via the operating system!) while pinging the
guest(s). What happens then?
1. 2-3 seconds after we press the host's shutdown button we lose
ping contact to the VM(s).
2. After another 20s oVirt changes the host's status to "connecting"
and the VM's status is set to a question mark.
3. After ~1:30 the host is flagged as "non responsive".
4. After ~2:10 the host's reboot is initiated by oVirt; 5-10s later the
guest is back online.
So, there seems to be one mistake I made in the first mail: The downtime
is "only" 2.5min. Still, I think this time could be reduced, as it is
quite long for some services.
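
To make the individual phases easier to compare, the timeline above can be turned into per-phase durations. This is only a rough sketch: it assumes that "~1:30" and "~2:10" are measured from the power-off, that "another 20s" is counted from the loss of ping at about 3s, and that the guest returns about 10s after the reboot is initiated. The names `events` and `phase_durations` are made up for this illustration.

```python
# Timeline from the test above, in seconds after the IPMI power-off.
# Assumption: "~1:30" and "~2:10" are measured from the power-off, and
# "another 20s" is counted from the loss of ping at roughly 3 s.
events = [
    ("ping lost", 3),
    ("host 'connecting'", 3 + 20),
    ("host 'non responsive'", 90),
    ("reboot initiated by oVirt", 130),
    ("guest back online", 130 + 10),  # upper end of "5-10s later"
]


def phase_durations(events):
    """Return (label, seconds since the previous event) for each event."""
    result = []
    previous = 0
    for label, at in events:
        result.append((label, at - previous))
        previous = at
    return result


phases = phase_durations(events)
for label, seconds in phases:
    print(f"{label:28s} +{seconds:3d} s")

# Guest downtime: from loss of ping until the guest answers again.
downtime = events[-1][1] - events[0][1]
print(f"guest downtime: ~{downtime} s")
```

Under these assumptions the two longest gaps are the ~67s between "connecting" and "non responsive" and the ~40s between "non responsive" and the reboot, so those are the phases where a shorter timeout would matter most.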
Best
Daniel
On 06.04.2018 12:49, Michal Skrivanek wrote:
> On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.menzel(a)hhi.fraunhofer.de> wrote:
>
> Hi Michael,
> thanks for your mail. Sorry, I forgot to write that. Yes, we have power management
> and fencing enabled on all hosts. We also tested this and found out that it works
> perfectly. So this cannot be the reason I guess.
Hi Daniel,
ok, then it’s worth looking into the details. Can you describe in more detail what happens?
What exact settings are you using for such a VM? Are you killing the HE VM, other VMs, or
both? It would be good to narrow it down a bit and then review the exact flow.
Thanks,
michal
> Daniel
>
>
>
> On 06.04.2018 11:11, Michal Skrivanek wrote:
>>> On 4 Apr 2018, at 15:36, Daniel Menzel <daniel.menzel(a)hhi.fraunhofer.de> wrote:
>>>
>>> Hello,
>>>
>>> we're successfully using a setup with 4 nodes and a replicated Gluster
>>> for storage. The engine is self-hosted. What we're dealing with at the moment is
>>> high availability: if a node fails (for example, simulated by a forced power loss), the
>>> engine comes back online within ~2min. But guests (having the HA option enabled) come
>>> back online only after a very long grace time of ~5min. As we have a reliable network (40
>>> GbE) and reliable servers, I think that the default grace times are way too high for us -
>>> is there any possibility to change those values?
>> And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts?
>> Otherwise we have to resort to relatively long timeouts to make sure the host is really
>> dead.
>> Thanks,
>> michal
>>> Thanks in advance!
>>> Daniel
--------------8D37A5EBA80E36EA0FB88279--