
--Apple-Mail=_55FDCE7C-F8D0-4565-BEE6-931F355AA1E1
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=utf-8

> On 23 Apr 2018, at 10:52, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:
>
> Hi Michal,
>
> in your last mail you wrote, that the values can be turned down - how can this be done?

this is not anything we change very often as it then decreases the system’s tolerance to short network glitches
You’d have to take a look at vdc_options and play with some of those parameters…Martin/Eli may have some suggestions, otherwise you’d have to read the source code and experiment

> Best
> Daniel
>
> On 12.04.2018 20:29, Michal Skrivanek wrote:
>>> On 12 Apr 2018, at 13:13, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:
>>>
>>> Hi there,
>>>
>>> does anyone have an idea how to decrease a virtual machine's downtime?
>>>
>>> Best
>>> Daniel
>>>
>>> On 06.04.2018 13:34, Daniel Menzel wrote:
>>>> Hi Michal,
>>
>> Hi Daniel,
>> adding Martin to review fencing behavior
>>
>>>> (sorry for misspelling your name in my first mail).
>>
>> that’s not the reason I’m replying late!:-))
>>
>>>> The settings for the VMs are the following (oVirt 4.2):
>>>>
>>>> 1. HA checkbox enabled of course
>>>> 2. "Target Storage Domain for VM Lease" -> left empty
>>
>> if you need faster reactions then try to use VM Leases as well, it won’t make a difference in this case but will help in case of network issues. E.g. if you use iSCSI and the storage connection breaks while host connection still works it would restart the VM in about 80s; otherwise it would take >5 mins.
>>
>>>> 3. "Resume Behavior" -> AUTO_RESUME
>>>> 4. Priority for Migration -> High
>>>> 5. "Watchdog Model" -> No-Watchdog
>>>>
>>>> For testing we did not kill any VM but the host. So basically we simulated an instantaneous crash by manually turning the machine off via IPMI-Interface (not via operating system!) and ping the guest(s). What happens then?
>>>>
>>>> 1. 2-3 seconds after we press the host's shutdown button we lose ping contact to the VM(s).
>>>> 2. After another 20s oVirt changes the host's status to "connecting", the VM's status is set to a question mark.
>>>> 3. After ~1:30 the host is flagged to "non responsive"
>>
>> that sounds about right. Now fencing action should have been initiated, if you can share the engine logs we can confirm that. IIRC we first try soft fencing - try to ssh to that host, that might take some time to time out I guess. Martin?
>>
>>>> 4. After ~2:10 the host's reboot is initiated by oVirt, 5-10s later the guest is back online.
>>>>
>>>> So, there seems to be one mistake I made in the first mail: The downtime is "only" 2.5min. But still I think this time can be decreased as for some services it is still quite a long time.
>>
>> these values can be tuned down, but then you may be more susceptible to fencing power cycling a host in case of shorter network outages. It may be ok…depending on your requirements.
>>
>>>> Best
>>>> Daniel
>>>>
>>>> On 06.04.2018 12:49, Michal Skrivanek wrote:
>>>>>> On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:
>>>>>>
>>>>>> Hi Michael,
>>>>>> thanks for your mail. Sorry, I forgot to write that. Yes, we have power management and fencing enabled on all hosts. We also tested this and found out that it works perfectly. So this cannot be the reason I guess.
>>>>>
>>>>> Hi Daniel,
>>>>> ok, then it’s worth looking into details. Can you describe in more detail what happens? What exact settings you’re using for such VM? Are you killing the HE VM or other VMs or both? Would be good to narrow it down a bit and then review the exact flow
>>>>>
>>>>> Thanks,
>>>>> michal
>>>>>
>>>>>> Daniel
>>>>>>
>>>>>> On 06.04.2018 11:11, Michal Skrivanek wrote:
>>>>>>>> On 4 Apr 2018, at 15:36, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> we're successfully using a setup with 4 Nodes and a replicated Gluster for storage. The engine is self hosted. What we're dealing with at the moment is the high availability: If a node fails (for example simulated by a forced power loss) the engine comes back up online within ~2min. But guests (having the HA option enabled) come back online only after a very long grace time of ~5min. As we have a reliable network (40 GbE) and reliable servers I think that the default grace times are way too high for us - is there any possibility to change those values?
>>>>>>>
>>>>>>> And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? Otherwise we have to resort to relatively long timeouts to make sure the host is really dead
>>>>>>>
>>>>>>> Thanks,
>>>>>>> michal
>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users@ovirt.org
>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
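
To see where the ~2.5 min of downtime is spent, the milestones Daniel reports can be laid out as a simple timeline. This is only a rough sketch and not something anyone in the thread posted: it assumes that "~1:30" and "~2:10" are measured from the moment the host is powered off, and it takes the upper bound of the "2-3 s" and "5-10 s" ranges. The names `MILESTONES`, `phase_durations` and `longest_phase` are made up for illustration.

```python
# Milestones reported in the thread, in seconds after the host was powered off.
# Assumptions (not stated in the thread): "~1:30" and "~2:10" are counted from
# the power-off, and the ranges "2-3 s" / "5-10 s" use their upper bound.
MILESTONES = [
    ("host powered off via IPMI", 0),
    ("ping to guest lost", 3),
    ("host status 'connecting'", 3 + 20),
    ("host flagged 'non responsive'", 90),
    ("host reboot initiated by oVirt", 130),
    ("guest back online", 130 + 10),
]


def phase_durations(milestones):
    """Return (from_event, to_event, seconds) for each consecutive pair."""
    return [
        (a_name, b_name, b_t - a_t)
        for (a_name, a_t), (b_name, b_t) in zip(milestones, milestones[1:])
    ]


def longest_phase(milestones):
    """Return the consecutive pair of events separated by the largest gap."""
    return max(phase_durations(milestones), key=lambda phase: phase[2])


if __name__ == "__main__":
    for start, end, secs in phase_durations(MILESTONES):
        print(f"{secs:4d} s  {start} -> {end}")
    print("longest:", longest_phase(MILESTONES))
```

Under these assumptions, most of the time is spent waiting for the host to move from "connecting" to "non responsive" and then for the reboot to be initiated, so those waits are presumably what the timeouts Michal mentions would affect. Which vdc_options entries control them is not stated in the thread.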
--Apple-Mail=_55FDCE7C-F8D0-4565-BEE6-931F355AA1E1--