
--Apple-Mail=_30227EB9-EA43-4F62-8A36-DD1FB26157E7
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=utf-8

> On 12 Apr 2018, at 13:13, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:
>
> Hi there,
>
> does anyone have an idea how to decrease a virtual machine's downtime?
>
> Best
> Daniel
>
> On 06.04.2018 13:34, Daniel Menzel wrote:
>> Hi Michal,

Hi Daniel,
adding Martin to review fencing behavior

>> (sorry for misspelling your name in my first mail).

that's not the reason I'm replying late! :-))

>> The settings for the VMs are the following (oVirt 4.2):
>>
>> 1. HA checkbox enabled of course
>> 2. "Target Storage Domain for VM Lease" -> left empty

if you need faster reactions then try to use VM Leases as well. It won't make a difference in this case, but it will help in case of network issues. E.g. if you use iSCSI and the storage connection breaks while the host connection still works, it would restart the VM in about 80s; otherwise it would take >5 mins.

>> 3. "Resume Behavior" -> AUTO_RESUME
>> 4. Priority for Migration -> High
>> 5. "Watchdog Model" -> No-Watchdog
>>
>> For testing we did not kill any VM but the host. So basically we simulated an instantaneous crash by manually turning the machine off via the IPMI interface (not via the operating system!) and pinging the guest(s). What happens then?
>>
>> 1. 2-3 seconds after we press the host's shutdown button we lose ping contact to the VM(s).
>> 2. After another 20s oVirt changes the host's status to "connecting", the VM's status is set to a question mark.
>> 3. After ~1:30 the host is flagged as "non responsive"

that sounds about right. Now the fencing action should have been initiated; if you can share the engine logs we can confirm that. IIRC we first try soft fencing - try to ssh to that host - and that might take some time to time out, I guess. Martin?

>> 4. After ~2:10 the host's reboot is initiated by oVirt, 5-10s later the guest is back online.
>>
>> So, there seems to be one mistake I made in the first mail: The downtime is "only" 2.5min. But still I think this time can be decreased as for some services it is still quite a long time.

these values can be tuned down, but then you may be more susceptible to fencing power cycling a host in case of shorter network outages. It may be ok... depending on your requirements.

>> Best
>> Daniel
>>
>> On 06.04.2018 12:49, Michal Skrivanek wrote:
>>>> On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:
>>>>
>>>> Hi Michael,
>>>> thanks for your mail. Sorry, I forgot to write that. Yes, we have power management and fencing enabled on all hosts. We also tested this and found out that it works perfectly. So this cannot be the reason I guess.
>>>
>>> Hi Daniel,
>>> ok, then it's worth looking into details. Can you describe in more detail what happens? What exact settings are you using for such a VM? Are you killing the HE VM or other VMs or both? Would be good to narrow it down a bit and then review the exact flow.
>>>
>>> Thanks,
>>> michal
>>>
>>>> Daniel
>>>>
>>>> On 06.04.2018 11:11, Michal Skrivanek wrote:
>>>>>> On 4 Apr 2018, at 15:36, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> we're successfully using a setup with 4 nodes and a replicated Gluster for storage. The engine is self-hosted. What we're dealing with at the moment is the high availability: If a node fails (for example simulated by a forced power loss) the engine comes back online within ~2min. But guests (having the HA option enabled) come back online only after a very long grace time of ~5min. As we have a reliable network (40 GbE) and reliable servers I think that the default grace times are way too high for us - is there any possibility to change those values?
>>>>>
>>>>> And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? Otherwise we have to resort to relatively long timeouts to make sure the host is really dead.
>>>>>
>>>>> Thanks,
>>>>> michal
>>>>>
>>>>>> Thanks in advance!
>>>>>> Daniel
>>>>>> _______________________________________________
>>>>>> Users mailing list
>>>>>> Users@ovirt.org
>>>>>> http://lists.ovirt.org/mailman/listinfo/users
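
To see where the reported downtime goes, the milestones Daniel lists in this thread can be turned into per-phase durations. The following Python sketch is only arithmetic on the numbers quoted above. It assumes that "~1:30" and "~2:10" are counted from the moment of the forced power-off, and it replaces the ranges "2-3 seconds" and "5-10s" by their midpoints; the names `MILESTONES`, `phase_durations` and `guest_downtime` are made up for this illustration.

```python
# Failover milestones reported in the thread, in seconds after the forced power-off.
# Assumptions (not stated explicitly in the thread): "~1:30" and "~2:10" are
# measured from the power-off, and ranges are replaced by their midpoints.
MILESTONES = [
    ("guest stops answering ping", 2.5),           # "2-3 seconds"
    ("host shown as 'connecting'", 2.5 + 20),      # "after another 20s"
    ("host flagged 'non responsive'", 90),         # "~1:30"
    ("host reboot initiated by oVirt", 130),       # "~2:10"
    ("guest answers ping again", 130 + 7.5),       # "5-10s later"
]


def phase_durations(milestones):
    """Return (label, seconds) for each gap between consecutive milestones."""
    return [
        (f"{a_label} -> {b_label}", b_time - a_time)
        for (a_label, a_time), (b_label, b_time) in zip(milestones, milestones[1:])
    ]


def guest_downtime(milestones):
    """Seconds between the first milestone (ping lost) and the last (ping restored)."""
    return milestones[-1][1] - milestones[0][1]


if __name__ == "__main__":
    for label, seconds in phase_durations(MILESTONES):
        print(f"{seconds:6.1f} s  {label}")
    print(f"{guest_downtime(MILESTONES):6.1f} s  total guest downtime")
```

Under these assumptions the longest phases are the wait between "connecting" and "non responsive" and the wait between "non responsive" and the reboot, which is consistent with Michal's remark that these are the timeouts that could be tuned down.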
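
The test described in this thread (power the host off via IPMI, then ping the guests) can be made repeatable with a small probe script. This is a minimal Python sketch, not something used by the participants: it assumes a Linux-style `ping` command that accepts `-c 1`, and the function names `ping_once`, `watch` and `outages` are hypothetical helpers introduced here.

```python
import subprocess
import time


def ping_once(host, timeout_s=1.0):
    """Send one ICMP echo with the system `ping`; True if it was answered in time."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],  # assumes a Linux-style ping
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def watch(host, duration_s=600, interval_s=1.0):
    """Probe `host` about once per interval and return (timestamp, reachable) samples."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        started = time.monotonic()
        samples.append((started, ping_once(host)))
        # Sleep for the remainder of the interval, if any is left.
        time.sleep(max(0.0, interval_s - (time.monotonic() - started)))
    return samples


def outages(samples):
    """Turn (timestamp, reachable) samples into (start, end) outage windows.

    An outage starts at the first failed probe and ends at the next successful
    one. An outage that is still open when the samples end is not reported.
    """
    windows = []
    down_since = None
    for timestamp, reachable in samples:
        if not reachable and down_since is None:
            down_since = timestamp
        elif reachable and down_since is not None:
            windows.append((down_since, timestamp))
            down_since = None
    return windows
```

Starting `watch()` against a guest's address shortly before the host is powered off and passing the result to `outages()` yields the downtime as `end - start` for each window, with a resolution of roughly one probe interval.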
--Apple-Mail=_30227EB9-EA43-4F62-8A36-DD1FB26157E7--