Decrease downtime for HA

Daniel Menzel

4 Apr 2018

9:36 a.m.

Hello,

we're successfully using a setup with 4 nodes and a replicated Gluster for storage. The engine is self-hosted. What we're dealing with at the moment is high availability: if a node fails (for example, simulated by a forced power loss), the engine comes back online within ~2 min. But guests (with the HA option enabled) come back online only after a very long grace time of ~5 min. As we have a reliable network (40 GbE) and reliable servers, I think that the default grace times are way too high for us. Is there any possibility to change those values?

Thanks in advance!
Daniel


Michal Skrivanek

6 Apr 2018

5:11 a.m.


And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? Otherwise we have to resort to relatively long timeouts to make sure the host is really dead.

Thanks,
michal


_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users

Daniel Menzel

6:45 a.m.

Hi Michael,

thanks for your mail. Sorry, I forgot to mention that. Yes, we have power management and fencing enabled on all hosts. We also tested this and found that it works perfectly, so I guess this cannot be the reason.

Daniel


Michal Skrivanek

6:49 a.m.


Hi Daniel,

ok, then it’s worth looking into the details. Can you describe in more detail what happens? What exact settings are you using for such a VM? Are you killing the HE VM, other VMs, or both? It would be good to narrow it down a bit and then review the exact flow.

Thanks,
michal


Daniel Menzel

7:34 a.m.

Hi Michal,

(sorry for misspelling your name in my first mail).

The settings for the VMs are the following (oVirt 4.2):

1. HA checkbox enabled of course
2. "Target Storage Domain for VM Lease" -> left empty
3. "Resume Behavior" -> AUTO_RESUME
4. Priority for Migration -> High
5. "Watchdog Model" -> No-Watchdog

For testing we did not kill any VM but the host. So basically we simulated an instantaneous crash by manually turning the machine off via the IPMI interface (not via the operating system!) while pinging the guest(s). What happens then?

1. 2-3 seconds after we press the host's shutdown button we lose ping contact to the VM(s).
2. After another 20s oVirt changes the host's status to "connecting", and the VM's status is set to a question mark.
3. After ~1:30 the host is flagged as "non responsive".
4. After ~2:10 the host's reboot is initiated by oVirt; 5-10s later the guest is back online.

So, there seems to be one mistake I made in the first mail: the downtime is "only" 2.5min. But I still think this time can be decreased, as for some services it is still quite a long time.

Best
Daniel
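The timeline in this message can be broken down into phases to see where the downtime accumulates. The following is only a rough sketch in Python: it assumes that "~1:30" and "~2:10" are measured from the moment of power-off (the mail does not say so explicitly), and it replaces the ranges "2-3 seconds" and "5-10s" by their midpoints. The event labels and function names are made up for illustration.

```python
# Cumulative timestamps in seconds after power-off, taken from the report.
# Assumption: "~1:30" and "~2:10" count from power-off; ranges use midpoints.
EVENTS = [
    ("host powered off via IPMI", 0.0),
    ("ping to guest lost", 2.5),                # "2-3 seconds"
    ("host status -> connecting", 22.5),        # "another 20s"
    ("host flagged non responsive", 90.0),      # "~1:30"
    ("host reboot initiated by oVirt", 130.0),  # "~2:10"
    ("guest back online", 137.5),               # "5-10s later"
]


def phase_durations(events):
    """Return (from_label, to_label, seconds) for each consecutive pair."""
    return [
        (a_label, b_label, b_time - a_time)
        for (a_label, a_time), (b_label, b_time) in zip(events, events[1:])
    ]


def guest_downtime(events):
    """Seconds between losing ping and the guest answering again."""
    times = dict(events)
    return times["guest back online"] - times["ping to guest lost"]


if __name__ == "__main__":
    for start, end, seconds in phase_durations(EVENTS):
        print(f"{start} -> {end}: {seconds:.1f}s")
    print(f"guest downtime: {guest_downtime(EVENTS):.1f}s")
```

Under these assumptions, most of the time passes between "connecting" and "non responsive" and between "non responsive" and the reboot, while the restart of the guest itself is only a small part of the total.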


Daniel Menzel

12 Apr 2018

7:13 a.m.

Hi there,

does anyone have an idea how to decrease a virtual machine's downtime?

Best
Daniel


Michal Skrivanek

2:29 p.m.

On 06.04.2018 13:34, Daniel Menzel wrote:

> Hi Michal,

Hi Daniel,
adding Martin to review fencing behavior

> (sorry for misspelling your name in my first mail).

that’s not the reason I’m replying late! :-))

> The settings for the VMs are the following (oVirt 4.2):
>
> 1. HA checkbox enabled of course
> 2. "Target Storage Domain for VM Lease" -> left empty

if you need faster reactions then try to use VM Leases as well, it won’t make a difference in this case but will help in case of network issues. E.g. if you use iSCSI and the storage connection breaks while host connection still works it would restart the VM in about 80s; otherwise it would take >5 mins.

> 3. "Resume Behavior" -> AUTO_RESUME
> 4. Priority for Migration -> High
> 5. "Watchdog Model" -> No-Watchdog
>
> For testing we did not kill any VM but the host. So basically we simulated an instantaneous crash by manually turning the machine off via IPMI-Interface (not via operating system!) and ping the guest(s). What happens then?
>
> 1. 2-3 seconds after the we press the host's shutdown button we lose ping contact to the VM(s).
> 2. After another 20s oVirt changes the host's status to "connecting", the VM's status is set to a question mark.
> 3. After ~1:30 the host is flagged to "non responsive"

that sounds about right. Now fencing action should have been initiated, if you can share the engine logs we can confirm that. IIRC we first try soft fencing - try to ssh to that host, that might take some time to time out I guess. Martin?

> 4. After ~2:10 the host's reboot is initiated by oVirt, 5-10s later the guest is back online.
>
> So, there seems to be one mistake I made in the first mail: The downtime is "only" 2.5min. But still I think this time can be decreased as for some services it is still quite a long time.

these values can be tuned down, but then you may be more susceptible to fencing power cycling a host in case of shorter network outages. It may be ok… depending on your requirements.
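The trade-off mentioned in the reply above (shorter timeouts mean faster recovery, but also more hosts power cycled during brief network outages) can be illustrated with a deliberately simplified model. This is only a sketch: it treats fencing as a single grace period, whereas the thread describes several stages ("connecting", "non responsive", soft fencing). The function name and the sample outage durations are hypothetical and are not oVirt settings or measured data; only the 130s value corresponds to the ~2:10 reported in the thread.

```python
def outages_that_trigger_fencing(outage_seconds, grace_seconds):
    """Return the outages that last at least as long as the grace period.

    Simplified model: a host is fenced (power cycled) whenever it stays
    unreachable for grace_seconds or longer.
    """
    return [s for s in outage_seconds if s >= grace_seconds]


# Hypothetical network interruptions in seconds; illustrative values only.
SAMPLE_OUTAGES = [5, 20, 45, 100, 300]

if __name__ == "__main__":
    # 130s roughly matches the ~2:10 observed before the reboot was initiated.
    for grace in (130, 60, 30):
        fenced = outages_that_trigger_fencing(SAMPLE_OUTAGES, grace)
        print(f"grace {grace:>3}s: {len(fenced)} of {len(SAMPLE_OUTAGES)} "
              f"outages would lead to fencing: {fenced}")
```

In this toy model, lowering the grace period from 130s to 30s turns two additional transient outages into host reboots, which is the kind of cost that has to be weighed against the shorter guest downtime.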


Daniel Menzel

23 Apr 23 Apr

4:52 a.m.

Hi Michal,

in your last mail you wrote, that the values can be turned down - how can this be done?

Best Daniel

On 12.04.2018 20:29, Michal Skrivanek wrote:

On 12 Apr 2018, at 13:13, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:

Hi there,

does anyone have an idea how to decrease a virtual machine's downtime?

Best Daniel

On 06.04.2018 13:34, Daniel Menzel wrote:

Hi Michal,

Hi Daniel, adding Martin to review fencing behavior

(sorry for misspelling your name in my first mail).

that’s not the reason I’m replying late! :-))

The settings for the VMs are the following (oVirt 4.2):

1. HA checkbox enabled of course
2. "Target Storage Domain for VM Lease" -> left empty

if you need faster reactions then try to use VM Leases as well, it won’t make a difference in this case but will help in case of network issues. E.g. if you use iSCSI and the storage connection breaks while host connection still works it would restart the VM in about 80s; otherwise it would take >5 mins.

3. "Resume Behavior" -> AUTO_RESUME
4. Priority for Migration -> High
5. "Watchdog Model" -> No-Watchdog

For testing we did not kill any VM but the host. So basically we simulated an instantaneous crash by manually turning the machine off via IPMI-Interface (not via operating system!) and ping the guest(s). What happens then?

1. 2-3 seconds after we press the host's shutdown button we lose ping contact to the VM(s).
2. After another 20s oVirt changes the host's status to "connecting", the VM's status is set to a question mark.
3. After ~1:30 the host is flagged as "non responsive".

that sounds about right. Now fencing action should have been initiated, if you can share the engine logs we can confirm that. IIRC we first try soft fencing - try to ssh to that host, that might take some time to time out I guess. Martin?

4. After ~2:10 the host's reboot is initiated by oVirt, 5-10s later the guest is back online.

So, there seems to be one mistake I made in the first mail: The downtime is "only" 2.5min. But still I think this time can be decreased as for some services it is still quite a long time.
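To see where the time goes, the timeline in this mail can be split into phases. The sketch below is only arithmetic on the figures quoted above. It assumes that "~1:30" and "~2:10" are measured from the moment of power-off, and it takes the upper end of each range (3 s and 10 s). The event labels are informal names, not oVirt states.

```python
# Timestamps in seconds after power-off, taken from the numbered list above.
# Assumption: "~1:30" and "~2:10" are counted from the moment of power-off.
EVENTS = [
    ("host powered off", 0),
    ("ping to VM lost", 3),                 # "2-3 seconds"
    ("host shown as connecting", 23),       # "another 20s"
    ("host shown as non responsive", 90),   # "~1:30"
    ("reboot initiated by oVirt", 130),     # "~2:10"
    ("guest back online", 140),             # "5-10s later"
]


def phase_durations(events):
    """Return (from_event, to_event, seconds) for each consecutive pair."""
    return [
        (a_name, b_name, b_time - a_time)
        for (a_name, a_time), (b_name, b_time) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for start, end, seconds in phase_durations(EVENTS):
        print(f"{start:30} -> {end:30} {seconds:4d} s")
    print("total downtime:", EVENTS[-1][1], "s")
```

Under these assumptions the two longest phases are the wait in "connecting" (about 67 s) and the gap between "non responsive" and the power-cycle (about 40 s), so those are the parts worth examining in the engine logs.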

these values can be tuned down, but then you may be more susceptible to fencing power cycling a host in case of shorter network outages. It may be ok…depending on your requirements.

Best Daniel

On 06.04.2018 12:49, Michal Skrivanek wrote:

On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:

Hi Michael, thanks for your mail. Sorry, I forgot to write that. Yes, we have power management and fencing enabled on all hosts. We also tested this and found out that it works perfectly. So this cannot be the reason I guess.

Hi Daniel, ok, then it’s worth looking into details. Can you describe in more detail what happens? What exact settings you’re using for such VM? Are you killing the HE VM or other VMs or both? Would be good to narrow it down a bit and then review the exact flow

Thanks, michal

Daniel

On 06.04.2018 11:11, Michal Skrivanek wrote:

> On 4 Apr 2018, at 15:36, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:
>
> Hello,
>
> we're successfully using a setup with 4 Nodes and a replicated Gluster for storage. The engine is self hosted. What we're dealing with at the moment is the high availability: If a node fails (for example simulated by a forced power loss) the engine comes back up online within ~2min. But guests (having the HA option enabled) come back online only after a very long grace time of ~5min. As we have a reliable network (40 GbE) and reliable servers I think that the default grace times are way too high for us - is there any possibility to change those values?

And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? Otherwise we have to resort to relatively long timeouts to make sure the host is really dead

Thanks, michal

> Thanks in advance!
> Daniel


Michal Skrivanek

1:06 p.m.

On 23 Apr 2018, at 10:52, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:

Hi Michal,

in your last mail you wrote, that the values can be turned down - how can this be done?


this is not anything we change very often as it then decreases the system’s tolerance to short network glitches.

You’d have to take a look at vdc_options and play with some of those parameters… Martin/Eli may have some suggestions, otherwise you’d have to read the source code and experiment.
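As a rough illustration of how such parameters are usually inspected: vdc_options is the engine's configuration table, and the `engine-config` tool on the engine machine reads and writes it with `-g KEY` and `-s KEY=VALUE`. The key names below are recalled from oVirt documentation and are not mentioned anywhere in this thread, so treat them as assumptions and check them against `engine-config -l` on your own engine. The script only prints the command lines; it does not run them.

```python
import shlex

# Assumed key names (recalled from oVirt documentation, not from this thread).
# Verify each one with `engine-config -l` before relying on it.
CANDIDATE_KEYS = (
    "TimeoutToResetVdsInSeconds",
    "VDSAttemptsToResetCount",
    "vdsTimeout",
)


def get_command(key):
    """Return the argv that prints the current value of `key`."""
    return ["engine-config", "-g", key]


def set_command(key, value):
    """Return the argv that stores a new value for `key`."""
    return ["engine-config", "-s", f"{key}={value}"]


def render(argv):
    """Format an argv list as a copy-and-paste shell line."""
    return shlex.join(argv)


if __name__ == "__main__":
    for key in CANDIDATE_KEYS:
        print(render(get_command(key)))
    # Hypothetical change, shown only to illustrate the syntax.
    print(render(set_command("TimeoutToResetVdsInSeconds", 30)))
```

A changed value typically takes effect only after the ovirt-engine service is restarted. As noted above, shorter timeouts make the setup less tolerant of brief network glitches.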


Eli Mesika

25 Apr 25 Apr

4:47 a.m.

On Mon, Apr 23, 2018 at 8:06 PM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:

On 23 Apr 2018, at 10:52, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:

Hi Michal,

in your last mail you wrote, that the values can be turned down - how can this be done?

AFAIK, there is no point in changing the fencing vdc_options values in that case (assuming no kdump is configured here ...).

The fencing mechanism puts the host in "connecting" state for a grace period that depends on its number of running VMs and on whether it serves as SPM or not. Once the host becomes non-responsive, we first try a soft fence (restart VDSM via ssh); this also takes time. After that point, if the soft fence fails, the host is rebooted via the fencing script, and the time that takes depends entirely on the host.

If there is something to look at, it is your host's reboot time: if the host reboots faster, the whole process will take less time ...

Regards
Eli
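The grace period described here can be written as a small formula. The sketch below follows the formula as recalled from the oVirt documentation on non-responsive hosts: a base timeout, plus a per-VM delay, plus an extra delay if the host is SPM. The parameter names and default values (60 s, 0.5 s per VM, 20 s) are assumptions that are not stated in this thread; check them on your own engine before drawing conclusions.

```python
def connecting_grace_seconds(running_vms, is_spm,
                             timeout_to_reset=60.0,
                             delay_per_vm=0.5,
                             delay_for_spm=20.0):
    """Estimate how long a host stays in 'connecting' before it is fenced.

    The defaults mirror the values recalled for TimeoutToResetVdsInSeconds,
    DelayResetPerVmInSeconds and DelayResetForSpmInSeconds; they are
    assumptions, not figures taken from this thread.
    """
    spm_delay = delay_for_spm if is_spm else 0.0
    return timeout_to_reset + delay_per_vm * running_vms + spm_delay


if __name__ == "__main__":
    # A host running 10 VMs that is also SPM.
    print(connecting_grace_seconds(10, True))
    # A host running 4 VMs that is not SPM.
    print(connecting_grace_seconds(4, False))
```

With these assumed defaults, the number of VMs contributes only a few seconds; the base timeout dominates the grace period.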

...

Best Daniel

On 12.04.2018 20:29, Michal Skrivanek wrote:

On 12 Apr 2018, at 13:13, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:

Hi there,

does anyone have an idea how to decrease a virtual machine's downtime?

Best Daniel

On 06.04.2018 13:34, Daniel Menzel wrote:

Hi Michal,

Hi Daniel, adding Martin to review fencing behavior

(sorry for misspelling your name in my first mail).

that’s not the reason I’m replying late!:-))

The settings for the VMs are the following (oVirt 4.2):

1. HA checkbox enabled, of course
2. "Target Storage Domain for VM Lease" -> left empty

if you need faster reactions then try to use VM Leases as well, it won’t make a difference in this case but will help in case of network issues. E.g. if you use iSCSI and the storage connection breaks while host connection still works it would restart the VM in about 80s; otherwise it would take >5 mins.

3. "Resume Behavior" -> AUTO_RESUME
4. Priority for Migration -> High
5. "Watchdog Model" -> No-Watchdog

For testing we did not kill any VM but the host. So basically we simulated an instantaneous crash by manually turning the machine off via the IPMI interface (not via the operating system!) while pinging the guest(s). What happens then?

1. 2-3 seconds after we press the host's shutdown button we lose ping contact to the VM(s).
2. After another 20s oVirt changes the host's status to "connecting", and the VM's status is set to a question mark.
3. After ~1:30 the host is flagged as "non responsive".

that sounds about right. Now the fencing action should have been initiated; if you can share the engine logs we can confirm that. IIRC we first try soft fencing (try to ssh to that host), and that might take some time to time out, I guess. Martin?

4. After ~2:10 the host's reboot is initiated by oVirt; 5-10s later the guest is back online.

So, there seems to be one mistake I made in the first mail: The downtime is "only" 2.5min. But still I think this time can be decreased as for some services it is still quite a long time.

these values can be tuned down, but then you may be more susceptible to fencing power cycling a host in case of shorter network outages. It may be ok…depending on your requirements.
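As a rough aid for seeing where the downtime goes, the timeline reported above can be tabulated per phase. This is only a sketch: it assumes the quoted times are cumulative from the moment the host is powered off, picks single values for the ranges given (3s for "2-3 seconds", 10s for "5-10s"), and uses a hypothetical helper name.

```python
# Cumulative timestamps in seconds since power-off, taken from the report
# above under the stated assumptions.
EVENTS = [
    ("power off", 0),
    ("ping lost", 3),
    ("host 'connecting'", 23),
    ("host 'non responsive'", 90),
    ("reboot initiated", 130),
    ("guest back online", 140),
]


def phase_durations(events):
    """Return (phase label, seconds) for each pair of consecutive events."""
    return [
        (f"{name_a} -> {name_b}", t_b - t_a)
        for (name_a, t_a), (name_b, t_b) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for label, seconds in phase_durations(EVENTS):
        print(f"{label}: {seconds}s")
```

Read this way, the longest single phase is the wait between "connecting" and "non responsive", followed by the time until the reboot is initiated.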

Best Daniel

On 06.04.2018 12:49, Michal Skrivanek wrote:

On 6 Apr 2018, at 12:45, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:

Hi Michael, thanks for your mail. Sorry, I forgot to write that. Yes, we have power management and fencing enabled on all hosts. We also tested this and found out that it works perfectly. So this cannot be the reason I guess.

Hi Daniel, ok, then it’s worth looking into details. Can you describe in more detail what happens? What exact settings you’re using for such VM? Are you killing the HE VM or other VMs or both? Would be good to narrow it down a bit and then review the exact flow

Thanks, michal

Daniel

On 06.04.2018 11:11, Michal Skrivanek wrote:

On 4 Apr 2018, at 15:36, Daniel Menzel <daniel.menzel@hhi.fraunhofer.de> wrote:

Hello,

we're successfully using a setup with 4 nodes and a replicated Gluster for storage. The engine is self-hosted. What we're dealing with at the moment is the high availability: if a node fails (for example, simulated by a forced power loss), the engine comes back online within ~2min. But guests (having the HA option enabled) come back online only after a very long grace time of ~5min. As we have a reliable network (40 GbE) and reliable servers, I think that the default grace times are way too high for us. Is there any possibility to change those values?

And do you have Power Management (iLO, iDRAC, etc.) configured for your hosts? Otherwise we have to resort to relatively long timeouts to make sure the host is really dead.

Thanks, michal

Thanks in advance! Daniel

