<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 6, 2017 at 1:45 PM, Matthew Trent <span dir="ltr">&lt;<a href="mailto:Matthew.Trent@lewiscountywa.gov" target="_blank">Matthew.Trent@lewiscountywa.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Thanks for the replies, all!<br>

<br>

Yep, Chris is right. TrueNAS HA is active/passive and there isn&#39;t a way around that when failing over between heads.<br></blockquote><div><br></div><div>General comment: 30 seconds is A LOT. Many application-level I/O operations may time out in that window. Most storage systems strive to stay below that.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Sven: In my experience with iX support, they have directed me to reboot the active node to initiate failover. There are &quot;hactl takeover&quot; and &quot;hactl giveback&quot; commands, but reboot seems to be their preferred method.<br>

<br>

VMs going into a paused state and resuming when storage is back online sounds great. As long as oVirt&#39;s pause/resume isn&#39;t significantly slower than the 30-or-so seconds the TrueNAS takes to complete its failover, that&#39;s a pretty tolerable interruption for my needs. So my next questions are:<br>

<br>

1) Assuming the SAN failover DOES work correctly, can anyone comment on their experience with oVirt pausing/thawing VMs in an NFS-based active/passive SAN failover scenario? Does it work reliably without intervention? Is it reasonably fast?<br></blockquote><div><br></div><div>oVirt itself is not pausing VMs. qemu-kvm pauses the specific VM that issued an I/O request when that request gets stuck. The reason is that the VM cannot safely continue without risking data loss (the data is in flight somewhere, right? Host kernel, NIC buffers, etc.).</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

2) Is there anything else in the oVirt stack that might cause it to &quot;freak out&quot; rather than gracefully pause/unpause VMs?<br></blockquote><div><br></div><div>We do monitor storage domain health regularly. We are working on ignoring short hiccups (see <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1459370">https://bugzilla.redhat.com/show_bug.cgi?id=1459370</a> for example).</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

2a) Particularly: I&#39;m running hosted engine on the same TrueNAS storage. Does that change anything WRT timeouts and oVirt&#39;s HA and fencing and sanlock and such?<br>

<br>

2b) Is there a limit to how long oVirt will wait for storage before doing something more drastic than just pausing VMs?<br></blockquote><div><br></div><div>As explained above, generally no. We can&#39;t do much, tbh, and we&#39;d like to ensure there is no data loss.</div><div>That being said, in extreme cases hosts may become unresponsive. If you have fencing configured, they may even be fenced (there&#39;s an option to fence a host which cannot renew its storage lease). We have not seen that happen for quite some time, though, and I don&#39;t expect short storage hiccups to cause it.</div><div>Depending on your application, it may be the right thing to do, btw.</div><div>Y.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<span class="gmail-"><br>

--<br>

Matthew Trent<br>

Network Engineer<br>

Lewis County IT Services<br>

</span><a href="tel:360.740.1247" value="+13607401247">360.740.1247</a> - Helpdesk<br>

<a href="tel:360.740.3343" value="+13607403343">360.740.3343</a> - Direct line<br>

<br>

______________________________<wbr>__________<br>

From: <a href="mailto:users-bounces@ovirt.org">users-bounces@ovirt.org</a> &lt;<a href="mailto:users-bounces@ovirt.org">users-bounces@ovirt.org</a>&gt; on behalf of Chris Adams &lt;<a href="mailto:cma@cmadams.net">cma@cmadams.net</a>&gt;<br>

Sent: Tuesday, June 6, 2017 7:21 AM<br>

To: <a href="mailto:users@ovirt.org">users@ovirt.org</a><br>

Subject: Re: [ovirt-users] Seamless SAN HA failovers with oVirt?<br>

<div class="gmail-HOEnZb"><div class="gmail-h5"><br>

Once upon a time, Juan Pablo &lt;<a href="mailto:pablo.localhost@gmail.com">pablo.localhost@gmail.com</a>&gt; said:<br>

&gt; Chris, if you have active-active with multipath: you upgrade one system,<br>

&gt; reboot it, check it came active again, then upgrade the other.<br>

<br>

Yes, but that&#39;s still not how a TrueNAS (and most other low- to<br>

mid-range SANs) works, so is not relevant.  The TrueNAS only has a<br>

single active node talking to the hard drives at a time, because having<br>

two nodes talking to the same storage at the same time is a hard problem<br>

to solve (typically requires custom hardware with active cache coherency<br>

and such).<br>

<br>

You can (and should) use multipath between servers and a TrueNAS, and<br>

that protects against NIC, cable, and switch failures, but does not help<br>

with a controller failure/reboot/upgrade.  Multipath is also used to<br>

provide better bandwidth sharing between links than ethernet LAGs.<br>

<br>

--<br>

Chris Adams &lt;<a href="mailto:cma@cmadams.net">cma@cmadams.net</a>&gt;<br>

______________________________<wbr>_________________<br>

Users mailing list<br>

<a href="mailto:Users@ovirt.org">Users@ovirt.org</a><br>

<a href="http://lists.ovirt.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://lists.ovirt.org/<wbr>mailman/listinfo/users</a><br>

</div></div></blockquote></div><br></div></div>