[ovirt-users] VMs paused due to IO issues - Dell Equallogic controller failover

Gary Lloyd g.lloyd at keele.ac.uk
Fri Oct 7 07:37:07 UTC 2016


From the sounds of it, the best we can do then is to use a 60-second timeout
on paths in multipathd.
The main reason we use Direct LUN is that we replicate/snapshot the VMs'
associated LUNs at SAN level as a means of disaster recovery.
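
As a rough sketch of what a 60-second timeout means in multipath.conf terms:
as far as I understand it, a numeric no_path_retry is counted in path-checker
intervals, so the queueing window is roughly no_path_retry multiplied by
polling_interval. The helper names below are made up for illustration, and the
5-second polling_interval is an assumption (the usual multipath-tools default),
so check the value actually set on the hosts.

```python
def queue_window_seconds(no_path_retry, polling_interval=5):
    """Approximate time multipathd keeps queueing I/O once all paths are down.

    Returns None for "queue" (no limit) and 0 for "fail" (no queueing).
    polling_interval=5 is an assumed default, not read from any host.
    """
    if no_path_retry == "queue":
        return None
    if no_path_retry == "fail":
        return 0
    return int(no_path_retry) * polling_interval


def retries_for_timeout(timeout_seconds, polling_interval=5):
    """Smallest numeric no_path_retry whose window covers timeout_seconds."""
    return -(-timeout_seconds // polling_interval)  # ceiling division


print(queue_window_seconds(12))   # 60
print(retries_for_timeout(60))    # 12
```

Under those assumptions, the "12" in the Dell quote below already corresponds
to about 60 seconds of queueing, while "queue" has no upper bound at all.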

I have read some of the documentation on how to back up virtual machines in
storage domains, but the process of mounting snapshots for all our machines
within a dedicated VM doesn't seem as efficient when we have almost 300
virtual machines and only 1Gb networking.

Thanks for the advice.

*Gary Lloyd*
________________________________________________
I.T. Systems:Keele University
Finance & IT Directorate
Keele:Staffs:IC1 Building:ST5 5NB:UK
+44 1782 733063
________________________________________________

On 6 October 2016 at 11:07, Nir Soffer <nsoffer at redhat.com> wrote:

> On Thu, Oct 6, 2016 at 10:19 AM, Gary Lloyd <g.lloyd at keele.ac.uk> wrote:
>
>> I asked on the Dell Storage Forum and they recommend the following:
>>
>> *I recommend not using a numeric value for the "no_path_retry" variable
>> within /etc/multipath.conf as once that numeric value is reached, if no
>> healthy LUNs were discovered during that defined time multipath will
>> disable the I/O queue altogether.*
>>
>> *I do recommend, however, changing the variable value from "12" (or even
>> "60") to "queue" which will then allow multipathd to continue queueing I/O
>> until a healthy LUN is discovered (time of fail-over between controllers)
>> and I/O is allowed to flow once again.*
>>
>> Can you see any issues with this recommendation as far as oVirt is
>> concerned?
>>
> Yes, we cannot work with an unlimited queue. This will block vdsm for an
> unlimited time when the next command tries to access storage. Because we
> don't have good isolation between different storage domains, this may
> cause other storage domains to become faulty. Also, engine flows that
> have a timeout will fail with a timeout.
>
> If you are on 3.x, this will be very painful; on 4.0 it should be better,
> but it is not recommended.
>
> Nir
>
>