From the sound of it, the best we can do is to use a 60-second timeout on paths in multipathd.
The main reason we use Direct LUN is that we replicate and snapshot the LUNs associated with our VMs at the SAN level as a means of disaster recovery.
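
As a rough sketch of what a 60-second limit means in multipath terms: as far as I understand it, a numeric "no_path_retry" is a count of path-checker retries, so I/O is queued for approximately no_path_retry x polling_interval seconds before queueing is disabled. The 5-second polling_interval below is an assumption (I believe it is the multipath default, but check your own /etc/multipath.conf), and the function names are made up for illustration only.

```python
import math


def retries_for_timeout(timeout_s, polling_interval_s=5.0):
    """Smallest numeric no_path_retry that queues I/O for at least timeout_s.

    Assumes queueing time is roughly no_path_retry * polling_interval.
    """
    if timeout_s < 0 or polling_interval_s <= 0:
        raise ValueError("timeout must be >= 0 and polling interval > 0")
    return math.ceil(timeout_s / polling_interval_s)


def queue_seconds(no_path_retry, polling_interval_s=5.0):
    """Approximate time I/O stays queued after all paths have failed."""
    if no_path_retry == "queue":
        return math.inf  # queue forever: the setting vdsm cannot cope with
    if no_path_retry == "fail":
        return 0.0       # fail I/O immediately, no queueing
    return int(no_path_retry) * polling_interval_s


if __name__ == "__main__":
    # With the assumed 5-second polling interval, 60 seconds is 12 retries.
    print(retries_for_timeout(60))   # 12
    print(queue_seconds(12))         # 60.0
    print(queue_seconds("queue"))    # inf
```

Under that assumption, a numeric value keeps the wait bounded, whereas "queue" has no upper bound at all.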

I have read some of the documentation on how to back up virtual machines in storage domains, but mounting snapshots for all our machines within a dedicated VM doesn't seem very efficient when we have almost 300 virtual machines and only 1Gb networking.

Thanks for the advice.

Gary Lloyd
________________________________________________
I.T. Systems:Keele University
Finance & IT Directorate
Keele:Staffs:IC1 Building:ST5 5NB:UK
+44 1782 733063
________________________________________________

On 6 October 2016 at 11:07, Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, Oct 6, 2016 at 10:19 AM, Gary Lloyd <g.lloyd@keele.ac.uk> wrote:
I asked on the Dell Storage Forum and they recommend the following:

I recommend not using a numeric value for the "no_path_retry" variable within /etc/multipath.conf, because once that numeric value is reached, if no healthy LUNs were discovered during that defined time, multipath will disable the I/O queue altogether.

I do recommend, however, changing the variable value from "12" (or even "60") to "queue", which will then allow multipathd to continue queueing I/O until a healthy LUN is discovered (time of fail-over between controllers) and I/O is allowed to flow once again.

Can you see any issues with this recommendation as far as oVirt is concerned?

Yes, we cannot work with an unlimited queue. This will block vdsm for an
unlimited time when the next command tries to access storage. Because we don't
have good isolation between different storage domains, this may cause other
storage domains to become faulty. Also, engine flows that have a timeout will
fail with a timeout.

If you are on 3.x, this will be very painful. On 4.0 it should be better, but it
is not recommended.

Nir