[ovirt-users] VMs paused due to IO issues - Dell Equallogic controller failover
Gary Lloyd
g.lloyd at keele.ac.uk
Thu Oct 6 07:19:36 UTC 2016
I asked on the Dell Storage Forum and they recommend the following:
I recommend not using a numeric value for the "no_path_retry" variable
within /etc/multipath.conf, because once that numeric value is reached, if no
healthy LUNs were discovered during that defined time, multipath will
disable the I/O queue altogether.
I do recommend, however, changing the variable value from "12" (or even
"60") to "queue", which will then allow multipathd to continue queuing I/O
until a healthy LUN is discovered (the time of fail-over between controllers)
and I/O is allowed to flow once again.
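
For reference, a minimal sketch of how that change would look in
/etc/multipath.conf (only the no_path_retry line is the change Dell
describes; the polling_interval value is an assumption carried over from
the configuration Nir posted below):

defaults {
    polling_interval 5
    no_path_retry queue
}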
Can you see any issues with this recommendation as far as oVirt is
concerned?
Thanks again
Gary Lloyd
________________________________________________
I.T. Systems:Keele University
Finance & IT Directorate
Keele:Staffs:IC1 Building:ST5 5NB:UK
+44 1782 733063
________________________________________________
On 4 October 2016 at 19:11, Nir Soffer <nsoffer at redhat.com> wrote:
> On Tue, Oct 4, 2016 at 10:51 AM, Gary Lloyd <g.lloyd at keele.ac.uk> wrote:
>
>> Hi
>>
>> We have oVirt 3.6.5 with a Dell EqualLogic SAN and we use Direct LUNs for
>> all our VMs.
>> Over the weekend, in the early hours, an EqualLogic controller failed over
>> to its standby on one of our arrays and this caused about 20 of our VMs to
>> be paused due to IO problems.
>>
>> I have also noticed that this happens during EqualLogic firmware upgrades
>> since we moved to oVirt 3.6.5.
>>
>> As recommended by Dell, disk timeouts within the VMs are set to 60 seconds
>> when they are hosted on an EqualLogic SAN.
>>
>> Is there any other timeout value that we can configure in vdsm.conf to
>> stop VMs from getting paused when a controller fails over?
>>
>
> You can set the timeout in multipath.conf.
>
> With the current multipath configuration (deployed by vdsm), when all paths
> to a device are lost (e.g. you take down all ports on the server during an
> upgrade), all I/O will fail immediately.
>
> If you want to allow a 60-second grace period in such a case, you can
> configure:
>
> no_path_retry 12
>
> This will keep checking the paths 12 times, every 5 seconds (assuming
> polling_interval=5), i.e. 12 x 5 = 60 seconds. If some path recovers during
> this time, the I/O can complete and the VM will not be paused.
>
> If no path is available after these retries, I/O will fail and VMs with
> pending I/O will pause.
>
> Note that this will also cause delays in vdsm in various flows, increasing
> the chance of timeouts on the engine side, or delays in storage domain
> monitoring.
>
> However, the 60-second delay is expected only the first time all paths
> become faulty. Once the timeout has expired, any access to the device will
> fail immediately.
>
> To configure this, you must add the # VDSM PRIVATE tag on the second line
> of multipath.conf, otherwise vdsm will override your configuration the next
> time you run vdsm-tool configure.
>
> multipath.conf should look like this:
>
> # VDSM REVISION 1.3
> # VDSM PRIVATE
>
> defaults {
>     polling_interval 5
>     no_path_retry 12
>     user_friendly_names no
>     flush_on_last_del yes
>     fast_io_fail_tmo 5
>     dev_loss_tmo 30
>     max_fds 4096
> }
>
> devices {
>     device {
>         all_devs yes
>         no_path_retry 12
>     }
> }
>
> This will use a 12-retry (60-second) timeout for any device. If you would
> like to configure only your specific device, you can add a device section
> for your specific array instead.
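>
> For example, a device section matching only the EqualLogic arrays might
> look like the sketch below (the vendor/product strings are an assumption
> based on how EqualLogic volumes typically identify themselves; verify the
> actual values with "multipath -ll" on a host before using them):
>
> devices {
>     device {
>         vendor "EQLOGIC"
>         product "100E-00"
>         no_path_retry 12
>     }
> }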
>
>
>>
>> Also, is there anything that we can tweak to automatically unpause the VMs
>> once connectivity with the arrays is re-established?
>>
>
> Vdsm will resume the VMs when the storage monitor detects that storage has
> become available again. However, we cannot guarantee that storage
> monitoring will detect that storage was down.
> This should be improved in 4.0.
>
>
>> At the moment we are running a customized version of storageServer.py, as
>> oVirt has yet to include iSCSI multipath support for Direct LUNs out of
>> the box.
>>
>
> Would you like to share this code?
>
> Nir
>