[ovirt-users] What recovers a VM from pause?

Mon May 30 19:09:25 UTC 2016

On Mon, May 30, 2016 at 4:07 PM, Nicolas Ecarnot <nicolas at ecarnot.net> wrote:
> Hello,
>
> We're planning a move from our old building towards a new one a few meters
> away.
>
> In a similar way to Martijn
> (https://www.mail-archive.com/users@ovirt.org/msg33182.html), I have
> maintenance planned on our storage side.
>
> Say an oVirt DC is using a SAN's LUN via iSCSI (Equallogic).
> This SAN allows me to setup block replication between two SANs, seen by
> oVirt as one (Dell is naming it SyncRep).
> Then switch all the iSCSI accesses to the replicated LUN.
>
> When doing this, the iSCSI stack of each oVirt host notices the
> disconnection, tries to reconnect, and succeeds.
> Amongst our hosts, this takes between 4 and 15 seconds.
>
> When this happens fast enough, oVirt engine and the VMs don't even notice,
> and they keep running happily.
>
> When this takes more than 4 seconds, there are 2 cases :
>
> 1 - The hosts and/or oVirt and/or the SPM (I actually don't know) notices
> that there is a storage failure, and pauses the VMs.
> When the iSCSI stack reconnects, the VMs are automatically recovered from
> pause, and this all takes less than 30 seconds. That is very acceptable for
> us, as this action is extremely rare.
>
> 2 - Same storage failure, VMs paused, and some VMs stay in pause mode
> forever.
> Manual "run" action is mandatory.
> When done, everything recovers correctly.
> This is also quite acceptable, but here come my questions :
>
> My questions : (!)
> - *WHAT* process or piece of code or what oVirt parts is responsible for
> deciding when to UN-pause a VM, and at what conditions?

VMs get paused by qemu when a write fails with ENOSPC or some other I/O error.
This probably happens when a VM is writing to storage and all paths to storage
are faulty. With the current configuration, the SCSI layer will fail after
5 seconds, and if no path is available, the write will fail.

If the vdsm storage monitoring system detects the issue, the storage domain
becomes invalid. When the storage domain becomes valid again, we try to
resume all VMs that were paused because of I/O errors.
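
The resume rule described above can be sketched as a small state machine.
This is only an illustration of the logic, not vdsm's actual code: the names
Vm, DomainMonitor and on_check are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Vm:
    name: str
    state: str = "running"      # "running" or "paused"
    pause_reason: str = ""      # hypothetical labels: "io_error", "user"


@dataclass
class DomainMonitor:
    """Hypothetical stand-in for a storage domain monitor (not vdsm code)."""
    vms: list = field(default_factory=list)
    valid: bool = True

    def on_check(self, storage_ok: bool) -> list:
        """Record one monitoring result and return the names of resumed VMs."""
        resumed = []
        # VMs are resumed only on an invalid -> valid transition.
        if storage_ok and not self.valid:
            for vm in self.vms:
                if vm.state == "paused" and vm.pause_reason == "io_error":
                    vm.state = "running"
                    vm.pause_reason = ""
                    resumed.append(vm.name)
        self.valid = storage_ok
        return resumed


if __name__ == "__main__":
    vm = Vm("vm1", state="paused", pause_reason="io_error")
    monitor = DomainMonitor(vms=[vm])
    # The monitor never saw the outage, so nothing is resumed.
    print(monitor.on_check(True), vm.state)
    # The monitor sees the outage, then the recovery, and resumes the VM.
    monitor.on_check(False)
    print(monitor.on_check(True), vm.state)
```

The first call shows the problematic case: if the monitor never observes the
failure, there is no transition back to valid and the VM stays paused.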

Storage monitoring is done every 10 seconds in normal conditions, but in the
current release there can be delays of up to a couple of minutes in extreme
conditions, for example with 50 storage domains and a lot of I/O. So basically,
the storage domain monitor may miss an error on storage. In that case the
domain never becomes invalid, so it never becomes valid again, and the VM has
to be resumed manually.
See https://bugzilla.redhat.com/1081962

In oVirt 4.0 monitoring should be improved, and will always monitor storage
every 10 seconds, but even this cannot guarantee that we will detect all
storage errors, for example if the storage outage is shorter than 10 seconds.
But I guess the chance that a storage outage is shorter than 10 seconds, yet
long enough to cause a VM to pause, is very low.
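
As a rough illustration of why a short outage can slip between two checks,
here is a back-of-the-envelope estimate. It assumes instantaneous, evenly
spaced checks and an outage that starts at a random point in the interval;
real monitoring is messier, and the function name is made up for this sketch.

```python
def detection_probability(outage_s: float, interval_s: float = 10.0) -> float:
    """Chance that at least one periodic check falls inside an outage.

    Assumes checks are instantaneous and evenly spaced, and that the outage
    starts at a uniformly random offset within the monitoring interval.
    """
    if outage_s < 0 or interval_s <= 0:
        raise ValueError("outage_s must be >= 0 and interval_s must be > 0")
    return min(1.0, outage_s / interval_s)


if __name__ == "__main__":
    # Outage lengths in the range reported earlier in the thread.
    for outage in (4, 7, 15):
        p = detection_probability(outage)
        print(f"{outage:>2}s outage, 10s interval: detected with p={p:.1f}")
```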

> That would help me to understand why some cases are working even more
> smoothly than others.
> - Are there related timeouts I could play with in engine-config options?

Nothing on the engine side...

> - [a bit off-topic] Is it safe to increase some iSCSI timeouts or
> buffer sizes in the hope this kind of disconnection would go unnoticed?

But you may modify the multipath configuration on the host.

We currently use this multipath configuration (/etc/multipath.conf):

# VDSM REVISION 1.3

defaults {
    polling_interval            5
    no_path_retry               fail
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
    deferred_remove             yes
}

devices {
    device {
        all_devs                yes
        no_path_retry           fail
    }
}

This forces I/O requests to fail on devices that by default would queue such
requests for a long or unlimited time. Queuing requests is very bad for vdsm: it
causes various commands to block for minutes during a storage outage, failing
various flows in vdsm and the UI.
See https://bugzilla.redhat.com/880738

However, in your case, using queuing may be the best way to do the switch
from one storage to another in the smoothest way.

You may try this setting:

devices {
    device {
        all_devs                yes
        no_path_retry           30
    }
}

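This will queue I/O requests for 30 path-checker retries before failing. Note
that no_path_retry counts retries, not seconds: with polling_interval 5 this
is about 150 seconds, so a smaller value such as 6 gives roughly 30 seconds.
Using this normally would be a bad idea with vdsm, since during a storage
outage vdsm may block for that long when no path is available, and it is not
designed for this behavior, but blocking from time to time for a short time
should be OK.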
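
Since no_path_retry is counted in path-checker retries rather than seconds,
the effective queueing window depends on polling_interval. The sketch below
does that arithmetic, assuming the window is simply
no_path_retry * polling_interval (worth confirming against multipath.conf(5)
for your version). The helper names are hypothetical.

```python
import math


def queue_window_s(no_path_retry: int, polling_interval_s: int = 5) -> int:
    """Approximate time I/O stays queued once all paths are down."""
    if no_path_retry < 0 or polling_interval_s <= 0:
        raise ValueError("need no_path_retry >= 0 and polling_interval_s > 0")
    return no_path_retry * polling_interval_s


def retries_for_outage(outage_s: float, polling_interval_s: int = 5) -> int:
    """Smallest no_path_retry whose queue window covers the given outage."""
    if outage_s < 0 or polling_interval_s <= 0:
        raise ValueError("need outage_s >= 0 and polling_interval_s > 0")
    return math.ceil(outage_s / polling_interval_s)


if __name__ == "__main__":
    print("no_path_retry 30 ->", queue_window_s(30), "seconds")
    # The longest reconnect time reported in the thread was 15 seconds.
    n = retries_for_outage(15)
    print("15s outage needs no_path_retry >=", n,
          "(window", queue_window_s(n), "seconds)")
```

In practice you would probably add a retry or two of margin on top of the
minimum, while keeping the window as short as the switch-over allows.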

I think that modifying the configuration and reloading the multipathd service
should be enough to use the new settings, but I'm not sure whether this affects
existing sessions or open devices.

Adding Ben to add more info about this.

Nir