[ovirt-devel] Re: Improving VM behavior in case of IO errors

Monday, 10 August 2020

Thanks Nir for the response.

I have the same question as Kevin. It would be great if you could share
your thoughts on that.

Thanks,

Shubha

On 8/10/2020 4:53 AM, Kevin Wolf wrote:
> Am 09.08.2020 um 23:50 hat Nir Soffer geschrieben:
>> On Wed, Jul 29, 2020 at 2:30 PM Shubha Kulkarni
>> <shubha.kulkarni(a)oracle.com> wrote:
>>> Thanks for the feedback Nir.
>>>
>>> I agree in general that having an additional engine config for disk
>>> level error handling default would be the right way. It would be good to
>>> decide the granularity. Would it make sense to have this for a specific
>>> disk type like lun or would you prefer to make it generic for all types?
>> This must be for a specific disk type, since for thin images on block
>> storage we cannot support propagating errors to the guest. This will
>> break thin provisioning.
> Is werror=enospc not enough for thin provisioning to work? This will
> still stop the guest for any other kinds of I/O errors.
>
> Kevin
>
>> Handling the LUN use case first seems like the best way, since in this
>> case we don't manage the LUN and we don't support resuming paused VMs
>> using LUNs yet, so propagating the error may be more useful.
>>
>> Managed Block Storage (cinderlib based disks) are very much like direct
>> LUNs. In this case we do manage the disks on the server, but otherwise
>> we don't support anything on the host (e.g. monitoring, resuming paused
>> VMs), so propagating the error like direct LUNs may be more useful.
>>
>> Images are a bigger problem since thin disks cannot support propagating
>> errors but preallocated disks can. But once you create a snapshot,
>> preallocated disks behave exactly like thin disks because they are the
>> same.
>>
>> Snapshots are also created automatically for preallocated images, for
>> example during live storage migration, and deleted automatically after
>> the migration. So you cannot assume that having only preallocated disks
>> is good for propagating errors.
>>
>> Even if you limit this option to file based storage, this is going to
>> break when you migrate the disks to block storage.
>>
>> Nir
>>
>>> Thanks,
>>>
>>> Shubha
>>>
>>> On 7/28/2020 2:03 PM, Nir Soffer wrote:
>>>> On Tue, Jul 28, 2020 at 4:58 AM Shubha Kulkarni
>>>> <shubha.kulkarni(a)oracle.com> wrote:
>>>>> Hello,
>>>>>
>>>>> In oVirt, we have a property propagate_error at the disk level that
>>>>> decides how an error is propagated to the VM. This value is
>>>>> maintained in the database table with the default value set to Off.
>>>>> The default setting (Off) results in a policy that ends up pausing
>>>>> the VM rather than propagating the errors to the VM. There is
>>>>> currently no provision in the UI to configure this property for
>>>>> disks (images or LUNs), so there is no easy way to set this value.
>>>>> Further, even if the value is manually set to "On" in the db, it gets
>>>>> overwritten by the UI every time some other property is updated, as
>>>>> described here -
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1669367
>>>>>
>>>>> Setting the value to "Off" is not ideal for multipath devices where a
>>>>> single path failure causes the VM to pause.
>>>> Single path failure should be transparent to qemu. multipath will fail
>>>> over the I/O to another path. The I/O will fail only if all paths are
>>>> down, and (with the default configuration) multipath path checkers
>>>> failed 4 times.
>>>>
>>>>> It puts serious restrictions on the DR situation and, unlike VMware
>>>>> & Hyper-V, oVirt is not able to support the DR functionality -
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1314160
>>>> Although in this bug we see that a failover that looks successful from
>>>> the multipath and vdsm point of view ended in a paused VM:
>>>> https://bugzilla.redhat.com/1860377
>>>>
>>>> Maybe Ben can explain how this can happen.
>>>>
>>>> I hope that qemu will provide more info on errors in the future. If we
>>>> had a log about the failed I/O it could be helpful.
>>>>
>>>>> While we wait for the RFE, the proposal here is to revise the out of
>>>>> the box behavior for LUNs. For LUNs, we should propagate the errors to
>>>>> the VM rather than directly stopping it. This will allow us to handle
>>>>> short-term multipath outages and improve availability. This is a
>>>>> simple change in behavior but will have a good positive impact. I
>>>>> would like to seek feedback about this to make sure that everyone is
>>>>> ok with the proposal.
>>>> I think it makes sense, but this is just a default, and it cannot work
>>>> for all cases.
>>>>
>>>> This can end in a broken VM with a read-only file system that must be
>>>> rebooted, while with error_policy="stop", failover may be transparent
>>>> to the VM even if it was paused for a short time.
>>>>
>>>> I would start by making engine defaults configurable using engine
>>>> config, so different oVirt distributions can use different defaults.
>>>>
>>>> Nir
>>>>
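
The per-disk-type default discussed in the thread above could be sketched roughly as follows. This is only an illustration of one possible reading, not engine code: the names DiskType, DEFAULT_PROPAGATE_ERRORS and error_policy() are hypothetical, and the chosen defaults (propagate for LUNs and Managed Block Storage, pause for images) are an assumption based on the discussion, which had not reached a decision. The returned strings "report" and "stop" are the libvirt disk error_policy values for propagating an error to the guest and for pausing the VM.

```python
from enum import Enum
from typing import Mapping, Optional


class DiskType(Enum):
    IMAGE = "image"
    LUN = "lun"
    MANAGED_BLOCK = "managed_block"


# Hypothetical per-type defaults (assumption, not actual engine config):
# images keep pausing the VM because thin provisioning on block storage
# depends on it, while LUNs and Managed Block Storage propagate errors.
DEFAULT_PROPAGATE_ERRORS = {
    DiskType.IMAGE: False,
    DiskType.LUN: True,
    DiskType.MANAGED_BLOCK: True,
}


def error_policy(
    disk_type: DiskType,
    overrides: Optional[Mapping[DiskType, bool]] = None,
) -> str:
    """Return the libvirt error_policy value for a disk type.

    overrides stands in for a distribution-specific engine config that
    replaces some or all of the built-in defaults.
    """
    defaults = dict(DEFAULT_PROPAGATE_ERRORS)
    if overrides:
        defaults.update(overrides)
    # "report" passes the I/O error to the guest, "stop" pauses the VM.
    return "report" if defaults[disk_type] else "stop"
```

A distribution that prefers the current behavior for LUNs would pass an override such as {DiskType.LUN: False}, which is the kind of configurable default suggested at the end of the quoted message.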
