Thanks for the feedback, Nir.
I agree in general that an additional engine config option for the default
disk-level error handling would be the right way to go. It would be good to
decide on the granularity. Would it make sense to have this for a specific
disk type such as LUN, or would you prefer to make it generic for all types?
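
To make the granularity question concrete, here is a rough sketch in Python
(purely illustrative, not engine code) of how a per-disk-type default could
fall back to a generic one. The option names PropagateDiskErrors and
PropagateDiskErrors.<type> are hypothetical; they are not existing
engine-config keys, and the On/Off values simply mirror the propagate_error
values discussed in this thread.

```python
from enum import Enum
from typing import Mapping


class DiskType(Enum):
    IMAGE = "image"
    LUN = "lun"


# Hypothetical option names, invented for this sketch only.
GENERIC_KEY = "PropagateDiskErrors"
PER_TYPE_KEY = "PropagateDiskErrors.{disk_type}"


def default_propagate_errors(config: Mapping[str, str], disk_type: DiskType) -> bool:
    """Return True if errors should be propagated to the VM by default.

    A per-type option wins over the generic one; if neither is set,
    fall back to "Off", which matches today's behavior (pause the VM).
    """
    per_type = config.get(PER_TYPE_KEY.format(disk_type=disk_type.value))
    value = per_type if per_type is not None else config.get(GENERIC_KEY, "Off")
    return value.strip().lower() == "on"


if __name__ == "__main__":
    # Generic default stays Off, only LUNs are switched to On.
    config = {"PropagateDiskErrors": "Off", "PropagateDiskErrors.lun": "On"}
    for disk_type in DiskType:
        print(disk_type.value, default_propagate_errors(config, disk_type))
```

With this shape, a generic option alone covers the "all types" case, and a
per-type option can be added later without changing the generic default.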
Thanks,
Shubha
On 7/28/2020 2:03 PM, Nir Soffer wrote:
> On Tue, Jul 28, 2020 at 4:58 AM Shubha Kulkarni
> <shubha.kulkarni(a)oracle.com> wrote:
>> Hello,
>>
>> In oVirt, we have a property propagate_error at the disk level that
>> decides how an error is propagated to the VM.
>> This value is maintained in the database table with the default value
>> set to Off. The default setting (Off) results in a policy that ends up
>> pausing the VM rather than propagating the errors to the VM. There is no
>> provision in the UI currently to configure this property for disk
>> (images or luns). So there is no easy way to set this value. Further,
>> even if the value is manually set to "On" in the db, it gets overwritten by
>> the UI every time some other property is updated, as described here -
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1669367
>>
>> Setting the value to "Off" is not ideal for multipath devices, where a
>> single path failure causes the VM to pause.
> Single path failure should be transparent to qemu. multipath will fail over
> the I/O to another path. The I/O will fail only if all paths are down, and
> (with the default configuration), multipath path checkers failed 4 times.
>
>> It puts serious restrictions on
>> the DR situation and, unlike VMware & Hyper-V, oVirt is not able to
>> support the DR functionality -
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1314160
> Although in this bug we see that a failover that looks successful from the
> multipath and vdsm point of view ended in a paused VM:
>
> https://bugzilla.redhat.com/1860377
>
> Maybe Ben can explain how this can happen.
>
> I hope that qemu will provide more info on errors in the future. If we had a log
> about the failed I/O, it could be helpful.
>
>> While we wait for the RFE, the proposal here is to revise the out-of-the-box
>> behavior for LUNs. For LUNs, we should propagate the errors to the VM rather
>> than directly stopping it. This will allow us to handle short-term
>> multipath outages and improve availability. This is a simple change in
>> behavior but will have a good positive impact. I would like to seek
>> feedback about this to make sure that everyone is OK with the proposal.
> I think it makes sense, but this is just a default, and it cannot work
> for all cases.
>
> This can end in a broken VM with a read-only file system that must be
> rebooted, while with error_policy="stop", failover may be transparent
> to the VM even if it was paused for a short time.
>
> I would start by making engine defaults configurable using engine
> config, so different oVirt distributions can use different defaults.
>
> Nir
>