Improving VM behavior in case of IO errors

Hello,

In oVirt, we have a property, propagate_error, at the disk level that decides how an I/O error is propagated to the VM. This value is maintained in the database table with the default set to Off. The default setting (Off) results in a policy that pauses the VM rather than propagating errors to it. There is currently no provision in the UI to configure this property for disks (images or LUNs), so there is no easy way to set this value. Further, even if the value is manually set to "On" in the database, it gets overwritten by the UI every time some other property is updated, as described here - https://bugzilla.redhat.com/show_bug.cgi?id=1669367

Setting the value to "Off" is not ideal for multipath devices, where a single path failure causes the VM to pause. It puts serious restrictions on DR situations, and unlike VMware and Hyper-V, oVirt is not able to support the DR functionality - https://bugzilla.redhat.com/show_bug.cgi?id=1314160

While we wait for the RFE, the proposal here is to revise the out-of-the-box behavior for LUNs. For LUNs, we should propagate the errors to the VM rather than directly pausing it. This will allow us to handle short-term multipath outages and improve availability. This is a simple change in behavior but will have a good positive impact. I would like to seek feedback about this to make sure that everyone is ok with the proposal.

Thanks,
Shubha
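As a rough illustration of the two behaviors being compared here, the sketch below (hypothetical Python, not actual oVirt engine or vdsm code; the function and value names are made up) shows the mapping from the disk-level flag to the resulting policy. Which policy value is used when errors are propagated is exactly what the rest of this thread discusses.

# Hypothetical sketch of the behavior described above; not actual oVirt code.

def disk_error_policy(propagate_errors: bool) -> str:
    """Map the disk-level propagate_errors flag to a qemu-style error policy."""
    if propagate_errors:
        # "On": the I/O error is handed back to the guest OS.
        return "report"
    # "Off" (the current default): qemu pauses the VM on an I/O error and
    # management has to resume it once storage recovers.
    return "stop"

assert disk_error_policy(False) == "stop"    # out-of-the-box behavior today
assert disk_error_policy(True) == "report"   # proposed behavior for LUNs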

On Tue, Jul 28, 2020 at 4:58 AM Shubha Kulkarni <shubha.kulkarni@oracle.com> wrote:
Setting the value to "Off" is not ideal for multipath devices where a single path failure causes vm to pause.
A single path failure should be transparent to qemu; multipath will fail over the I/O to another path. The I/O will fail only if all paths are down and (with the default configuration) the multipath path checkers have failed 4 times.
It puts serious restrictions for the DR situation and unlike VMWare * Hyper-V, oVirt is not able to support the DR functionality - https://bugzilla.redhat.com/show_bug.cgi?id=1314160
Although in this bug we see that a failover that looked successful from the multipath and vdsm point of view ended in a paused VM: https://bugzilla.redhat.com/1860377

Maybe Ben can explain how this can happen.

I hope that qemu will provide more info on errors in the future. If we had a log about the failed I/O it could be helpful.
While we wait for RFE, the proposal here is to revise the out of the box behavior for LUNs. For LUNs, we should propagate the errors to VM rather than directly stopping those. This will allow us to handle short-term multipath outages and improve availability. This is a simple change in behavior but will have good positive impact. I would like to seek feedback about this to make sure that everyone is ok with the proposal.
I think it makes sense, but this is just a default, and it cannot work for all cases. This can end in a broken VM with a read-only file system that must be rebooted, while with error_policy="stop", failover may be transparent to the VM even if it was paused for a short time.

I would start by making engine defaults configurable using engine config, so different oVirt distributions can use different defaults.

Nir
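A minimal sketch of the configurable default Nir suggests (hypothetical Python, purely to illustrate the idea; the engine config is not implemented like this and the key name "PropagateErrorsDefault" is made up):

# Hypothetical sketch of a distribution-configurable default; not oVirt code.
ENGINE_CONFIG = {
    # A downstream oVirt distribution could ship a different value here.
    "PropagateErrorsDefault": "Off",
}

def default_propagate_errors() -> bool:
    """Default applied when a disk is created without an explicit setting."""
    return ENGINE_CONFIG.get("PropagateErrorsDefault", "Off") == "On"

print(default_propagate_errors())  # False with the defaults above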

Thanks for the feedback, Nir. I agree in general that having an additional engine config for the disk-level error handling default would be the right way. It would be good to decide the granularity: would it make sense to have this for a specific disk type like LUN, or would you prefer to make it generic for all types?

Thanks,
Shubha

On Wed, Jul 29, 2020 at 2:30 PM Shubha Kulkarni <shubha.kulkarni@oracle.com> wrote:
Thanks for the feedback Nir.
I agree in general that having an additional engine config for disk level error handling default would be the right way. It would be good to decide the granularity. Would it make sense to have this for a specific disk type like lun or would you prefer to make it generic for all types?
This must be for a specific disk type, since for thin images on block storage we cannot support propagating errors to the guest. This will break thin provisioning.

Handling the LUN use case first seems like the best way, since in this case we don't manage the LUN and we don't support resuming paused VMs using LUNs yet, so propagating the error may be more useful.

Managed Block Storage (cinderlib based disks) are very much like direct LUNs. In this case we do manage the disks on the server, but otherwise we don't support anything on the host (e.g. monitoring, resuming paused VMs), so propagating the error like direct LUNs may be more useful.

Images are a bigger problem, since thin disks cannot support propagating errors but preallocated disks can. But once you create a snapshot, preallocated disks behave exactly like thin disks, because they are the same. Snapshots are also created automatically for preallocated images, for example during live storage migration, and deleted automatically after the migration. So you cannot assume that having only preallocated disks is good for propagating errors. Even if you limit this option to file-based storage, this is going to break when you migrate the disks to block storage.

Nir
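The per-disk-type reasoning above can be condensed into a small sketch (hypothetical Python; the disk type labels come from the discussion, everything else is made up for illustration):

# Hypothetical summary of the argument above; not oVirt code.
# Could a disk type safely default to propagating I/O errors to the guest?
SAFE_TO_PROPAGATE = {
    "direct_lun": True,             # not managed by oVirt, no auto-resume today
    "managed_block_storage": True,  # cinderlib disks behave like direct LUNs here
    "image": False,                 # thin images (and any image with a snapshot)
                                    # rely on pausing the VM to extend the volume
}

def suggested_default(disk_type: str) -> bool:
    """Suggested out-of-the-box propagate_errors value per disk type."""
    return SAFE_TO_PROPAGATE.get(disk_type, False)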

On 09.08.2020 at 23:50, Nir Soffer wrote:
This must be for a specific disk type, since for thin images on block storage we cannot support propagating errors to the guest. This will break thin provisioning.
Is werror=enospc not enough for thin provisioning to work? This will still stop the guest for any other kinds of I/O errors. Kevin

Thanks Nir for the response. I have the same question as Kevin; it would be great if you can share your thoughts on that.

Thanks,
Shubha

On Mon, Aug 10, 2020 at 11:53 AM Kevin Wolf <kwolf@redhat.com> wrote:
Is werror=enospc not enough for thin provisioning to work? This will still stop the guest for any other kinds of I/O errors.
Right, this should work, and what we actually use now for propagating errors for anything but cdrom. For LUN using werror=enospc,rerror=enospc seems wrong, but we do this for many years.

This is how we handle cdrom:

-device ide-cd,bus=ide.2,id=ua-346e176c-f983-4510-af4b-786b368efdd6,bootindex=2,werror=report,rerror=report

Image:

-device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.5,addr=0x0,drive=libvirt-2-format,id=ua-1d93fa9e-1665-40d7-9ffc-770513242795,bootindex=1,write-cache=on,serial=1d93fa9e-1665-40d7-9ffc-770513242795,werror=stop,rerror=stop \

LUN:

-device virtio-blk-pci,iothread=iothread2,scsi=off,bus=pci.7,addr=0x0,drive=libvirt-1-format,id=ua-19b06845-2c54-422d-921b-6ec0ee2e935b,write-cache=on,werror=stop,rerror=stop \

Kevin, any reason not to use werror=report,rerror=report for LUN when we want to propagate errors to the guest?

On 11.08.2020 at 17:44, Nir Soffer wrote:
Right, this should work, and what we actually use now for propagating errors for anything but cdrom.
Hm, wait, the options you quote below are all either 'stop' or 'report', but never 'enospc'. Is 'enospc' used for yet another kind of disk?
For LUN using werror=enospc,rerror=enospc seems wrong, but we do this for many years.
This is how we handle cdrom:
-device ide-cd,bus=ide.2,id=ua-346e176c-f983-4510-af4b-786b368efdd6,bootindex=2,werror=report,rerror=report
Makes sense to me. This is read-only and removable media. Stopping the guest usually makes sense so that it won't assume the disk is broken, but if it happens with removable media, you can just eject and re-insert the same image and it's fixed.
Image:
-device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.5,addr=0x0,drive=libvirt-2-format,id=ua-1d93fa9e-1665-40d7-9ffc-770513242795,bootindex=1,write-cache=on,serial=1d93fa9e-1665-40d7-9ffc-770513242795,werror=stop,rerror=stop
I assume this is the one that could use 'enospc'?
LUN:
-device virtio-blk-pci,iothread=iothread2,scsi=off,bus=pci.7,addr=0x0,drive=libvirt-1-format,id=ua-19b06845-2c54-422d-921b-6ec0ee2e935b,write-cache=on,werror=stop,rerror=stop \
Kevin, any reason not to use werror=report,rerror=report for LUN when we want to propagate errors to the guest?
If you want to propagate errors, then 'report' is the right setting.

What does "LUN" mean exactly? It doesn't seem to be passthrough, so is it just that you have some restriction like that it's always raw?

Maybe I would use 'enospc' for consistency even though you never expect this error to happen. But 'report' is fine, too.

Of course, if you ever get an I/O error (e.g. network temporarily down), propagating errors to the guest means that it will give up on the disk. Whether this is the desired behaviour should probably be configured by the user.

Kevin

On Tue, Aug 11, 2020 at 7:21 PM Kevin Wolf <kwolf@redhat.com> wrote:
Hm, wait, the options you quote below are all either 'stop' or 'report', but never 'enospc'. Is 'enospc' used for yet another kind of disk?
Currently as a user there is no good way to get enospc, this is what Shubha is trying to fix.
For LUN using werror=enospc,rerror=enospc seems wrong, but we do this for many years.
This is how we handle cdrom:
-device ide-cd,bus=ide.2,id=ua-346e176c-f983-4510-af4b-786b368efdd6,bootindex=2,werror=report,rerror=report
Makes sense to me. This is read-only and removable media. Stopping the guest usually makes sense so that it won't assume the disk is broken, but if it happens with removable media, you can just eject and re-insert the same image and it's fixed.
BTW, this was changed because users typically leave a cdrom attached from an otherwise unused ISO storage domain (NFS); when the NFS server broke, the VM was stopped.
Image:
-device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.5,addr=0x0,drive=libvirt-2-format,id=ua-1d93fa9e-1665-40d7-9ffc-770513242795,bootindex=1,write-cache=on,serial=1d93fa9e-1665-40d7-9ffc-770513242795,werror=stop,rerror=stop
I assume this is the one that could use 'enospc'?
Yes, if we propagate errors, this will become werror=enospc,rerror=enospc
LUN:
-device virtio-blk-pci,iothread=iothread2,scsi=off,bus=pci.7,addr=0x0,drive=libvirt-1-format,id=ua-19b06845-2c54-422d-921b-6ec0ee2e935b,write-cache=on,werror=stop,rerror=stop \
Kevin, any reason not to use werror=report,rerror=report for LUN when we want to propagate errors to the guest?
If you want to propagate errors, then 'report' is the right setting.
What does "LUN" mean exactly?
When we attach a multipath device, this is called "Direct LUN" in oVirt. The underlying device can be iSCSI or FC, managed by the user or managed by Cinderlib. We have 3 options:

1. As virtio or virtio-scsi

<disk type='block' device='disk' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='native' iothread='1'/>
  <source dev='/dev/mapper/360014058657c2a1941841348f19c1a50' index='1'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore/>
  <target dev='vdb' bus='virtio'/>
  <alias name='ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b'/>
  <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</disk>

-blockdev '{"driver":"host_device","filename":"/dev/mapper/360014058657c2a1941841348f19c1a50","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
-device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b,write-cache=on,werror=stop,rerror=stop \

2. same + passthrough

<disk type='block' device='lun' sgio='filtered' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='native'/>
  <source dev='/dev/mapper/360014058657c2a1941841348f19c1a50' index='1'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore/>
  <target dev='sda' bus='scsi'/>
  <alias name='ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

-blockdev '{"driver":"host_device","filename":"/dev/mapper/360014058657c2a1941841348f19c1a50","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
-device scsi-block,bus=ua-50240806-3d5a-4e5b-a220-bc394698a641.0,channel=0,scsi-id=0,lun=0,drive=libvirt-1-format,id=ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b,werror=stop,rerror=stop \

3. same + privileged I/O

<disk type='block' device='lun' sgio='unfiltered' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='native'/>
  <source dev='/dev/mapper/360014058657c2a1941841348f19c1a50' index='1'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore/>
  <target dev='sda' bus='scsi'/>
  <alias name='ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

-blockdev '{"driver":"host_device","filename":"/dev/mapper/360014058657c2a1941841348f19c1a50","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
-device scsi-block,bus=ua-9c2c7e43-d32d-4ea4-9cfd-e2bb36d26fdb.0,channel=0,scsi-id=0,lun=0,drive=libvirt-1-format,id=ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b,werror=stop,rerror=stop \

It doesn't seem to be passthrough, so is it just that you have some restriction like that it's always raw?
Yes, we don't support qcow2 for these disks (yet). Theoretically we can support qcow2 to enable incremental backup. But the qcow2 image will never be larger than the block device, actually smaller to leave room for metadata. Thin provisioning is done on the storage side.
Maybe I would use 'enospc' for consistency even though you never expect this error to happen. But 'report' is fine, too.
enospc looks wrong since this error should not be possible, and if it happens we cannot handle it. Sounds like a good way to confuse future maintainers of this code.

Maybe libvirt or qemu did not support "report" when this code was added in 2010?

On 11.08.2020 at 19:22, Nir Soffer wrote:
Maybe I would use 'enospc' for consistency even though you never expect this error to happen. But 'report' is fine, too.
enospc looks wrong since this error should not be possible, and if it happens we cannot handle it. Sounds like a good way to confuse future maintainers of this code.
If it's a different code path anyway, then I agree. If you would have to make it a special case just for this, I'd rather add a comment.
Maybe libvirt or qemu did not support "report" when this code was added in 2010?
At least as far as QEMU is concerned, 'report' has existed as long as rerror/werror in general. If you're interested, commit 428c570512c added it in 2009. Kevin

Okay. So here is my understanding/summary based on the discussion:

Disk Storage Type      | Current Behavior (propagate error = On) | Current Behavior (propagate error = Off) | Recommended Changes (propagate error = On) | Recommended Changes (propagate error = Off)
Image                  | Error policy - enospace                  | Error policy - stop                      | No change                                  | No change
Lun                    | Error policy - enospace                  | Error policy - stop                      | Error policy - report                      | No change
Cinder                 | Error policy - stop                      | Error policy - stop                      | Error policy - report                      | No change
Managed Block Storage  | Use default                              | Use default                              | No change                                  | No change

Can you please confirm?

Thanks,
Shubha
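The same summary, expressed as a small lookup table (a sketch of the mapping only; the storage type and policy names are the ones used in the table above, not identifiers from the code):

# Sketch of the proposed defaults from the table above, keyed by
# (disk storage type, propagate error setting). Not oVirt code.
RECOMMENDED_ERROR_POLICY = {
    ("image", "on"):   "enospace",  # no change
    ("image", "off"):  "stop",      # no change
    ("lun", "on"):     "report",    # changed from enospace
    ("lun", "off"):    "stop",      # no change
    ("cinder", "on"):  "report",    # changed from stop
    ("cinder", "off"): "stop",      # no change
    # Managed Block Storage keeps using the default in both cases.
}

def recommended_policy(disk_type: str, propagate: str) -> str:
    return RECOMMENDED_ERROR_POLICY.get((disk_type, propagate), "default")

print(recommended_policy("lun", "on"))  # "report"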

Based on the feedback, I changed the code so that we read the PropagateErrors config the very first time a disk is created (BaseImage constructor) and the disk property is set accordingly. This change is very straightforward, but I am stuck with a lot of unit test failures and it is becoming a challenging journey to fix all the issues.

First of all, I realized that the majority of the disk related tests are written with no mock configuration setup, so when the tests are run the config is never initialized. But since the code now tries to use the config, the tests simply fail. Once I fixed the first test, the next case failed, and I am literally spending hours understanding the tests and getting them fixed.

One particular test case scenario is pretty unique. There is a parameterized test (StorageDomainValidatorFreeSpaceTest) where the function that creates the parameters (createParam) is called before the individual tests. This function has code that ends up adding a new disk snapshot via a copyOf method. MockConfigExtension is the standard way for us to add config, but I found that there is no way to inject config before the parameters are created. I have rearranged the code to work around the issue here.

Anyway, I thought I would ask if there is a better way to accomplish the goal here, because fixing the tests is a very challenging and time-consuming problem.

Thanks,
Shubha
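For illustration, the ordering problem described above can be sketched with a generic pytest analogy (hypothetical Python, not the actual ovirt-engine JUnit code): the parameter list of a parameterized test is built at collection time, before any per-test configuration setup has run, which is the same trap createParam falls into with MockConfigExtension.

# Hypothetical pytest analogy of the ordering problem; not ovirt-engine code.
import pytest

CONFIG = {}  # stands in for the mocked engine config

def create_params():
    # Evaluated when the module is imported/collected, before any fixture runs,
    # just like createParam running before MockConfigExtension injects config.
    return [CONFIG.get("PropagateErrors", "unset")]

@pytest.fixture(autouse=True)
def mock_config():
    CONFIG["PropagateErrors"] = "Off"  # too late to influence create_params()
    yield
    CONFIG.clear()

@pytest.mark.parametrize("value", create_params())
def test_parameters_never_see_the_config(value):
    assert value == "unset"  # the parameters were built before the config existed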
From: Shubha Kulkarni
Sent: Tuesday, August 11, 2020 6:46 PM
To: Kevin Wolf <kwolf@redhat.com>; Nir Soffer <nsoffer@redhat.com>
Cc: devel <devel@ovirt.org>; Simon Coter <simon.coter@oracle.com>; Benjamin Marzinski <bmarzins@redhat.com>; greg King <greg.king@oracle.com>; Pierre Lecomte <pierre.lecomte@oracle.com>
Subject: [ovirt-devel] Re: Improving VM behavior in case of IO errors

Okay. So here is my understanding/summary based on the discussion.

Disk Storage Type     | Current (propagate error = On) | Current (propagate error = Off) | Recommended (propagate error = On) | Recommended (propagate error = Off)
----------------------+--------------------------------+---------------------------------+------------------------------------+------------------------------------
Image                 | Error policy - enospace        | Error policy - stop             | No change                          | No change
Lun                   | Error policy - enospace        | Error policy - stop             | Error policy - report              | No change
Cinder                | Error policy - stop            | Error policy - stop             | Error policy - report              | No change
Managed Block Storage | Use default                    | Use default                     | No change                          | No change

Can you please confirm?

Thanks,
Shubha
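To make the summary concrete, here is a small illustrative Java sketch (names are hypothetical; this is not the actual engine code) that encodes the "Recommended" columns of the table above, i.e. the error policy a disk would end up with once the proposal is applied:

// Illustrative sketch only: encodes the recommended defaults from the table above.
enum DiskStorageKind { IMAGE, LUN, CINDER, MANAGED_BLOCK_STORAGE }

final class ErrorPolicyDefaults {

    /**
     * Returns the libvirt error_policy recommended by the table above,
     * or null where the hypervisor default should be used.
     */
    static String recommendedErrorPolicy(DiskStorageKind kind, boolean propagateErrors) {
        switch (kind) {
            case IMAGE:
                // Unchanged: report ENOSPC to the guest, pause the VM on other errors.
                return propagateErrors ? "enospace" : "stop";
            case LUN:
            case CINDER:
                // Proposed change: report all I/O errors to the guest when
                // propagation is enabled, keep pausing the VM otherwise.
                return propagateErrors ? "report" : "stop";
            case MANAGED_BLOCK_STORAGE:
            default:
                // Use the default policy, no explicit setting.
                return null;
        }
    }
}

The values are the libvirt error_policy names used elsewhere in this thread; 'report' roughly corresponds to werror=report,rerror=report on the qemu command line, and 'enospace' to werror=enospc.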
On 8/11/2020 1:34 PM, Kevin Wolf wrote:
Am 11.08.2020 um 19:22 hat Nir Soffer geschrieben:
On Tue, Aug 11, 2020 at 7:21 PM Kevin Wolf <kwolf@redhat.com> wrote:
Am 11.08.2020 um 17:44 hat Nir Soffer geschrieben:
On Mon, Aug 10, 2020 at 11:53 AM Kevin Wolf <kwolf@redhat.com> wrote:
Am 09.08.2020 um 23:50 hat Nir Soffer geschrieben:
On Wed, Jul 29, 2020 at 2:30 PM Shubha Kulkarni <shubha.kulkarni@oracle.com> wrote:

Thanks for the feedback Nir. I agree in general that having an additional engine config for the disk-level error handling default would be the right way. It would be good to decide the granularity. Would it make sense to have this for a specific disk type like lun, or would you prefer to make it generic for all types?

This must be for a specific disk type, since for thin images on block storage we cannot support propagating errors to the guest. This will break thin provisioning.

Is werror=enospc not enough for thin provisioning to work? This will still stop the guest for any other kinds of I/O errors.

Right, this should work, and it is what we actually use now for propagating errors for anything but cdrom.

Hm, wait, the options you quote below are all either 'stop' or 'report', but never 'enospc'. Is 'enospc' used for yet another kind of disk?

Currently, as a user, there is no good way to get enospc; this is what Shubha is trying to fix. For a LUN, using werror=enospc,rerror=enospc seems wrong, but we have been doing this for many years.

This is how we handle cdrom:

-device ide-cd,bus=ide.2,id=ua-346e176c-f983-4510-af4b-786b368efdd6,bootindex=2,werror=report,rerror=report

Makes sense to me. This is read-only and removable media. Stopping the guest usually makes sense so that it won't assume the disk is broken, but if it happens with removable media, you can just eject and re-insert the same image and it's fixed.

BTW, this was changed since users typically leave a cdrom attached with an otherwise unused ISO storage domain (NFS). When the NFS server broke, the VM was stopped.

Image:

-device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.5,addr=0x0,drive=libvirt-2-format,id=ua-1d93fa9e-1665-40d7-9ffc-770513242795,bootindex=1,write-cache=on,serial=1d93fa9e-1665-40d7-9ffc-770513242795,werror=stop,rerror=stop

I assume this is the one that could use 'enospc'?

Yes, if we propagate errors, this will become werror=enospc,rerror=enospc

LUN:

-device virtio-blk-pci,iothread=iothread2,scsi=off,bus=pci.7,addr=0x0,drive=libvirt-1-format,id=ua-19b06845-2c54-422d-921b-6ec0ee2e935b,write-cache=on,werror=stop,rerror=stop \

Kevin, any reason not to use werror=report,rerror=report for LUN when we want to propagate errors to the guest?

If you want to propagate errors, then 'report' is the right setting.

What does "LUN" mean exactly?

When we attach a multipath device, this is called "Direct LUN" in oVirt. The underlying device can be iSCSI or FC, managed by the user or managed by Cinderlib. We have 3 options:

1. As virtio or virtio-scsi

<disk type='block' device='disk' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='native' iothread='1'/>
  <source dev='/dev/mapper/360014058657c2a1941841348f19c1a50' index='1'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore/>
  <target dev='vdb' bus='virtio'/>
  <alias name='ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b'/>
  <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
</disk>

-blockdev '{"driver":"host_device","filename":"/dev/mapper/360014058657c2a1941841348f19c1a50","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
-device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b,write-cache=on,werror=stop,rerror=stop \

2. Same + passthrough

<disk type='block' device='lun' sgio='filtered' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='native'/>
  <source dev='/dev/mapper/360014058657c2a1941841348f19c1a50' index='1'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore/>
  <target dev='sda' bus='scsi'/>
  <alias name='ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

-blockdev '{"driver":"host_device","filename":"/dev/mapper/360014058657c2a1941841348f19c1a50","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
-device scsi-block,bus=ua-50240806-3d5a-4e5b-a220-bc394698a641.0,channel=0,scsi-id=0,lun=0,drive=libvirt-1-format,id=ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b,werror=stop,rerror=stop \

3. Same + privileged I/O

<disk type='block' device='lun' sgio='unfiltered' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='native'/>
  <source dev='/dev/mapper/360014058657c2a1941841348f19c1a50' index='1'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore/>
  <target dev='sda' bus='scsi'/>
  <alias name='ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

-blockdev '{"driver":"host_device","filename":"/dev/mapper/360014058657c2a1941841348f19c1a50","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
-device scsi-block,bus=ua-9c2c7e43-d32d-4ea4-9cfd-e2bb36d26fdb.0,channel=0,scsi-id=0,lun=0,drive=libvirt-1-format,id=ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b,werror=stop,rerror=stop \

It doesn't seem to be passthrough, so is it just that you have some restriction, like that it's always raw?

Yes, we don't support qcow2 for these disks (yet). Theoretically we can support qcow2 to enable incremental backup, but the qcow2 image will never be larger than the block device - actually smaller, to leave room for metadata. Thin provisioning is done on the storage side.

Maybe I would use 'enospc' for consistency even though you never expect this error to happen. But 'report' is fine, too.

enospc looks wrong, since this error should not be possible, and if it happens we cannot handle it. Sounds like a good way to confuse future maintainers of this code.

If it's a different code path anyway, then I agree. If you would have to make it a special case just for this, I'd rather add a comment.