On Tue, Aug 11, 2020 at 7:21 PM Kevin Wolf <kwolf(a)redhat.com> wrote:
On 11.08.2020 at 17:44, Nir Soffer wrote:
> On Mon, Aug 10, 2020 at 11:53 AM Kevin Wolf <kwolf(a)redhat.com> wrote:
>
> > On 09.08.2020 at 23:50, Nir Soffer wrote:
> > > On Wed, Jul 29, 2020 at 2:30 PM Shubha Kulkarni
> > > <shubha.kulkarni(a)oracle.com> wrote:
> > > >
> > > > Thanks for the feedback Nir.
> > > >
> > > > I agree in general that having an additional engine config for disk
> > > > level error handling default would be the right way. It would be good
> > > > to decide the granularity. Would it make sense to have this for a
> > > > specific disk type like lun or would you prefer to make it generic
> > > > for all types?
> > >
> > > This must be for a specific disk type, since for thin images on block
> > > storage we cannot support propagating errors to the guest. This will
> > > break thin provisioning.
> >
> > Is werror=enospc not enough for thin provisioning to work? This will
> > still stop the guest for any other kinds of I/O errors.
> >
>
> Right, this should work, and what we actually use now for propagating
> errors for anything but cdrom.
Hm, wait, the options you quote below are all either 'stop' or 'report',
but never 'enospc'. Is 'enospc' used for yet another kind of disk?
Currently, as a user, there is no good way to get enospc; this is what Shubha
is trying to fix.
> For LUN, using werror=enospc,rerror=enospc seems wrong, but we have done
> this for many years.
>
> This is how we handle cdrom:
>
> -device
>
> ide-cd,bus=ide.2,id=ua-346e176c-f983-4510-af4b-786b368efdd6,bootindex=2,werror=report,rerror=report
Makes sense to me. This is read-only and removable media. Stopping the
guest usually makes sense so that it won't assume the disk is broken,
but if it happens with removable media, you can just eject and re-insert
the same image and it's fixed.
BTW this was changed because users typically leave a cdrom attached from an
otherwise unused ISO storage domain (NFS). When the NFS server broke, the VM
was stopped.
> Image:
>
> -device
>
> virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.5,addr=0x0,drive=libvirt-2-format,id=ua-1d93fa9e-1665-40d7-9ffc-770513242795,bootindex=1,write-cache=on,serial=1d93fa9e-1665-40d7-9ffc-770513242795,werror=stop,rerror=stop
I assume this is the one that could use 'enospc'?
Yes, if we propagate errors, this will become werror=enospc,rerror=enospc
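In libvirt domain XML terms, a sketch of what the <driver> element for such an
image disk could look like (type shown as qcow2 for a thin image; assuming the
usual mapping of error_policy/rerror_policy to werror/rerror, and noting that
libvirt only offers 'enospace' on the write side, so the read side would stay
'report' or 'stop'):

    <driver name='qemu' type='qcow2' cache='none' io='native' iothread='1'
            error_policy='enospace' rerror_policy='report'/>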
>
> LUN:
>
> -device
>
> virtio-blk-pci,iothread=iothread2,scsi=off,bus=pci.7,addr=0x0,drive=libvirt-1-format,id=ua-19b06845-2c54-422d-921b-6ec0ee2e935b,write-cache=on,werror=stop,rerror=stop
> \
>
> Kevin, any reason not to use werror=report,rerror=report for LUN when
> we want to propagate errors to the guest?
If you want to propagate errors, then 'report' is the right setting.
What does "LUN" mean exactly?
When we attach a multipath device, this is called "Direct LUN" in oVirt.
The underlying device can be iSCSI or FC, managed by the user, or managed by
Cinderlib.
We have 3 options:
1. As virtio or virtio-scsi
    <disk type='block' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop'
              io='native' iothread='1'/>
      <source dev='/dev/mapper/360014058657c2a1941841348f19c1a50' index='1'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <alias name='ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00'
               function='0x0'/>
    </disk>
-blockdev
'{"driver":"host_device","filename":"/dev/mapper/360014058657c2a1941841348f19c1a50","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}'
\
-blockdev
'{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}'
\
-device
virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b,write-cache=on,werror=stop,rerror=stop
\
2. same + passthrough
    <disk type='block' device='lun' sgio='filtered' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop'
              io='native'/>
      <source dev='/dev/mapper/360014058657c2a1941841348f19c1a50' index='1'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <alias name='ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
-blockdev
'{"driver":"host_device","filename":"/dev/mapper/360014058657c2a1941841348f19c1a50","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}'
\
-blockdev
'{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}'
\
-device
scsi-block,bus=ua-50240806-3d5a-4e5b-a220-bc394698a641.0,channel=0,scsi-id=0,lun=0,drive=libvirt-1-format,id=ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b,werror=stop,rerror=stop
\
3. same + privileged I/O
    <disk type='block' device='lun' sgio='unfiltered' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop'
              io='native'/>
      <source dev='/dev/mapper/360014058657c2a1941841348f19c1a50' index='1'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <alias name='ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
-blockdev
'{"driver":"host_device","filename":"/dev/mapper/360014058657c2a1941841348f19c1a50","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}'
\
-blockdev
'{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}'
\
-device
scsi-block,bus=ua-9c2c7e43-d32d-4ea4-9cfd-e2bb36d26fdb.0,channel=0,scsi-id=0,lun=0,drive=libvirt-1-format,id=ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b,werror=stop,rerror=stop
\
It doesn't seem to be passthrough, so is
it just that you have some restriction like that it's always raw?
Yes, we don't support qcow2 for these disks (yet). Theoretically we could
support qcow2 to enable incremental backup, but the qcow2 image will never be
larger than the block device, actually smaller, to leave room for metadata.
Thin provisioning is done on the storage side.
Maybe I would use 'enospc' for consistency even though you never expect this
error to happen. But 'report' is fine, too.
enospc looks wrong since this error should not be possible, and if it happens
we cannot handle it. Sounds like a good way to confuse future maintainers of
this code.

Maybe libvirt or qemu did not support "report" when this code was added in
2010?
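For illustration, a sketch of how option 1 above could look if we propagate
errors for a Direct LUN, changing only the error policy (assuming the standard
mapping of libvirt error_policy/rerror_policy to qemu werror/rerror):

    <driver name='qemu' type='raw' cache='none' error_policy='report'
            rerror_policy='report' io='native' iothread='1'/>

which should result in a device line like:

    -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-c1bf9168-00f0-422f-a190-9ddf6bcd449b,write-cache=on,werror=report,rerror=report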
Of course, if you ever get an I/O error (e.g. network temporarily down),
propagating errors to the guest means that it will give up on the
disk.
Whether this is the desired behaviour should probably be configured by
the user.
Kevin
>
> > Kevin
> >
> > > Handling the LUN use case first seems like the best way, since in this
> > > case we don't manage the LUN and we don't support resuming paused VMs
> > > using LUNs yet, so propagating the error may be more useful.
> > >
> > > Managed Block Storage (cinderlib based disks) are very much like
> > > direct LUN. In this case we do manage the disks on the server, but
> > > otherwise we don't support anything on the host (e.g. monitoring,
> > > resuming paused VMs) so propagating the error like direct LUNs may be
> > > more useful.
> > >
> > > Images are a bigger problem since thin disks cannot support
> > > propagating errors but preallocated disks can. But once you create a
> > > snapshot, preallocated disks behave exactly like thin disks because
> > > they are the same.
> > >
> > > Snapshots are also created automatically for preallocated images, for
> > > example during live storage migration, and deleted automatically after
> > > the migration. So you cannot assume that having only preallocated disks
> > > is good for propagating errors.
> > >
> > > Even if you limit this option to file based storage, this is going to
> > > break when you migrate
> > > the disks to block storage.
> > >
> > > Nir
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Shubha
> > > >
> > > > On 7/28/2020 2:03 PM, Nir Soffer wrote:
> > > > > On Tue, Jul 28, 2020 at 4:58 AM Shubha Kulkarni
> > > > > <shubha.kulkarni(a)oracle.com> wrote:
> > > > >> Hello,
> > > > >>
> > > > >> In OVirt, we have a property propagate_error at the disk level that
> > > > >> decides, in case of an error, how this error should be propagated
> > > > >> to the VM. This value is maintained in the database table with the
> > > > >> default value set as Off. The default setting (Off) results in a
> > > > >> policy that ends up pausing the VM rather than propagating the
> > > > >> errors to the VM. There is no provision in the UI currently to
> > > > >> configure this property for disks (images or luns). So there is no
> > > > >> easy way to set this value. Further, even if the value is manually
> > > > >> set to "On" in the db, it gets overwritten by the UI every time
> > > > >> some other property is updated as described here -
> > > > >> https://bugzilla.redhat.com/show_bug.cgi?id=1669367
> > > > >>
> > > > >> Setting the value to "Off" is not ideal for multipath devices where
> > > > >> a single path failure causes the VM to pause.
> > > > > Single path failure should be transparent to qemu. multipath will
> > > > > fail over the I/O to another path. The I/O will fail only if all
> > > > > paths are down, and (with the default configuration), multipath path
> > > > > checkers failed 4 times.
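As a rough illustration of the settings involved (example values only, not the
exact multipath.conf that vdsm manages):

    defaults {
        # seconds between path checks
        polling_interval    5
        # after all paths fail, keep queueing I/O for this many more checks
        # before reporting the error to qemu
        no_path_retry       4
    }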
> > > > >
> > > > >> It puts serious restrictions for the DR situation and unlike VMware
> > > > >> and Hyper-V, oVirt is not able to support the DR functionality -
> > > > >> https://bugzilla.redhat.com/show_bug.cgi?id=1314160
> > > > > Although in this bug we see that failover that looks successful from
> > > > > multipath and vdsm point of view ended in a paused VM:
> > > > > https://bugzilla.redhat.com/1860377
> > > > >
> > > > > Maybe Ben can explain how this can happen.
> > > > >
> > > > > I hope that qemu will provide more info on errors in the future. If
> > > > > we had a log about the failed I/O it could be helpful.
> > > > >
> > > > >> While we wait for the RFE, the proposal here is to revise the out
> > > > >> of the box behavior for LUNs. For LUNs, we should propagate the
> > > > >> errors to the VM rather than directly stopping those. This will
> > > > >> allow us to handle short-term multipath outages and improve
> > > > >> availability. This is a simple change in behavior but will have a
> > > > >> good positive impact. I would like to seek feedback about this to
> > > > >> make sure that everyone is ok with the proposal.
> > > > > I think it makes sense, but this is just a default, and it cannot
> > > > > work for all cases.
> > > > >
> > > > > This can end in a broken VM with a read only file system that must
> > > > > be rebooted, while with error_policy="stop", failover may be
> > > > > transparent to the VM even if it was paused for a short time.
> > > > >
> > > > > I would start by making engine defaults configurable using engine
> > > > > config, so different oVirt distributions can use different defaults.
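A minimal sketch of what such a configurable default could look like (the
option name below is hypothetical; only the engine-config -s syntax and the
service restart are real):

    # hypothetical configuration key, for illustration only
    engine-config -s LunDiskDefaultErrorPolicy=report
    systemctl restart ovirt-engine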
> > > > >
> > > > > Nir
> > > > >
> > > >
> > >
> >
> >