
On Wed, Jul 29, 2020 at 01:31:09 +0300, Nir Soffer wrote:
On Tue, Jul 28, 2020 at 4:05 PM Łukasz Kołaciński <l.kolacinski@storware.eu> wrote:
Hello,
Hi Łukasz
I moved the discussion to devel@ovirt.org since it is more appropriate for this issue.
After doing a few vm backups, something breaks and I am unable to perform
any operations. I cannot do incremental backups, and even full backups don't work. This is the third time I have hit this issue. I don't know how to fix it, so I am currently creating new vms for testing purposes.
VDSM ovirt44-h2.storware.local command StartVmBackupVDS failed: Backup Error: {'vm_id': '116aa6eb-31a1-43db-9b1e-ad6e32fb9260', 'backup': <vdsm.virt.backup.BackupConfig object at 0x7f42602bba20>, 'reason': "Error starting backup: internal error: unable to execute QEMU command 'transaction': Dirty bitmap 'ef0dfe55-c08c-4d9e-ad32-d6b6d5cbdac6' not found"}
This means that libvirt cannot find the dirty bitmap when starting the backup.
When we start a backup, we get the list of checkpoints from libvirt and we redefine all checkpoints. We assume that all redefined checkpoints have a bitmap in qemu at the time of the redefine.
Peter, is this assumption correct?
Yes. I want to add that libvirt checks whether bitmaps are present when kicking off the backup job, so qemu reporting the error is weird.
If libvirt and engine agree on the existing checkpoints, we start the backup. In this case one of the bitmaps was missing, so the backup failed.
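The agreement check described above can be sketched roughly as follows. This is a minimal illustration, not the vdsm or engine code: the function names are hypothetical, and comparing the two lists in order is an assumption.

```python
def checkpoints_agree(engine_ids, libvirt_ids):
    """Return True when both sides list the same checkpoints in the same order."""
    return list(engine_ids) == list(libvirt_ids)


def missing_checkpoints(engine_ids, libvirt_ids):
    """Checkpoint IDs known to engine but absent from libvirt, in engine order."""
    known = set(libvirt_ids)
    return [cp for cp in engine_ids if cp not in known]
```

Note that a check like this only compares checkpoint metadata. It says nothing about whether the matching bitmap still exists in qemu, which is the gap this failure fell into.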
We know about some flows that may cause loss of the bitmaps:
- copying disks (bitmaps are not copied yet)
- live storage migration (it copies the disks)
- deleting snapshots
- live migration may cause this, not tested yet
- unclean shutdown of the vm
- storage is not accessible when vm is terminated
Did you do any of these operations on the tested vm?
Some of the issues can never be fixed, like an unclean shutdown or a storage issue when qemu tries to write the bitmaps to disk. So your backup application must be able to recover from this error.
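On the application side, such recovery could look roughly like the sketch below. Everything here is hypothetical: `BackupFailed` and the three callables stand in for whatever the backup application uses, and dropping the checkpoints before the full backup is an assumption about what recovery requires.

```python
class BackupFailed(Exception):
    """Hypothetical error raised by the caller-supplied backup functions."""


def backup_with_fallback(run_incremental, run_full, drop_checkpoints):
    """Try an incremental backup; on failure, reset checkpoints and go full.

    All three arguments are caller-supplied callables (hypothetical names).
    Returns "incremental" or "full" to tell the caller what was performed.
    """
    try:
        run_incremental()
        return "incremental"
    except BackupFailed:
        # The incremental chain can no longer be trusted; start a new one.
        drop_checkpoints()
        run_full()
        return "full"
```

If the full backup fails as well, the exception propagates to the caller, since at that point the problem needs investigation rather than another retry.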
However, oVirt does not provide a useful error that enables recovery. We have a special error for when a full backup is required, but it seems that this error is not returned in this case; instead we return an internal error.
Also, since you cannot do a full backup after this error, I guess that engine did not delete the checkpoint with the missing bitmap. This is not surprising, since the error returned from vdsm is a generic error (BackupError), so engine cannot tell the reason for the failure.
This is weird. Libvirt shouldn't be touching any bitmaps when doing a full backup, so that should work regardless of the bitmap state.
Did you check the backup events? What was the backup completion event?
See this example of how to get backup events: https://github.com/oVirt/ovirt-engine-sdk/blob/4a143351fcd3cdb0df8c508889316... https://github.com/oVirt/ovirt-engine-sdk/blob/4a143351fcd3cdb0df8c508889316...
Eyal, looking at the API docs: http://ovirt.github.io/ovirt-engine-api-model/master/#types/event
Event code is an integer. This is not usable for detecting errors since the value is not part of the API. We need an enum like: http://ovirt.github.io/ovirt-engine-api-model/4.4/#types/image_transfer_phas...
Peter, do we have a specific error code in libvirt for a missing bitmap? We need this to pass a useful error to engine, and engine needs it to pass a useful error to the user.
When the error is detected by libvirt we definitely can report a specific error code. Please file a feature request for that. In this case the error is reported by qemu though and we can't do much there. But it should not have happened in the first place.
The error seen here is generated by:
    flags = libvirt.VIR_DOMAIN_BACKUP_BEGIN_REUSE_EXTERNAL
    try:
        dom.backupBegin(backup_xml, checkpoint_xml, flags=flags)
    except libvirt.libvirtError as e:
        raise exception.BackupError(
            reason="Error starting backup: {}".format(e),
            vm_id=vm.id,
            backup=backup_cfg)
Looking in the documentation: https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin
There are no specified errors, so we cannot detect the reason for the failure and return a meaningful error to our caller.
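Until a specific error code exists, one fragile stopgap would be to match the error text itself. The sketch below assumes the message keeps the wording quoted in the failure at the top of this thread. That wording is not part of any API and may change, and the function name is hypothetical.

```python
import re

# Pattern taken from the qemu message quoted earlier in this thread.
# ASSUMPTION: the wording is stable; nothing guarantees that.
_MISSING_BITMAP = re.compile(r"Dirty bitmap '([^']+)' not found")


def missing_bitmap_id(error_message):
    """Return the bitmap ID named in a "not found" error, or None."""
    match = _MISSING_BITMAP.search(error_message)
    return match.group(1) if match else None
```

A caller could pass `str(e)` from the `libvirtError` handler and treat a non-None result as "incremental chain broken, full backup required", leaving every other failure as a generic error.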
Peter, how do you suggest recovering from internal errors? How can we tell whether this is a temporary error that may succeed on the next attempt, or an error that requires starting over from a full backup?
Uff, that really depends on what happened. In this case we need to investigate it first.