
Hello, I saw there are other threads asking how to delete disk snapshots left over from backup operations. We definitely need a tool to kill pending backup operations and locked snapshots. I think this is very frustrating; oVirt is a good piece of software, but it's very immature in a dirty asynchronous world. We need a unified toolbox for manual cleanup and database housekeeping.

On Fri, Aug 26, 2022 at 6:25 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
Hello, I saw there are other threads asking how to delete disk snapshots left over from backup operations. We definitely need a tool to kill pending backup operations and locked snapshots. I think this is very frustrating; oVirt is a good piece of software, but it's very immature in a dirty asynchronous world. We need a unified toolbox for manual cleanup and database housekeeping.
Note that the thread you refer to is about a snapshot-based mechanism for backup. While we are still testing it (and unfortunately didn't notice the reported issues in our environments, so we need more information, as Benny pointed out), we have been putting our efforts into an alternative mechanism that is based on incremental backup. This mechanism is supported (since oVirt 4.5.1, I believe) and should provide you with ways to finalize backups. It would be great if you could elaborate on what "pending backup operations" means, to see whether it is covered by the new mechanism we call "Hybrid backup": https://www.ovirt.org/media/Hybrid-backup-v8.pdf
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/MNVW4FT3Y24ATI...

Thank you for your support; I'm conscious of how difficult it is to keep everything in line. I'm currently trying to find the correct workflow to make backups (using CBR) of VMs. I tried both vProtect (with the current technology preview) and Veeam (community edition, using the RHV plugin), and I'm currently experiencing very annoying problems. I can give you the engine log: https://cloud.ssis.sm/index.php/s/M9DqFHSaowYqa9H. I currently have two machines in an inconsistent state from the snapshot point of view: SSIS-otobo and SSIS-TPayX2go. I emptied the image_transfers table; last time it helped. This is the SQL to restore it:

INSERT INTO public.image_transfers (command_id,command_type,phase,last_updated,message,vds_id,disk_id,imaged_ticket_id,proxy_uri,bytes_sent,bytes_total,"type",active,daemon_uri,client_inactivity_timeout,image_format,backend,backup_id,client_type,shallow,timeout_policy) VALUES ('54097389-db69-4aa3-a34d-eb6cb2c1fc4b',1024,7,'2022-08-26 15:00:46.138+02',NULL,'bac4cca5-b6db-4d66-af65-39b8929262b7','5d18a058-652f-4c94-a9ff-9c15152c61b4','1e1846a1-f9f0-49e5-912e-2f5bf8dd8144','https://ovirt-engine.ovirt:54323/images',12307202048,42949640192,1,false,'https://ovirt-node3.ovirt:54322/images',3600,5,1,'7e06b6e9-92d9-4f83-ac16-9a06a638fac3',2,false,'legacy');

I'm currently stuck because I cannot even remove the VMs, as they are "locked during a backup operation".

On Mon, Aug 29, 2022 at 2:20 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
Thank you for your support; I'm conscious of how difficult it is to keep everything in line. I'm currently trying to find the correct workflow to make backups (using CBR) of VMs. I tried both vProtect (with the current technology preview) and Veeam (community edition, using the RHV plugin), and I'm currently experiencing very annoying problems. I can give you the engine log: https://cloud.ssis.sm/index.php/s/M9DqFHSaowYqa9H. I currently have two machines in an inconsistent state from the snapshot point of view: SSIS-otobo and SSIS-TPayX2go.
I emptied the image_transfers table; last time it helped. This is the SQL to restore it:

INSERT INTO public.image_transfers (command_id,command_type,phase,last_updated,message,vds_id,disk_id,imaged_ticket_id,proxy_uri,bytes_sent,bytes_total,"type",active,daemon_uri,client_inactivity_timeout,image_format,backend,backup_id,client_type,shallow,timeout_policy) VALUES ('54097389-db69-4aa3-a34d-eb6cb2c1fc4b',1024,7,'2022-08-26 15:00:46.138+02',NULL,'bac4cca5-b6db-4d66-af65-39b8929262b7','5d18a058-652f-4c94-a9ff-9c15152c61b4','1e1846a1-f9f0-49e5-912e-2f5bf8dd8144','https://ovirt-engine.ovirt:54323/images',12307202048,42949640192,1,false,'https://ovirt-node3.ovirt:54322/images',3600,5,1,'7e06b6e9-92d9-4f83-ac16-9a06a638fac3',2,false,'legacy');
I don't see anything suspicious in that log - I see backups of both of the aforementioned VMs, but they all seem to succeed, including the snapshot removals. It could, however, be that something not covered by this log went wrong, as I can't see backup 7e06b6e9-92d9-4f83-ac16-9a06a638fac3 there.
I'm currently stuck because I cannot even remove the VMs, as they are "locked during a backup operation".
I see two options:
1. Stop ovirt-engine, remove the relevant parts from the command_entities table and unlock the entities (there's a script for that: https://github.com/oVirt/ovirt-engine/blob/master/packaging/setup/dbutils/un...), and then restart ovirt-engine.
2. Finalize the ongoing image transfers [1] and then finalize the backup [2].
The second option should be much simpler.
[1] https://github.com/oVirt/ovirt-engine-api-model/blob/4.5.11/src/main/java/se...
[2] https://github.com/oVirt/ovirt-engine-api-model/blob/4.5.11/src/main/java/se...
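[Editor's note] The second option can be scripted against the oVirt REST API. Below is a minimal, hedged sketch, not official oVirt tooling: the helper names are made up, the credentials are placeholders, and it assumes the engine API is reachable at the usual /ovirt-engine/api base with a CA certificate the client already trusts. The transfer ID shown in the usage note is the command_id from the image_transfers row quoted earlier in the thread.

```python
# Sketch: build and POST the REST "finalize" actions for a stuck image
# transfer and its backup. Helper names are illustrative, not oVirt API names.
import urllib.request


def transfer_finalize_url(api_base: str, transfer_id: str) -> str:
    """REST 'finalize' action endpoint for an image transfer."""
    return f"{api_base}/imagetransfers/{transfer_id}/finalize"


def backup_finalize_url(api_base: str, vm_id: str, backup_id: str) -> str:
    """REST 'finalize' action endpoint for a VM backup."""
    return f"{api_base}/vms/{vm_id}/backups/{backup_id}/finalize"


def post_action(url: str, user: str, password: str) -> int:
    """POST an empty <action/> body, as oVirt action endpoints expect.

    Uses basic auth; assumes the engine's CA is trusted by this client.
    Returns the HTTP status code.
    """
    req = urllib.request.Request(url, data=b"<action/>", method="POST")
    req.add_header("Content-Type", "application/xml")
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))
    with opener.open(req) as resp:
        return resp.status
```

Usage would look like `post_action(transfer_finalize_url("https://ovirt-engine.ovirt/ovirt-engine/api", "54097389-db69-4aa3-a34d-eb6cb2c1fc4b"), "admin@internal", "...")`, finalizing the transfer first and the backup second, per the order suggested above.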

Thanks Arik, we tried your solution, but with no successful results. We also gathered other info and combined it into this solution: we deleted from the DB the rows in vm_backups and vm_disk_map related to the hung backup. Then we tried to delete the locked snapshot; after the row deletion the message "cannot delete snapshot during backup operations" no longer appeared, but the deletion failed anyway. So we watched the log file /var/log/ovirt-engine/engine.log and saw this message:

Is another process using the image [/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/f17c3443-b62f-43f5-b35c-5ba9225abaf4]

Then we searched on each node for the process holding the file, with the command:

lsof | grep f17c3443-b62f-43f5-b35c-5ba9225abaf4

and we found on one node that the file was in use by a qemu-nbd process. We killed that process and were finally able to delete the snapshot. Thanks again for your support.
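[Editor's note] The hunt above can be mechanized. Here is a small, hedged sketch (the function name is made up) that parses `lsof`'s default output columns (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME) and returns the PIDs of qemu-nbd processes holding a file whose path contains a given volume ID:

```python
# Sketch: find qemu-nbd PIDs holding a given image volume, from lsof output.
def nbd_pids_holding(lsof_output: str, volume_id: str) -> set:
    """Return PIDs of qemu-nbd processes whose open-file path contains volume_id."""
    pids = set()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Default lsof columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
        if len(fields) < 9:
            continue
        command, pid, name = fields[0], fields[1], fields[-1]
        if command.startswith("qemu-nbd") and volume_id in name:
            pids.add(int(pid))
    return pids
```

One would feed it the output of `lsof` run on each node and, as in the thread, kill the resulting PIDs only after confirming no legitimate transfer is still using them.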

Do you have the logs (engine.log, vdsm.log) for this? qemu-nbd holding the lock might mean the transfer was not finalized properly and the NBD server was left open, which should not happen...

On Wed, Aug 31, 2022 at 1:57 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
Thanks Arik, we tried your solution, but with no successful results. We also gathered other info and combined it into this solution:
we deleted from the DB the rows in vm_backups and vm_disk_map related to the hung backup.
Then we tried to delete the locked snapshot; after the row deletion the message "cannot delete snapshot during backup operations" no longer appeared, but the deletion failed anyway.
So we watched the log file /var/log/ovirt-engine/engine.log and saw this message:

Is another process using the image [/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/f17c3443-b62f-43f5-b35c-5ba9225abaf4]

Then we searched on each node for the process holding the file, with the command:

lsof | grep f17c3443-b62f-43f5-b35c-5ba9225abaf4

and we found on one node that the file was in use by a qemu-nbd process. We killed that process and were finally able to delete the snapshot.
Thanks again for your support.

One process that I killed was:

[root@ovirt-node4 ~]# ps axuww | grep qemu-nbd
vdsm 588156 0.0 0.0 308192 39840 ? Ssl Aug26 0:12 /usr/bin/qemu-nbd --socket /run/vdsm/nbd/c7653559-508b-4e4a-a591-32dec3e5a29d.sock --persistent --shared=8 --export-name= --cache=none --aio=native --allocation-depth --read-only json:{"driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/7a05ff72-370e-4d0c-ab56-5f161cc98318/52ee7e19-ac78-4c05-81e2-0c75dad71732"}}

The event that was talking about the "file lock" is this one in engine.log:

2022-08-31 10:39:16,431Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-91) [617f02f1-29e6-45f4-b5bc-aa4c16b02b15] FINISH, GetHostJobsVDSCommand, return: {667f22f0-3df5-43b8-b94d-c0ee48424247=HostJobInfo:{id='667f22f0-3df5-43b8-b94d-c0ee48424247', type='storage', description='merge_subchain', status='failed', progress='0', error='VDSError:{code='GeneralException', message='General Exception: ('Command [\'/usr/bin/qemu-img\', \'commit\', \'-p\', \'-t\', \'none\', \'-b\', \'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/7a05ff72-370e-4d0c-ab56-5f161cc98318/d0e6f4b9-ae29-493d-b22f-c6aee55d84ae\', \'-f\', \'qcow2\', \'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/7a05ff72-370e-4d0c-ab56-5f161cc98318/52ee7e19-ac78-4c05-81e2-0c75dad71732\'] failed with rc=1 out=b\'\' err=bytearray(b\'qemu-img: Could not open \\\'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/7a05ff72-370e-4d0c-ab56-5f161cc98318/52ee7e19-ac78-4c05-81e2-0c75dad71732\\\': Failed to get "write" lock\\nIs another process using the image [/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/7a05ff72-370e-4d0c-ab56-5f161cc98318/52ee7e19-ac78-4c05-81e2-0c75dad71732]?\\n\')',)'}'}}, log id: 2c62cd00

These are the logs related to this qemu-nbd invocation: https://cloud.ssis.sm/index.php/s/RSpQJHEeDxai5ea

The other qemu-nbd:

[root@ovirt-node3 ~]# ps -ef | grep qemu-nbd
vdsm 1795830 1 0 Aug26 ? 00:00:09 /usr/bin/qemu-nbd --socket /run/vdsm/nbd/54097389-db69-4aa3-a34d-eb6cb2c1fc4b.sock --persistent --shared=8 --export-name= --cache=none --aio=native --allocation-depth --read-only json:{"driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/f17c3443-b62f-43f5-b35c-5ba9225abaf4"}}

The engine.log lock event:

2022-08-31 10:23:54,033Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-91) [4a3cf96d-5fb8-4479-9eda-4012d1cae8c2] FINISH, GetHostJobsVDSCommand, return: {f0df76ab-c708-4c0f-ad8e-ecf26c8910d9=HostJobInfo:{id='f0df76ab-c708-4c0f-ad8e-ecf26c8910d9', type='storage', description='merge_subchain', status='failed', progress='0', error='VDSError:{code='GeneralException', message='General Exception: ('Command [\'/usr/bin/qemu-img\', \'commit\', \'-p\', \'-t\', \'none\', \'-b\', \'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/f17c3443-b62f-43f5-b35c-5ba9225abaf4\', \'-f\', \'qcow2\', \'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/98ddfc07-8d97-4511-bfef-1ea0ce72b5a6\'] failed with rc=1 out=b\'\' err=bytearray(b\'qemu-img: Failed to get "write" lock\\nIs another process using the image [/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/f17c3443-b62f-43f5-b35c-5ba9225abaf4]?\\n\')',)'}'}}, log id: 380108e6

And these are the related logs: https://cloud.ssis.sm/index.php/s/p3LAAkXxRrAbdxK
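[Editor's note] Worth noting: the vdsm NBD socket path encodes the image transfer's command ID - the socket /run/vdsm/nbd/54097389-db69-4aa3-a34d-eb6cb2c1fc4b.sock in the ps output above matches the command_id of the image_transfers row quoted earlier in the thread, so a leftover qemu-nbd process can be traced back to its transfer. A hedged sketch of that mapping (the function name is made up):

```python
import re

# The vdsm NBD socket under /run/vdsm/nbd/ is named after the image transfer's
# command_id, so a stale qemu-nbd process can be matched to its image_transfers row.
_SOCK_RE = re.compile(
    r"/run/vdsm/nbd/"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\.sock"
)


def transfer_id_from_cmdline(cmdline: str):
    """Return the transfer command_id embedded in a qemu-nbd command line, or None."""
    m = _SOCK_RE.search(cmdline)
    return m.group(1) if m else None
```

Running this over each qemu-nbd line from `ps` gives the transfer ID to look up (or finalize) before resorting to killing the process.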

I should also add that I upgraded the engine on 2022-08-22, so I have had the latest "stable" since then:

[root@ovirt-engine dbutils]# rpm -qi ovirt-engine-4.5.2.4-1.el8.noarch
Name        : ovirt-engine
Version     : 4.5.2.4
Release     : 1.el8
Architecture: noarch
Install Date: Mon Aug 22 08:17:41 2022
Group       : Virtualization/Management
Size        : 39473100
License     : ASL 2.0
Signature   : RSA/SHA256, Sun Aug 21 15:16:08 2022, Key ID ab8c4f9dfe590cb7
Source RPM  : ovirt-engine-4.5.2.4-1.el8.src.rpm

Thanks Diego, I checked one of the failures and I see:

2022-08-26 13:00:16,067Z ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [51fb711e-0fbc-4590-8ff0-638a041b13a5] Failed to extend proxy ticket '1e1846a1-f9f0-49e5-912e-2f5bf8dd8144' for image transfer '54097389-db69-4aa3-a34d-eb6cb2c1fc4b': {}: java.lang.RuntimeException: ImageioClient request failed. Status: 404, Reason: Not Found, Error: No such ticket: 1e1846a1-f9f0-49e5-912e-2f5bf8dd8144.
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.ImageioClient.executeRequest(ImageioClient.java:134)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.ImageioClient.extendTicket(ImageioClient.java:89)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.extendImageTransferSession(TransferDiskImageCommand.java:1353)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.extendTicketIfNecessary(TransferDiskImageCommand.java:785)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.handleTransferring(TransferDiskImageCommand.java:774)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.executeStateHandler(TransferDiskImageCommand.java:593)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.proceedCommandExecution(TransferDiskImageCommand.java:574)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommandCallback.doPolling(TransferImageCommandCallback.java:21)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at org.glassfish.javax.enterprise.concurrent//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:360)
at org.glassfish.javax.enterprise.concurrent//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:511)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
at org.glassfish.javax.enterprise.concurrent//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:227)
...

I see that it happened after restarting, so it looks like the restart messed up the cleanup sequence and did not close the NBD server. Do you have the imageio logs? They should be available on the host that performed the transfer, under /var/log/ovirt-imageio/daemon.log. And please submit a bug for this with these logs.

On Wed, Aug 31, 2022 at 4:03 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
I should also add that I upgraded the engine on 2022-08-22, so I have had the latest "stable" since then:

[root@ovirt-engine dbutils]# rpm -qi ovirt-engine-4.5.2.4-1.el8.noarch
Name        : ovirt-engine
Version     : 4.5.2.4
Release     : 1.el8
Architecture: noarch
Install Date: Mon Aug 22 08:17:41 2022
Group       : Virtualization/Management
Size        : 39473100
License     : ASL 2.0
Signature   : RSA/SHA256, Sun Aug 21 15:16:08 2022, Key ID ab8c4f9dfe590cb7
Source RPM  : ovirt-engine-4.5.2.4-1.el8.src.rpm

There is only one daemon.log per directory. Here is the archive with the daemon.log: https://cloud.ssis.sm/index.php/s/y6XxgH7CcrL5AC3

I will create the bug report referring to this thread. Thank you.

This is the bug report I filed: https://bugzilla.redhat.com/show_bug.cgi?id=2123008

Thanks Diego, I was able to reproduce it manually; it shouldn't be too difficult to fix.

On Wed, Aug 31, 2022 at 5:37 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
This is the bug report I filed: https://bugzilla.redhat.com/show_bug.cgi?id=2123008

Thank you for the support; hoping this helps improve the resilience of the implementation.
participants (3)
- Arik Hadas
- Benny Zlotnik
- Diego Ercolani