
Hello, I saw there are other threads asking how to delete disk snapshots left over from backup operations. We definitely need a tool to kill pending backup operations and locked snapshots. I think this is very frustrating; oVirt is a good piece of software, but it's very immature in a dirty asynchronous world. We need a unified toolbox for manual cleanup and database housekeeping.

On Fri, Aug 26, 2022 at 6:25 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
Hello, I saw there are other threads asking how to delete disk snapshots left over from backup operations. We definitely need a tool to kill pending backup operations and locked snapshots. I think this is very frustrating; oVirt is a good piece of software, but it's very immature in a dirty asynchronous world. We need a unified toolbox for manual cleanup and database housekeeping.
Note that the thread you refer to is about a snapshot-based mechanism for backup. While we are still testing it (and unfortunately didn't notice the reported issues in our environments, so we need more information, as Benny pointed out), we have been putting our efforts into an alternative mechanism that is based on incremental backup. This mechanism is supported (since oVirt 4.5.1, I believe) and should provide you with ways to finalize backups. It would be great if you could elaborate on what "pending backup operations" means, to see whether it is covered by the new mechanism we call "Hybrid backup": https://www.ovirt.org/media/Hybrid-backup-v8.pdf
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/MNVW4FT3Y24ATI...

Thank you for your support; I'm conscious of how difficult it is to keep everything in line. I'm currently trying to find the correct workflow to make backups (using CBR) of VMs. I tried both vProtect (with the current technology preview) and Veeam (community edition, using the RHV plugin), and I'm currently experiencing very annoying problems. I can give you the engine log: https://cloud.ssis.sm/index.php/s/M9DqFHSaowYqa9H. I currently have two machines in an inconsistent state from the snapshot point of view: SSIS-otobo and SSIS-TPayX2go. I emptied the image_transfers table; last time it helped. This is the SQL to restore it:

INSERT INTO public.image_transfers (command_id,command_type,phase,last_updated,message,vds_id,disk_id,imaged_ticket_id,proxy_uri,bytes_sent,bytes_total,"type",active,daemon_uri,client_inactivity_timeout,image_format,backend,backup_id,client_type,shallow,timeout_policy) VALUES ('54097389-db69-4aa3-a34d-eb6cb2c1fc4b',1024,7,'2022-08-26 15:00:46.138+02',NULL,'bac4cca5-b6db-4d66-af65-39b8929262b7','5d18a058-652f-4c94-a9ff-9c15152c61b4','1e1846a1-f9f0-49e5-912e-2f5bf8dd8144','https://ovirt-engine.ovirt:54323/images',12307202048,42949640192,1,false,'https://ovirt-node3.ovirt:54322/images',3600,5,1,'7e06b6e9-92d9-4f83-ac16-9a06a638fac3',2,false,'legacy');

I'm currently stuck because I cannot even remove the VMs, as they are "locked during a backup operation".

On Mon, Aug 29, 2022 at 2:20 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
Thank you for your support; I'm conscious of how difficult it is to keep everything in line. I'm currently trying to find the correct workflow to make backups (using CBR) of VMs. I tried both vProtect (with the current technology preview) and Veeam (community edition, using the RHV plugin), and I'm currently experiencing very annoying problems. I can give you the engine log: https://cloud.ssis.sm/index.php/s/M9DqFHSaowYqa9H. I currently have two machines in an inconsistent state from the snapshot point of view: SSIS-otobo and SSIS-TPayX2go.
I emptied the image_transfers table; last time it helped. This is the SQL to restore it:

INSERT INTO public.image_transfers (command_id,command_type,phase,last_updated,message,vds_id,disk_id,imaged_ticket_id,proxy_uri,bytes_sent,bytes_total,"type",active,daemon_uri,client_inactivity_timeout,image_format,backend,backup_id,client_type,shallow,timeout_policy) VALUES ('54097389-db69-4aa3-a34d-eb6cb2c1fc4b',1024,7,'2022-08-26 15:00:46.138+02',NULL,'bac4cca5-b6db-4d66-af65-39b8929262b7','5d18a058-652f-4c94-a9ff-9c15152c61b4','1e1846a1-f9f0-49e5-912e-2f5bf8dd8144','https://ovirt-engine.ovirt:54323/images',12307202048,42949640192,1,false,'https://ovirt-node3.ovirt:54322/images',3600,5,1,'7e06b6e9-92d9-4f83-ac16-9a06a638fac3',2,false,'legacy');
I don't see anything suspicious in that log - I see backups of both of the aforementioned VMs, but they all seem to succeed, including the snapshot removals. It could, however, be that something not covered by this log went wrong, as I can't see backup 7e06b6e9-92d9-4f83-ac16-9a06a638fac3 there.
I'm currently stuck because I cannot even remove the VMs, as they are "locked during a backup operation".
I see two options:
1. Stop ovirt-engine, remove the relevant parts from the command_entities table and unlock the entities (there's a script for that: https://github.com/oVirt/ovirt-engine/blob/master/packaging/setup/dbutils/un...), and then restart ovirt-engine.
2. Finalize the ongoing image transfers [1] and then finalize the backup [2].
The second option should be much simpler.
[1] https://github.com/oVirt/ovirt-engine-api-model/blob/4.5.11/src/main/java/se...
[2] https://github.com/oVirt/ovirt-engine-api-model/blob/4.5.11/src/main/java/se...
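[Editor's note] The second option can be scripted against the oVirt REST API. Below is a minimal, hedged sketch, not official oVirt tooling: the helper names are made up, the credentials are placeholders, and it assumes the engine API is reachable at the usual /ovirt-engine/api base with a CA certificate the client already trusts. The transfer ID shown in the usage note is the command_id from the image_transfers row quoted earlier in the thread.

```python
# Sketch: build and POST the REST "finalize" actions for a stuck image
# transfer and its backup. Helper names are illustrative, not oVirt API names.
import urllib.request


def transfer_finalize_url(api_base: str, transfer_id: str) -> str:
    """REST 'finalize' action endpoint for an image transfer."""
    return f"{api_base}/imagetransfers/{transfer_id}/finalize"


def backup_finalize_url(api_base: str, vm_id: str, backup_id: str) -> str:
    """REST 'finalize' action endpoint for a VM backup."""
    return f"{api_base}/vms/{vm_id}/backups/{backup_id}/finalize"


def post_action(url: str, user: str, password: str) -> int:
    """POST an empty <action/> body, as oVirt action endpoints expect.

    Uses basic auth; assumes the engine's CA is trusted by this client.
    Returns the HTTP status code.
    """
    req = urllib.request.Request(url, data=b"<action/>", method="POST")
    req.add_header("Content-Type", "application/xml")
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))
    with opener.open(req) as resp:
        return resp.status
```

Usage would look like `post_action(transfer_finalize_url("https://ovirt-engine.ovirt/ovirt-engine/api", "54097389-db69-4aa3-a34d-eb6cb2c1fc4b"), "admin@internal", "...")`, finalizing the transfer first and the backup second, per the order suggested above.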

Thanks Arik, we tried your solution, but with no successful results. We also gathered other info and combined it into this solution: we deleted from the DB the rows in vm_backups and vm_disk_map related to the hung backup. Then we tried to delete the locked snapshot; after the row deletion the message "cannot delete snapshot during backup operations" no longer appeared, but the deletion failed anyway. So we watched the log file /var/log/ovirt-engine/engine.log and saw this message:

Is another process using the image [/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/f17c3443-b62f-43f5-b35c-5ba9225abaf4]

Then we searched on each node for the process holding the file, with the command:

lsof | grep f17c3443-b62f-43f5-b35c-5ba9225abaf4

and we found on one node that the file was in use by a qemu-nbd process. We killed that process and were finally able to delete the snapshot. Thanks again for your support.
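[Editor's note] The hunt above can be mechanized. Here is a small, hedged sketch (the function name is made up) that parses `lsof`'s default output columns (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME) and returns the PIDs of qemu-nbd processes holding a file whose path contains a given volume ID:

```python
# Sketch: find qemu-nbd PIDs holding a given image volume, from lsof output.
def nbd_pids_holding(lsof_output: str, volume_id: str) -> set:
    """Return PIDs of qemu-nbd processes whose open-file path contains volume_id."""
    pids = set()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Default lsof columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
        if len(fields) < 9:
            continue
        command, pid, name = fields[0], fields[1], fields[-1]
        if command.startswith("qemu-nbd") and volume_id in name:
            pids.add(int(pid))
    return pids
```

One would feed it the output of `lsof` run on each node and, as in the thread, kill the resulting PIDs only after confirming no legitimate transfer is still using them.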

Do you have the logs (engine.log, vdsm.log) for this? qemu-nbd holding the lock might mean the transfer was not finalized properly and the NBD server was left open, which should not happen...

On Wed, Aug 31, 2022 at 1:57 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
Thanks Arik, we tried your solution, but with no successful results. We also gathered other info and combined it into this solution:
we deleted from the DB the rows in vm_backups and vm_disk_map related to the hung backup.
Then we tried to delete the locked snapshot; after the row deletion the message "cannot delete snapshot during backup operations" no longer appeared, but the deletion failed anyway.
So we watched the log file /var/log/ovirt-engine/engine.log and saw this message:

Is another process using the image [/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/f17c3443-b62f-43f5-b35c-5ba9225abaf4]

Then we searched on each node for the process holding the file, with the command:

lsof | grep f17c3443-b62f-43f5-b35c-5ba9225abaf4

and we found on one node that the file was in use by a qemu-nbd process. We killed that process and were finally able to delete the snapshot.
Thanks again for your support.

One process that I killed was:

[root@ovirt-node4 ~]# ps axuww | grep qemu-nbd
vdsm 588156 0.0 0.0 308192 39840 ? Ssl Aug26 0:12 /usr/bin/qemu-nbd --socket /run/vdsm/nbd/c7653559-508b-4e4a-a591-32dec3e5a29d.sock --persistent --shared=8 --export-name= --cache=none --aio=native --allocation-depth --read-only json:{"driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/7a05ff72-370e-4d0c-ab56-5f161cc98318/52ee7e19-ac78-4c05-81e2-0c75dad71732"}}

The event that was talking about the "file lock" is this one in engine.log:

2022-08-31 10:39:16,431Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-91) [617f02f1-29e6-45f4-b5bc-aa4c16b02b15] FINISH, GetHostJobsVDSCommand, return: {667f22f0-3df5-43b8-b94d-c0ee48424247=HostJobInfo:{id='667f22f0-3df5-43b8-b94d-c0ee48424247', type='storage', description='merge_subchain', status='failed', progress='0', error='VDSError:{code='GeneralException', message='General Exception: ('Command [\'/usr/bin/qemu-img\', \'commit\', \'-p\', \'-t\', \'none\', \'-b\', \'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/7a05ff72-370e-4d0c-ab56-5f161cc98318/d0e6f4b9-ae29-493d-b22f-c6aee55d84ae\', \'-f\', \'qcow2\', \'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/7a05ff72-370e-4d0c-ab56-5f161cc98318/52ee7e19-ac78-4c05-81e2-0c75dad71732\'] failed with rc=1 out=b\'\' err=bytearray(b\'qemu-img: Could not open \\\'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/7a05ff72-370e-4d0c-ab56-5f161cc98318/52ee7e19-ac78-4c05-81e2-0c75dad71732\\\': Failed to get "write" lock\\nIs another process using the image [/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/7a05ff72-370e-4d0c-ab56-5f161cc98318/52ee7e19-ac78-4c05-81e2-0c75dad71732]?\\n\')',)'}'}}, log id: 2c62cd00

These are the logs related to this qemu-nbd invocation: https://cloud.ssis.sm/index.php/s/RSpQJHEeDxai5ea

The other qemu-nbd:

[root@ovirt-node3 ~]# ps -ef | grep qemu-nbd
vdsm 1795830 1 0 Aug26 ? 00:00:09 /usr/bin/qemu-nbd --socket /run/vdsm/nbd/54097389-db69-4aa3-a34d-eb6cb2c1fc4b.sock --persistent --shared=8 --export-name= --cache=none --aio=native --allocation-depth --read-only json:{"driver": "qcow2", "file": {"driver": "file", "filename": "/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/f17c3443-b62f-43f5-b35c-5ba9225abaf4"}}

The engine.log lock event:

2022-08-31 10:23:54,033Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-91) [4a3cf96d-5fb8-4479-9eda-4012d1cae8c2] FINISH, GetHostJobsVDSCommand, return: {f0df76ab-c708-4c0f-ad8e-ecf26c8910d9=HostJobInfo:{id='f0df76ab-c708-4c0f-ad8e-ecf26c8910d9', type='storage', description='merge_subchain', status='failed', progress='0', error='VDSError:{code='GeneralException', message='General Exception: ('Command [\'/usr/bin/qemu-img\', \'commit\', \'-p\', \'-t\', \'none\', \'-b\', \'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/f17c3443-b62f-43f5-b35c-5ba9225abaf4\', \'-f\', \'qcow2\', \'/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/98ddfc07-8d97-4511-bfef-1ea0ce72b5a6\'] failed with rc=1 out=b\'\' err=bytearray(b\'qemu-img: Failed to get "write" lock\\nIs another process using the image [/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/images/5d18a058-652f-4c94-a9ff-9c15152c61b4/f17c3443-b62f-43f5-b35c-5ba9225abaf4]?\\n\')',)'}'}}, log id: 380108e6

And these are the related logs: https://cloud.ssis.sm/index.php/s/p3LAAkXxRrAbdxK
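[Editor's note] Worth noting: the vdsm NBD socket path encodes the image transfer's command ID - the socket /run/vdsm/nbd/54097389-db69-4aa3-a34d-eb6cb2c1fc4b.sock in the ps output above matches the command_id of the image_transfers row quoted earlier in the thread, so a leftover qemu-nbd process can be traced back to its transfer. A hedged sketch of that mapping (the function name is made up):

```python
import re

# The vdsm NBD socket under /run/vdsm/nbd/ is named after the image transfer's
# command_id, so a stale qemu-nbd process can be matched to its image_transfers row.
_SOCK_RE = re.compile(
    r"/run/vdsm/nbd/"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\.sock"
)


def transfer_id_from_cmdline(cmdline: str):
    """Return the transfer command_id embedded in a qemu-nbd command line, or None."""
    m = _SOCK_RE.search(cmdline)
    return m.group(1) if m else None
```

Running this over each qemu-nbd line from `ps` gives the transfer ID to look up (or finalize) before resorting to killing the process.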

I should also add that I upgraded the engine on 2022-08-22, so I have had the latest "stable" since then:

[root@ovirt-engine dbutils]# rpm -qi ovirt-engine-4.5.2.4-1.el8.noarch
Name        : ovirt-engine
Version     : 4.5.2.4
Release     : 1.el8
Architecture: noarch
Install Date: Mon Aug 22 08:17:41 2022
Group       : Virtualization/Management
Size        : 39473100
License     : ASL 2.0
Signature   : RSA/SHA256, Sun Aug 21 15:16:08 2022, Key ID ab8c4f9dfe590cb7
Source RPM  : ovirt-engine-4.5.2.4-1.el8.src.rpm

Thanks Diego, I checked one of the failures and I see:

2022-08-26 13:00:16,067Z ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [51fb711e-0fbc-4590-8ff0-638a041b13a5] Failed to extend proxy ticket '1e1846a1-f9f0-49e5-912e-2f5bf8dd8144' for image transfer '54097389-db69-4aa3-a34d-eb6cb2c1fc4b': {}: java.lang.RuntimeException: ImageioClient request failed. Status: 404, Reason: Not Found, Error: No such ticket: 1e1846a1-f9f0-49e5-912e-2f5bf8dd8144.
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.ImageioClient.executeRequest(ImageioClient.java:134)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.ImageioClient.extendTicket(ImageioClient.java:89)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.extendImageTransferSession(TransferDiskImageCommand.java:1353)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.extendTicketIfNecessary(TransferDiskImageCommand.java:785)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.handleTransferring(TransferDiskImageCommand.java:774)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.executeStateHandler(TransferDiskImageCommand.java:593)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.proceedCommandExecution(TransferDiskImageCommand.java:574)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferImageCommandCallback.doPolling(TransferImageCommandCallback.java:21)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at org.glassfish.javax.enterprise.concurrent//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:360)
at org.glassfish.javax.enterprise.concurrent//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:511)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
at org.glassfish.javax.enterprise.concurrent//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:227)
...

I see that it happened after restarting, so it looks like the restart messed up the cleanup sequence and did not close the NBD server. Do you have the imageio logs? They should be available on the host that performed the transfer, under /var/log/ovirt-imageio/daemon.log. And please submit a bug for this with these logs.

On Wed, Aug 31, 2022 at 4:03 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
I should also add that I upgraded the engine on 2022-08-22, so I have had the latest "stable" since then:

[root@ovirt-engine dbutils]# rpm -qi ovirt-engine-4.5.2.4-1.el8.noarch
Name        : ovirt-engine
Version     : 4.5.2.4
Release     : 1.el8
Architecture: noarch
Install Date: Mon Aug 22 08:17:41 2022
Group       : Virtualization/Management
Size        : 39473100
License     : ASL 2.0
Signature   : RSA/SHA256, Sun Aug 21 15:16:08 2022, Key ID ab8c4f9dfe590cb7
Source RPM  : ovirt-engine-4.5.2.4-1.el8.src.rpm

There is only one daemon.log per directory. Here is the archive with the daemon.log: https://cloud.ssis.sm/index.php/s/y6XxgH7CcrL5AC3

I will create the bug report referring to this thread. Thank you.

This is the bug report I filed: https://bugzilla.redhat.com/show_bug.cgi?id=2123008

Thanks Diego, I was able to reproduce it manually; it shouldn't be too difficult to fix.

On Wed, Aug 31, 2022 at 5:37 PM Diego Ercolani <diego.ercolani@ssis.sm> wrote:
This is the bug report I filed: https://bugzilla.redhat.com/show_bug.cgi?id=2123008

Thank you for the support; hoping this helps improve the resilience of the implementation.
participants (3)
- Arik Hadas
- Benny Zlotnik
- Diego Ercolani