[ovirt-users] Can't remove snapshot

Adam Litke alitke at redhat.com
Wed Feb 17 16:29:01 UTC 2016


On 17/02/16 11:14 -0500, Greg Padgett wrote:
>On 02/17/2016 03:42 AM, Rik Theys wrote:
>>Hi,
>>
>>On 02/16/2016 10:52 PM, Greg Padgett wrote:
>>>On 02/16/2016 08:50 AM, Rik Theys wrote:
>>>>Hi,
>>>>
>>>>I'm trying to determine the correct "bad_img" uuid in my case.
>>>>
>>>>The VM has two snapshots:
>>>>
>>>>* The "Active VM" snapshot which has a disk that has an actual size
>>>>that's 5GB larger than the virtual size. It has a creation date that
>>>>matches the timestamp at which I created the second snapshot. The "disk
>>>>snapshot id" for this snapshot ends with cd39.
>>>>
>>>>* A "before jessie upgrade" snapshot that has status "illegal". It has
>>>>an actual size that's 2GB larger than the virtual size. The creation
>>>>date matches the date the VM was initially created. The disk snapshot id
>>>>ends with 6249.
>>>>
>>>>  From the above I conclude that the disk with id that ends with 6249 is
>>>>the "bad" img I need to specify.
>>>
>>>Similar to what I wrote to Marcelo above in the thread, I'd recommend
>>>running the "VM disk info gathering tool" attached to [1].  It's the
>>>best way to ensure the merge was completed and determine which image is
>>>the "bad" one that is no longer in use by any volume chains.
>>
>>I've run the disk info gathering tool and this is its output (for the
>>affected VM):
>>
>>VM lena
>>     Disk b2390535-744f-4c02-bdc8-5a897226554b
>>(sd:a7ba2db3-517c-408a-8b27-ea45989d6416)
>>     Volumes:
>>         24d78600-22f4-44f7-987b-fbd866736249
>>
>>The id of the volume is the ID of the snapshot that is marked "illegal".
>>So the "bad" image would be the cd39 one, which according to the UI is
>>in use by the "Active VM" snapshot. Does this make sense?
>
>It looks accurate.  Live merges are "backwards" merges, so the merge 
>would have pushed data from the volume associated with "Active VM" 
>into the volume associated with the snapshot you're trying to remove.
>
>Upon completion, we "pivot" so that the VM uses that older volume, and 
>we update the engine database to reflect this (basically we 
>re-associate that older volume with, in your case, "Active VM").
>
>In your case, it seems the pivot operation was done, but the database 
>wasn't updated to reflect it.  Given snapshot/image associations e.g.:
>
>  VM Name  Snapshot Name  Volume
>  -------  -------------  ------
>  My-VM    Active VM      123-abc
>  My-VM    My-Snapshot    789-def
>
>My-VM in your case is actually running on volume 789-def.  If you run 
>the db fixup script and supply ("My-VM", "My-Snapshot", "123-abc") 
>(note the volume is the newer, "bad" one), then it will switch the 
>volume association for you and remove the invalid entries.
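
For what it's worth, you can eyeball the current association with the
same joins as the query from earlier in the thread, before and after
running the fixup script.  A rough sketch, assuming the default
'engine' db name -- adjust for your setup:

  # run on the engine host
  su - postgres -c "psql -d engine -c \"
      select s.description, i.image_guid, i.imagestatus, i.active
        from images i
        join snapshots s on i.vm_snapshot_id = s.snapshot_id
        join vm_static v on s.vm_id = v.vm_guid
       where v.vm_name = 'My-VM';\""
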
>
>Of course, I'd shut down the VM, and back up the db beforehand.
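
+1 on the backup.  If you don't have engine-backup handy, a plain
pg_dump of the engine db is enough to roll back if the fixup goes
sideways; roughly (again assuming the default 'engine' db name):

  # on the engine host
  service ovirt-engine stop
  su - postgres -c "pg_dump -F c -f /var/tmp/engine-pre-fixup.dump engine"
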
>
>I'm not terribly familiar with how vdsm handles block storage, but I'd
>imagine you could then e.g. `lvchange -an` the bad volume's LV, start
>the VM, and verify that the data is current without having the 
>to-be-removed volume active, just to make sure everything lines up 
>before running the vdsClient verb to remove the volume.

vdsm will reactivate the LV when starting the VM, so this check will
not work.  The vm-disk-info.py script uses libvirt to check which
volumes the VM actually has open, so if the 'bad' volume is not listed,
then it is not being used and is safe to remove.
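
The script talks to libvirt through its API, but you can get roughly
the same picture from the shell on the host that is running the VM; a
read-only connection needs no SASL credentials:

  # 'lena' is the VM/domain name from the output above
  virsh -r domblklist lena
  # or look at the disk definitions directly:
  virsh -r dumpxml lena | grep -A2 '<disk '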

>
>>Both the "Active VM" and the defective snapshot have an actual size
>>that's bigger than the virtual size of the disk. When I remove the bad
>>disk image/snapshot, will the actual size of the "Active VM" snapshot
>>return to the virtual size of the disk? What's currently stored in the
>>"Active VM" snapshot?
>
>"Active VM" should now be unused; it previously (pre-merge) held the
>data written since the snapshot was taken.  Normally the larger actual
>size might be from qcow format overhead.  If your listing above is
>complete (i.e. one volume for the VM), then I'm not sure why the base
>volume would have a larger actual size than its virtual size.
>
>Adam, Nir--any thoughts on this?

There is a bug which has caused inflation of the snapshot volumes when
performing a live merge.  We are submitting fixes for 3.5, 3.6, and
master right at this moment.
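
If you want to see how badly a given volume was inflated, you can
compare the LV size against the volume's virtual size.  A rough sketch
for block storage -- the LV has to be active, so only poke at volumes
that are not in use by a running VM (<vg>/<lv> are placeholders for
your storage domain VG and volume UUIDs):

  lvs --units g <vg>/<lv>            # allocated ("actual") size
  qemu-img info /dev/<vg>/<lv>       # reports the virtual size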

>
>>Would cloning the VM (and removing the original VM afterwards) work as
>>an alternate way to clean this up? Or will the clone operation also
>>clone the snapshots?
>
>It would try to clone everything in the engine db, so no luck there.
>
>>Regards,
>>
>>Rik
>>
>>>If indeed the "bad" image (whichever one it is) is no longer in use,
>>>then it's possible the image wasn't successfully removed from storage.
>>>There are 2 ways to fix this:
>>>
>>>   a) Run the db fixup script to remove the records for the merged image,
>>>      and run the vdsm command by hand to remove it from storage.
>>>   b) Adjust the db records so a merge retry would start at the right
>>>      place, and re-run live merge.
>>>
>>>Given that your merge retries were failing, option a) seems most likely
>>>to succeed.  The db fixup script is attached to [1]; as parameters you
>>>would need to provide the vm name, snapshot name, and the id of the
>>>unused image as verified by the disk info tool.
>>>
>>>To remove the stale LV, the vdsm deleteVolume verb would then be run
>>>from `vdsClient` -- but note that this must be run _on the SPM host_.
>>>It will not only perform lvremove, but also do housekeeping on other
>>>storage metadata to keep everything consistent.  For this verb I believe
>>>you'll need to supply not only the unused image id, but also the pool,
>>>domain, and image group ids from your database queries.
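
For reference, the general shape of that call is below, but please
double-check the exact argument order with `vdsClient -s 0 help
deleteVolume` on your SPM host before running anything (the UUIDs are
placeholders for the domain, pool, image group and volume ids from the
queries above):

  vdsClient -s 0 deleteVolume <sdUUID> <spUUID> <imgUUID> <volUUID>
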
>>>
>>>I hope that helps.
>>>
>>>Greg
>>>
>>>[1] https://bugzilla.redhat.com/show_bug.cgi?id=1306741
>>>
>>>>
>>>>However, I grepped the output from 'lvs' on the SPM host of the cluster
>>>>and both disk IDs are returned:
>>>>
>>>>[root@amazone ~]# lvs | egrep 'cd39|6249'
>>>>  24d78600-22f4-44f7-987b-fbd866736249 a7ba2db3-517c-408a-8b27-ea45989d6416 -wi-ao---- 34.00g
>>>>  81458622-aa54-4f2f-b6d8-75e7db36cd39 a7ba2db3-517c-408a-8b27-ea45989d6416 -wi-------  5.00g
>>>>
>>>>
>>>>I expected the "bad" img would no longer be found?
>>>>
>>>>The SQL script only cleans up the database and not the logical volumes.
>>>>Would running the script not keep a stale LV around?
>>>>
>>>>Also, from the lvs output it seems the "bad" disk is bigger than the
>>>>"good" one.
>>>>
>>>>Is it possible the snapshot still needs to be merged?? If so, how can I
>>>>initiate that?
>>>>
>>>>Regards,
>>>>
>>>>Rik
>>>>
>>>>
>>>>On 02/16/2016 02:02 PM, Rik Theys wrote:
>>>>>Hi Greg,
>>>>>
>>>>>>
>>>>>>2016-02-09 21:30 GMT-03:00 Greg Padgett <gpadgett at redhat.com>:
>>>>>>>On 02/09/2016 06:08 AM, Michal Skrivanek wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>>On 03 Feb 2016, at 10:37, Rik Theys <Rik.Theys at esat.kuleuven.be>
>>>>>>>>>wrote:
>>>>>
>>>>>>>>>>I can see the snapshot in the "Disk snapshot" tab of the
>>>>>>>>>>storage. It has
>>>>>>>>>>a status of "illegal". Is it OK to (try to) remove this
>>>>>>>>>>snapshot? Will
>>>>>>>>>>this impact the running VM and/or disk image?
>>>>>>>>
>>>>>>>>
>>>>>>>>No, it’s not OK to remove it while a live merge is (apparently)
>>>>>>>>still ongoing.
>>>>>>>>I guess that’s a live merge bug?
>>>>>>>
>>>>>>>
>>>>>>>Indeed, this is bug 1302215.
>>>>>>>
>>>>>>>I wrote a sql script to help with cleanup in this scenario, which
>>>>>>>you can
>>>>>>>find attached to the bug along with a description of how to use it[1].
>>>>>>>
>>>>>>>However, Rik, before trying that, would you be able to run the
>>>>>>>attached
>>>>>>>script [2] (or just the db query within) and forward the output to
>>>>>>>me? I'd
>>>>>>>like to make sure everything looks as it should before modifying
>>>>>>>the db
>>>>>>>directly.
>>>>>
>>>>>I ran the following query on the engine database:
>>>>>
>>>>>select images.* from images join snapshots ON (images.vm_snapshot_id =
>>>>>snapshots.snapshot_id)
>>>>>join vm_static on (snapshots.vm_id = vm_static.vm_guid)
>>>>>where vm_static.vm_name = 'lena' and snapshots.description='before
>>>>>jessie upgrade';
>>>>>
>>>>>The resulting output is:
>>>>>
>>>>> image_guid             | 24d78600-22f4-44f7-987b-fbd866736249
>>>>> creation_date          | 2015-05-19 15:00:13+02
>>>>> size                   | 34359738368
>>>>> it_guid                | 00000000-0000-0000-0000-000000000000
>>>>> parentid               | 00000000-0000-0000-0000-000000000000
>>>>> imagestatus            | 4
>>>>> lastmodified           | 2016-01-30 08:45:59.998+01
>>>>> vm_snapshot_id         | 4b4930ed-b52d-47ec-8506-245b7f144102
>>>>> volume_type            | 1
>>>>> volume_format          | 5
>>>>> image_group_id         | b2390535-744f-4c02-bdc8-5a897226554b
>>>>> _create_date           | 2015-05-19 15:00:11.864425+02
>>>>> _update_date           | 2016-01-30 08:45:59.999422+01
>>>>> active                 | f
>>>>> volume_classification  | 1
>>>>>(1 row)
>>>>>
>>>>>Regards,
>>>>>
>>>>>Rik
>>>>>
>>>>>
>>>>>>>
>>>>>>>Thanks,
>>>>>>>Greg
>>>>>>>
>>>>>>>[1] https://bugzilla.redhat.com/show_bug.cgi?id=1302215#c13
>>>>>>>(Also note that the engine should be stopped before running this.)
>>>>>>>
>>>>>>>[2] Arguments are the ovirt db name, db user, and the name of the
>>>>>>>vm you
>>>>>>>were performing live merge on.
>>>>>>>
>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>michal
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Regards,
>>>>>>>>>
>>>>>>>>>Rik
>>>>>>>>>
>>>>>>>>>On 02/03/2016 10:26 AM, Rik Theys wrote:
>>>>>>>>>>
>>>>>>>>>>Hi,
>>>>>>>>>>
>>>>>>>>>>I created a snapshot of a running VM prior to an OS upgrade. The OS
>>>>>>>>>>upgrade has now been successful and I would like to remove the
>>>>>>>>>>snapshot.
>>>>>>>>>>I've selected the snapshot in the UI and clicked Delete to start
>>>>>>>>>>the
>>>>>>>>>>task.
>>>>>>>>>>
>>>>>>>>>>After a few minutes, the task failed. When I click Delete again on
>>>>>>>>>>the same snapshot, the failure message is returned after a few
>>>>>>>>>>seconds.
>>>>>>>>>>
>>>>>>>>>>From browsing through the engine log (attached) it seems the
>>>>>>>>>>snapshot was correctly merged on the first try but something went
>>>>>>>>>>wrong in the finalizing phase. On retries, the log indicates the
>>>>>>>>>>snapshot/disk image no longer exists and the removal of the
>>>>>>>>>>snapshot fails for this reason.
>>>>>>>>>>
>>>>>>>>>>Is there any way to clean up this snapshot?
>>>>>>>>>>
>>>>>>>>>>I can see the snapshot in the "Disk snapshot" tab of the
>>>>>>>>>>storage. It has
>>>>>>>>>>a status of "illegal". Is it OK to (try to) remove this
>>>>>>>>>>snapshot? Will
>>>>>>>>>>this impact the running VM and/or disk image?
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>

-- 
Adam Litke


