Hi,
I solved my problem. Here are the steps, but be careful: if you don't
know what these commands do and how to restore from a backup, don't
follow this:
- ssh to the host
- systemctl stop ovirt-engine
- backup the database with "engine-backup"
- navigate to the image files
- back up the images: sudo -u vdsm rsync -av <uuid> <uuid_backup>
- check which one is the backing file: qemu-img info <file>
- check for damage: qemu-img check <file>
- qemu-img commit <snapshot file>
- rename the <snapshot file> plus its .lease and .meta files so they
can't be accessed anymore (see the shell sketch after the steps)
- vmname=srv03
- db=engine
- sudo -u postgres psql $db -c "SELECT b.disk_alias, s.description,
s.snapshot_id, i.creation_date, s.status, i.imagestatus, i.size,
i.image_group_id, i.vm_snapshot_id, i.image_guid, i.parentid, i.active
FROM images as i JOIN snapshots AS s ON (i.vm_snapshot_id =
s.snapshot_id) LEFT JOIN vm_static AS v ON (s.vm_id = v.vm_guid) JOIN
base_disks AS b ON (i.image_group_id = b.disk_id) WHERE v.vm_name =
'$vmname' ORDER BY creation_date, description, disk_alias"
- note the image_guid and parentid of the broken snapshot and of the
active snapshot; the active one is the image_guid whose parentid is
00000000-0000-0000-0000-000000000000
- igid_active=<active uuid>
- igid_broken=<broken uuid>
- the parentid of the broken snapshot's image_guid must be the same as
the active snapshot's image_guid
- note the snapshot ids
- sid_active=<id of the active snapshot with parentid 000000...>
- sid_broken=<id of the broken snapshot>
- delete the broken snapshot (the psql sketch after the steps collects
the following statements into a single transaction)
- sudo -u postgres psql $db -c "DELETE FROM snapshots AS s WHERE
s.snapshot_id = '$sid_broken'"
- pid_new=00000000-0000-0000-0000-000000000000
- sudo -u postgres psql $db -c "SELECT * FROM images WHERE
vm_snapshot_id = '$sid_active' AND image_guid = '$igid_broken'"
- sudo -u postgres psql $db -c "DELETE FROM images WHERE vm_snapshot_id
= '$sid_broken' AND image_guid = '$igid_active'"
- sudo -u postgres psql $db -c "SELECT * FROM image_storage_domain_map
WHERE image_id = '$igid_broken'"
- sudo -u postgres psql $db -c "DELETE FROM image_storage_domain_map
WHERE image_id = '$igid_broken'"
- sudo -u postgres psql $db -c "UPDATE images SET image_guid =
'$igid_active', parentid = '$pid_new' WHERE vm_snapshot_id =
'$sid_active' AND image_guid = '$igid_broken'"
- sudo -u postgres psql $db -c "SELECT * FROM image_storage_domain_map"
- storid=<storage_domain_id>
- diskprofileid=<disk_profile_id>
- sudo -u postgres psql $db -c "INSERT INTO image_storage_domain_map
(image_id, storage_domain_id, disk_profile_id) VALUES ('$igid_broken',
'$stor_id', '$diskprofileid')"
- check values
- sudo -u postgres psql $db -c "SELECT b.disk_alias, s.description,
s.snapshot_id, i.creation_date, s.status, i.imagestatus, i.size,
i.image_group_id, i.vm_snapshot_id, i.image_guid, i.parentid, i.active
FROM images as i JOIN snapshots AS s ON (i.vm_snapshot_id =
s.snapshot_id) LEFT JOIN vm_static AS v ON (s.vm_id = v.vm_guid) JOIN
base_disks AS b ON (i.image_group_id = b.disk_id) WHERE v.vm_name =
'$vmname' ORDER BY creation_date, description, disk_alias"
- check for errors
- engine-setup --offline
- systemctl start ovirt-engine
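
For reference, here is the qemu-img part condensed into a small shell
sketch. It is only a sketch of what I did, not a script I re-ran
afterwards: the /rhev/data-center/... path and all <...> names are
placeholders you have to adapt, and the engine-backup options are the
usual ones (check "engine-backup --help" on your version).

# stop the engine so nothing touches the images while you work
systemctl stop ovirt-engine
engine-backup --mode=backup --file=engine.backup --log=engine-backup.log

# go to the disk's image directory on the storage (placeholder path)
cd /rhev/data-center/<sp_uuid>/<sd_uuid>/images/<image_group_uuid>

# copy the volumes before touching anything
sudo -u vdsm rsync -av <image_uuid> <image_uuid>_backup

qemu-img info <snapshot_file>    # shows the backing file (the parent volume)
qemu-img check <snapshot_file>   # look for corruption before committing
qemu-img commit <snapshot_file>  # merge the snapshot layer into its backing file

# rename the merged layer plus its lease/meta so it can't be opened anymore
mv <snapshot_file> <snapshot_file>.old
mv <snapshot_file>.lease <snapshot_file>.lease.old
mv <snapshot_file>.meta <snapshot_file>.meta.old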
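
And the database part wrapped into a single psql transaction, so the
deletes, the update and the insert either all apply or none of them do.
Again only a sketch: it just repeats the statements from the steps above
(the SELECT sanity checks are left out) with the values you noted; I did
not re-run it in exactly this form.

db=engine
igid_active=<active uuid>
igid_broken=<broken uuid>
sid_active=<id of the active snapshot>
sid_broken=<id of the broken snapshot>
pid_new=00000000-0000-0000-0000-000000000000
storid=<storage_domain_id>
diskprofileid=<disk_profile_id>

sudo -u postgres psql $db <<EOF
BEGIN;
DELETE FROM snapshots WHERE snapshot_id = '$sid_broken';
DELETE FROM images WHERE vm_snapshot_id = '$sid_broken'
    AND image_guid = '$igid_active';
DELETE FROM image_storage_domain_map WHERE image_id = '$igid_broken';
UPDATE images SET image_guid = '$igid_active', parentid = '$pid_new'
    WHERE vm_snapshot_id = '$sid_active' AND image_guid = '$igid_broken';
INSERT INTO image_storage_domain_map (image_id, storage_domain_id, disk_profile_id)
    VALUES ('$igid_broken', '$storid', '$diskprofileid');
COMMIT;
EOF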
Now you should have a clean state and a working VM ;-)
What was tested:
- Power up and down the VM
What does not work:
- It's not possible to take offline snapshots; online snapshots were not
tested because I won't risk getting into that kind of trouble again. It
took many hours until the machine was up again.
PLEASE be aware and don't destroy your Host and VM !!!
cheers
gregor
On 12/06/16 13:40, Colin Coe wrote:
We've seen this with both Linux and Windows VMs. I'm
guessing that
you've had failures on this VM in both snapshot create and delete
operations. oVirt/RHEV 3.5 seems particularly affected. I'm told that
oVirt 3.6.7 has the last of the fixes for these known snapshot problems.
My original email was worded wrong. I meant that qemu-img gives
"backing filename too long" errors. You may have seen this in your logs.
Note also that you may be seeing an entirely unrelated problem.
You may wish to post your VDSM logs and the qemu log from
/var/lib/libvirt/qemu/<vm_name>.log
Hope this helps
CC
On Sun, Jun 12, 2016 at 4:45 PM, gregor <gregor_forum@catrix.at> wrote:
Sounds bad. Recreating the VM is not an option because this is a
production VM; during testing I would need to recreate it more than
once. oVirt works perfectly with Linux VMs, but when it comes to
Windows VMs we get lots of problems.
Which OS did you use on the problematic VM?
cheers
gregor
On 11/06/16 19:22, Anantha Raghava wrote:
> Hi,
>
> Even I observed this behaviour.
>
> When we take a snapshot, the main VM from which the snapshot was
> taken is shut down and a new VM named external-<VMName> comes to life.
> We cannot get the original VM back to life, but a clone starts
> functioning.
>
> We cannot remove the snapshot whether or not the VM is running. I had
> to remove the entire VM that came to life with the snapshot and
> recreate the entire VM from scratch. Luckily the VM was not yet in
> production, hence we could afford it.
>
> At first I could not understand why, when a snapshot is created, the
> VM with the snapshot comes to life and starts running instead of the
> original VM.
>
> Is it necessary that we shut down the VM before taking snapshots?
> A snapshot is supposed to be a backup of the original VM which, unless
> we restore it by cloning, should not come to life, as I understand it.
>
> --
>
> Thanks & Regards,
>
> Anantha Raghava
>
>
> On Saturday 11 June 2016 08:09 PM, gregor wrote:
>> Hi,
>>
>> a VM has snapshots which cannot be removed while the VM is up.
>> Therefore I powered down the Windows Server 2012 VM. The snapshots
>> still cannot be removed and the VM can't boot anymore!!!
>>
>> This is the message from engine.log
>>
>> ------------------
>> Message: VM srv03 is down with error. Exit message: Bad volume
specification
>> ------------------
>>
>> Cloning is not possible, I get:
>> ------------------
>> Message: VDSM command failed: Image is not a legal chain
>> ------------------
>>
>> All other VMs can be powered down and started without any problem.
>> What can I do?
>> This is very important because now no one can work :-( !!!
>>
>> cheers
>> gregor
>