Re: [ovirt-users] qemu-kvm images corruption

6 Feb 2018

Hello,

On our two 3.6 DCs, we're still facing qcow2 corruptions, even on 
freshly installed VMs (CentOS7, win2012, win2008...).

(We are still hoping to find some time to migrate all this to 4.2, but 
it's a big job and our one-person team, me, is overwhelmed.)

My workaround is described in my previous thread below, but it's just a 
workaround.

Reading further, I found this:

https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-d...

There are many things I don't know or understand, and I'd like your 
opinion:

- Is "virtio" a synonym for "virtio-blk"?
- Is it true that virtio-scsi is under active development while 
development of virtio has stopped?
- People in the Proxmox forum seem to say that no qcow2 corruption 
occurs when using IDE (not an option for me) or virtio-scsi. Have any 
Red Hat people ever heard of this?
- Is converting all my VMs to virtio-scsi a guarantee against 
further corruption?
- Which driver, even if only unofficially, do the oVirt devs recommend 
in terms of future development and stability?

Regards,

-- 
Nicolas ECARNOT

On 15/09/2017 at 14:06, Nicolas Ecarnot wrote:
...
TL;DR:
How to avoid image corruption?
Hello,
On two of our old 3.6 DCs, a recent series of VM migrations led to some 
issues:
- I put a host into maintenance mode
- most of the VMs migrate nicely
- one remaining VM never migrates, and the logs show:
* engine.log : "...VM has been paused due to I/O error..."
* vdsm.log : "...Improbable extension request for volume..."
After digging through the RH BZ tickets, I saved the day by:
- stopping the VM
- lvchange -ay the relevant /dev/...
- qemu-img check [-r all] /rhev/blahblah
- lvchange -an...
- boot the VM
- enjoy!
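
For reference, the sequence above can be sketched as a small shell function. This is only a sketch under assumptions: the function names, the DRY_RUN switch and the two arguments are hypothetical, and the real LV and image paths (elided above) must be supplied by whoever runs it. By default it only prints the commands; the VM must already be stopped before running it for real.

```shell
#!/bin/sh
# Sketch only: run() and repair_image() are hypothetical helper names.
# With DRY_RUN=1 (the default) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# repair_image LV_PATH IMAGE_PATH [check|repair]
#   LV_PATH    : the logical volume backing the disk (placeholder)
#   IMAGE_PATH : the path of the qcow2 image to check (placeholder)
#   mode       : "check" (read-only, default) or "repair" (adds -r all)
repair_image() {
    lv_path=$1
    image_path=$2
    mode=${3:-check}

    # Activate the LV so the image becomes readable on this host.
    run lvchange -ay "$lv_path" || return 1

    if [ "$mode" = "repair" ]; then
        run qemu-img check -r all "$image_path"
    else
        run qemu-img check "$image_path"
    fi
    status=$?

    # Always deactivate the LV again, even if the check failed.
    run lvchange -an "$lv_path"
    return "$status"
}

# Usage (placeholders, to be replaced with the real paths):
#   repair_image "$LV_PATH" "$IMAGE_PATH"          # read-only check
#   DRY_RUN=0 repair_image "$LV_PATH" "$IMAGE_PATH" repair
```

Running the read-only check first shows how many errors there are before letting qemu-img modify anything.
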
Yesterday this worked for a VM where only one error occurred on the qemu 
image, and the repair was easily done by qemu-img.
Today, facing the same issue on another VM, it failed because the errors 
were very numerous, and also because of this message:
[...]
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device
[...]
The PV/VG/LV are far from being full, so I don't know where to look.
I tried many ways to solve it, but I'm not comfortable at all with qemu 
images, their corruption and their repair, so I ended up exporting this 
VM (to an NFS export domain) and importing it into another DC. This had 
the side effect of running qemu-img convert from qcow2 to qcow2, which 
may have fixed some errors (I am not sure).
I also copied it into another qcow2 file with the same qemu-img convert 
method, and this also produced a clean qcow2 image without errors.
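
The qcow2-to-qcow2 copy mentioned above can be sketched like this. Again, this is a sketch under assumptions: the helper names and the DRY_RUN switch are hypothetical, and the source and destination paths are placeholders. It copies the image with qemu-img convert, then runs a read-only check on the copy.

```shell
#!/bin/sh
# Sketch only: run() and copy_image() are hypothetical helper names.
# With DRY_RUN=1 (the default) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# copy_image SRC DST
#   SRC : existing qcow2 image (placeholder path)
#   DST : new qcow2 file to create (placeholder path)
copy_image() {
    src=$1
    dst=$2

    # -f is the input format, -O the output format.
    run qemu-img convert -f qcow2 -O qcow2 "$src" "$dst" || return 1

    # Read-only consistency check of the new copy.
    run qemu-img check "$dst"
}

# Usage (placeholders, to be replaced with the real paths):
#   DRY_RUN=0 copy_image "$SRC_IMAGE" "$DST_IMAGE"
```

The copy leaves the original image untouched, so it can still be inspected afterwards.
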
I saw that some VM migration bugs are fixed in 4.x, but that is not the 
point here.
I checked my SANs, my network layers, my blades, the OS (CentOS 7.2) of 
my hosts, but I see nothing special.
The real reason behind my message is not to learn how to repair anything, 
but rather to understand what could have led to this situation.
Where should I keep a keen eye?
-- 
Nicolas ECARNOT