qemu-kvm image corruption

15 Sep 2017

TL;DR: How to avoid image corruption?

Hello,

On two of our old 3.6 DCs, a recent series of VM migrations led to some 
issues:
- I put a host into maintenance mode
- most of the VMs migrate nicely
- one remaining VM never migrates, and the logs show:

* engine.log : "...VM has been paused due to I/O error..."
* vdsm.log : "...Improbable extension request for volume..."
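
To find every occurrence of these two symptoms, a small grep helper can be used. This is only a sketch: the search strings are fragments of the messages quoted above, and the function name is made up for illustration.

```shell
#!/bin/sh
# Print, with line numbers, the lines of a log file that mention either
# symptom. 'paused due to' and 'Improbable extension request' are
# fragments of the messages quoted in this mail, not exact log formats.
find_io_pause_events() {
    logfile="$1"
    grep -n -e 'paused due to' \
            -e 'Improbable extension request' "$logfile"
}
```

It would be called once per log, for example find_io_pause_events /var/log/vdsm/vdsm.log on the host (assuming the usual log location).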

After digging through the RH BZ tickets, I saved the day by:
- stopping the VM
- lvchange -ay on the appropriate /dev/...
- qemu-img check [-r all] /rhev/blahblah
- lvchange -an...
- booting the VM
- enjoy!
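
The steps above can be sketched as a small script. It assumes the VM is already stopped and that the caller knows which LV backs the disk; the function names and the single lv_path argument are illustrative, and the check is run directly on the LV path rather than through /rhev. With DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Activate the LV, check and repair the qcow2 image on it, then
# deactivate the LV again whatever the result of the check.
# lv_path is a hypothetical argument such as /dev/<vg>/<lv>.
repair_qcow2_lv() {
    lv_path="$1"
    run lvchange -ay "$lv_path" || return 1
    rc=0
    run qemu-img check -r all "$lv_path" || rc=$?
    run lvchange -an "$lv_path"
    return "$rc"
}
```

Running it first with DRY_RUN=1 shows the three commands without touching the storage.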

Yesterday this worked for a VM where only one error occurred on the qemu 
image, and the repair was easily done by qemu-img.

Today, facing the same issue on another VM, the repair failed because the 
errors were very numerous, and also because of this message:

[...]
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device
[...]

The PV/VG/LV are far from being full, so I don't know where to look.
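
One thing that may be worth comparing here, as an assumption rather than a diagnosis: on block storage the qcow2 image cannot grow past the end of its LV, so a refcount rebuild could run out of room even when the VG has plenty of free space. The helper below (a made-up name) only does the subtraction; the two inputs would come from qemu-img check --output=json ("image-end-offset") and from lvs.

```shell
#!/bin/sh
# Bytes left between the end of the qcow2 data and the end of the LV.
# image_end: "image-end-offset" from qemu-img check --output=json
# lv_size:   LV size in bytes, e.g. from
#            lvs --noheadings --nosuffix --units b -o lv_size <vg>/<lv>
lv_headroom() {
    image_end="$1"
    lv_size="$2"
    echo $((lv_size - image_end))
}
```

A result at or near zero would mean the repair has nowhere to write new refblocks unless the LV is extended first.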
I tried many ways to solve it, but I'm not comfortable at all with qemu 
images, their corruption and their repair, so I ended up exporting this 
VM (to an NFS export domain) and importing it into another DC. This had 
the side effect of running qemu-img convert from qcow2 to qcow2, which 
may (I am not sure) have solved some of the errors.
I also copied it into another qcow2 file with the same qemu-img convert 
method, and that too produced a clean qcow2 image without errors.
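
The manual copy can be sketched as follows. The function names and both paths are placeholders. The copy gets freshly written qcow2 metadata, which would explain why it checks clean, but that says nothing about whether the guest data read from damaged clusters is still correct. With DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy a qcow2 image into a new qcow2 file, then check the copy.
# src and dst are hypothetical paths chosen by the caller.
convert_qcow2_copy() {
    src="$1"
    dst="$2"
    run qemu-img convert -f qcow2 -O qcow2 "$src" "$dst" || return 1
    run qemu-img check "$dst"
}
```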

I saw that some VM migration bugs are fixed in 4.x, but that is not the 
point here.
I checked my SANs, my network layers, my blades, the OS (CentOS 7.2) of 
my hosts, but I see nothing special.

The real reason for my message is not to learn how to repair anything, 
but rather to understand what could have led to this situation.
Where should I keep a keen eye?

-- 
Nicolas ECARNOT
