On Feb 6, 2018 11:09 AM, "Nicolas Ecarnot" <nicolas@ecarnot.net> wrote:
Hello,

On our two 3.6 DCs, we're still facing qcow2 corruptions, even on freshly installed VMs (CentOS7, win2012, win2008...).

Please provide complete information on the issue: when it happens, how often, on which storage, etc.


(We are still hoping to find some time to migrate all this to 4.2, but it's a big job and our one-person team - me - is overwhelmed.)

Understood. Note that we have some scripts that can assist somewhat. 


My workaround is described in my previous thread below, but it's just a workaround.

Reading further, I found this:

https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/page-2

There are many things I don't know or understand, and I'd like your opinion:

- Is "virtio" a synonym for "virtio-blk"?

Yes. 

- Is it true that virtio-scsi is under active development while development of virtio has stopped?

No. 

- People in the proxmox forum seem to say that no qcow2 corruption occurs when using either IDE (not an option for me) or virtio-scsi.

Is that anecdotal evidence, or has it been properly reproduced?
Have they filed an issue?

Has anyone at Red Hat ever heard of this?

I'm not aware of an existing corruption issue. 

- Is converting all my VMs to use virtio-scsi a guarantee against further corruptions?

No. 

- Which driver do the oVirt devs unofficially recommend, in terms of future prospects, active development and stability?

Depends. I like virtio-scsi for its features (DISCARD mainly), but in some workloads virtio-blk may be somewhat faster (supposedly lower overhead). 
Both interfaces are stable. 

We should focus on properly reporting the issue so the qemu folks can look at this. 
Y. 


Regards,

--
Nicolas ECARNOT


On 15/09/2017 at 14:06, Nicolas Ecarnot wrote:
TL;DR:
How can image corruption be avoided?


Hello,

On two of our old 3.6 DCs, a recent series of VM migrations led to some issues:
- I put a host into maintenance mode
- most of the VMs migrate nicely
- one remaining VM never migrates, and the logs show:

* engine.log : "...VM has been paused due to I/O error..."
* vdsm.log : "...Improbable extension request for volume..."

After digging through the RH BZ tickets, I saved the day by:
- stopping the VM
- running lvchange -ay on the appropriate /dev/...
- qemu-img check [-r all] /rhev/blahblah
- lvchange -an...
- boot the VM
- enjoy!
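
For the record, the sequence above can be sketched roughly as follows. This is only an illustration under assumptions: the function names (check_commands, run_check) are made up for this example, and the LV and image paths are placeholders to be replaced with the real /dev/... and /rhev/... paths. The VM must already be stopped, and "-r all" is only added when repair is requested, matching the optional [-r all] above. By default nothing is executed; the commands are only printed.

```python
import subprocess


def check_commands(lv_path, image_path, repair=False):
    """Build the three commands of the workaround as argv lists.

    lv_path and image_path are placeholders supplied by the caller;
    nothing here discovers them automatically.
    """
    check_cmd = ["qemu-img", "check"]
    if repair:
        # Only modify the image when explicitly asked to.
        check_cmd += ["-r", "all"]
    check_cmd.append(image_path)
    return [
        ["lvchange", "-ay", lv_path],  # activate the logical volume
        check_cmd,                     # check (and optionally repair) the image
        ["lvchange", "-an", lv_path],  # deactivate it again
    ]


def run_check(lv_path, image_path, repair=False, dry_run=True):
    """Run the sequence, always deactivating the LV afterwards.

    With dry_run=True (the default) the commands are only printed and
    None is returned. Otherwise the exit code of qemu-img is returned;
    a nonzero code means the check did not come back clean.
    """
    activate, check_cmd, deactivate = check_commands(lv_path, image_path, repair)
    if dry_run:
        for cmd in (activate, check_cmd, deactivate):
            print(" ".join(cmd))
        return None
    subprocess.run(activate, check=True)
    try:
        return subprocess.run(check_cmd).returncode
    finally:
        # Deactivate even if qemu-img fails or is interrupted.
        subprocess.run(deactivate, check=True)
```

Running run_check(lv, image) first without repair and without dry_run=False shows what would be done; it seems safer to look at the plain check output before letting qemu-img rewrite anything.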

Yesterday this worked for a VM where only one error occurred on the qemu image, and the repair was easily done by qemu-img.

Today, facing the same issue on another VM, it failed because there were a great many errors, and also because of this message:

[...]
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device
[...]

The PV/VG/LV are far from full, so I don't know where to look.
I tried many ways to solve it, but I'm not at all comfortable with qemu images, their corruption and their repair, so I ended up exporting this VM (to an NFS export domain) and importing it into another DC. This had the side effect of running qemu-img convert from qcow2 to qcow2, which may (or may not?) have fixed some of the errors.
I also copied the image into another qcow2 file with qemu-img convert in the same way, and that also produced a clean qcow2 image without errors.
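
As a rough sketch of that copy step, assuming hypothetical source and destination paths (the function name convert_commands is invented for this example): the image is rewritten qcow2 to qcow2 with qemu-img convert, then the copy is checked. The function only builds the commands; whether the copy really preserves all guest data from a damaged source is not something this verifies.

```python
import subprocess


def convert_commands(src, dst):
    """Build the qcow2-to-qcow2 copy and the follow-up check as argv lists.

    src and dst are placeholder paths chosen by the caller.
    """
    return [
        # Read src as qcow2 and write a fresh qcow2 image to dst.
        ["qemu-img", "convert", "-f", "qcow2", "-O", "qcow2", src, dst],
        # Check the metadata of the new copy.
        ["qemu-img", "check", dst],
    ]


def run_convert(src, dst):
    """Run the copy, then return the exit code of the check on the copy."""
    convert_cmd, check_cmd = convert_commands(src, dst)
    subprocess.run(convert_cmd, check=True)
    return subprocess.run(check_cmd).returncode
```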

I saw that some VM migration bugs are fixed in 4.x, but that is not the point here.
I checked my SANs, my network layers, my blades, the OS (CentOS 7.2) of my hosts, but I see nothing special.

The real reason for my message is not to learn how to repair anything, but rather to understand what could have led to this situation.
Where should I keep a close eye?



--
Nicolas ECARNOT