<div dir="auto"><div><br><div class="gmail_extra"><br><div class="gmail_quote">On Feb 7, 2018 7:08 PM, "Nicolas Ecarnot" <<a href="mailto:nicolas@ecarnot.net">nicolas@ecarnot.net</a>> wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello,<br>
<br>
TL;DR: qcow2 images keep getting corrupted. Any workaround?<br>
<br>
Long version:<br>
I have already started this discussion on the oVirt and qemu-block mailing lists under similar circumstances, but I have learned more in the months since, so here is some updated information:<br>
<br>
- We are using 2 oVirt 3.6.7.5-1.el7.centos datacenters, using CentOS 7.{2,3} hosts<br>
- Hosts:<br>
  - CentOS 7.2 1511:<br>
    - Kernel: 3.10.0-327<br>
    - KVM: 2.3.0-31<br>
    - libvirt: 1.2.17<br>
    - vdsm: 4.17.32-1<br>
  - CentOS 7.3 1611:<br>
    - Kernel: 3.10.0-514<br>
    - KVM: 2.3.0-31<br>
    - libvirt: 2.0.0-10<br>
    - vdsm: 4.17.32-1<br></blockquote></div></div></div><div dir="auto"><br></div><div dir="auto">All are somewhat old releases. I suggest upgrading to the latest RHEL and qemu-kvm bits. </div><div dir="auto"><br></div><div dir="auto">Later on, upgrade oVirt. </div><div dir="auto">Y. </div><div dir="auto"><br></div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
- Our storage is 2 Equallogic SANs connected via iSCSI on a dedicated network<br>
- The numbers vary from week to week, but all in all there are around 32 hosts, 8 storage domains and, for various reasons, very few VMs (fewer than 200).<br>
- One peculiar point is that most of our VMs are given an additional dedicated network interface that is iSCSI-connected to some volumes of our SAN; these volumes are not part of the oVirt setup. That could generate a lot of additional iSCSI traffic.<br>
<br>
From time to time, a random VM appears paused by oVirt.<br>
Digging into the oVirt engine logs, then into the host vdsm logs, it appears that the host considers the qcow2 image corrupted.<br>
In what I regard as conservative behavior, vdsm stops any interaction with this image and marks the VM as paused.<br>
Any attempt to unpause it leads to the same conservative pause.<br>
<br>
After finding the right logical volume hosting the qcow2 image (<a href="https://access.redhat.com/solutions/1173623" rel="noreferrer" target="_blank">https://access.redhat.com/sol<wbr>utions/1173623</a>), I can run qemu-img check on it.<br>
- On 80% of my VMs, I find no errors.<br>
- On 15% of them, I find leaked-cluster errors that I can correct using "qemu-img check -r all".<br>
- On 5% of them, I find leaked-cluster errors plus further fatal errors, which qemu-img cannot correct.<br>
In rare cases, qemu-img manages to repair them but destroys large parts of the image (it becomes unusable), and in the other cases it cannot repair them at all.<br>
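For the record, the check-and-repair sequence looks roughly like this. It is a hedged sketch run against a scratch image; against a real VM the argument would be the LV device found above, and the VM must be shut down first:

```shell
# Illustrative only: exercise qemu-img check on a scratch image.
# Against a real VM, point these commands at the LV device instead.
qemu-img create -f qcow2 scratch.qcow2 64M

# Read-only pass: reports leaked clusters and any fatal corruption.
qemu-img check scratch.qcow2

# Repair only leaked clusters -- the benign case seen on ~15% of the VMs.
qemu-img check -r leaks scratch.qcow2

# Last resort for the fatal cases; it can destroy data, so copy the
# volume first (e.g. with dd) before attempting it:
# qemu-img check -r all scratch.qcow2

rm -f scratch.qcow2
```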
<br>
Months ago, I sent a similar message, but the error then was "No space left on device" (<a href="https://www.mail-archive.com/qemu-block@gnu.org/msg00110.html" rel="noreferrer" target="_blank">https://www.mail-archive.com/<wbr>qemu-block@gnu.org/msg00110.ht<wbr>ml</a>).<br>
<br>
This time, I don't have this message about space, but only corruption.<br>
<br>
I kept reading and found similar discussions on the oVirt list and in the Proxmox forum:<br>
<a href="https://lists.ovirt.org/pipermail/users/2018-February/086750.html" rel="noreferrer" target="_blank">https://lists.ovirt.org/piperm<wbr>ail/users/2018-February/086750<wbr>.html</a><br>
<br>
<a href="https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/page-2" rel="noreferrer" target="_blank">https://forum.proxmox.com/thre<wbr>ads/qcow2-corruption-after-<wbr>snapshot-or-heavy-disk-i-o.<wbr>32865/page-2</a><br>
<br>
What these reports have in common with my case is:<br>
- usage of qcow2<br>
- heavy disk I/O<br>
- using the virtio-blk driver<br>
<br>
In the Proxmox thread, they tend to say that switching to virtio-scsi is the solution. I asked oVirt experts about this (<a href="https://lists.ovirt.org/pipermail/users/2018-February/086753.html" rel="noreferrer" target="_blank">https://lists.ovirt.org/piper<wbr>mail/users/2018-February/08675<wbr>3.html</a>), but it is not clear that the driver is to blame.<br>
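For reference, the difference between the two drivers boils down to the disk bus in the libvirt domain XML. This is only an illustrative sketch; the device paths and target names are placeholders, and the virtio-scsi variant additionally needs a SCSI controller:

```xml
<!-- virtio-blk: the driver implicated in the Proxmox thread -->
<disk type='block' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source dev='/dev/vg/lv'/>
  <target dev='vda' bus='virtio'/>
</disk>

<!-- virtio-scsi: the alternative suggested there -->
<controller type='scsi' model='virtio-scsi'/>
<disk type='block' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source dev='/dev/vg/lv'/>
  <target dev='sda' bus='scsi'/>
</disk>
```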
<br>
I agree with the answer Yaniv Kaul gave me, saying that I have to report the issue properly, so I'm eager to know what specific information I can provide now.<br>
<br>
As you can imagine, all this setup is in production, and for most of the VMs I cannot "play" with them. Moreover, we have launched a campaign of stopping every VM nightly, running qemu-img check on them one by one, then booting them again.<br>
So it may take some time before I find another corrupted image<br>
(which I will carefully preserve for debugging).<br>
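A minimal sketch of what such a nightly sweep can look like, assuming the guests are already shut down and that $LVS holds the real LV paths (the ones below are placeholders):

```shell
# Hypothetical nightly sweep: run qemu-img check on each image LV and
# log the ones it flags, so a corrupted image can be preserved for debug.
LVS="/dev/vg/vm1 /dev/vg/vm2"   # placeholder list of image LVs

for lv in $LVS; do
  if qemu-img check "$lv" >/dev/null 2>&1; then
    echo "OK  $lv"
  else
    echo "BAD $lv"   # non-zero exit: leaks, corruption, or unreadable
  fi
done
```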
<br>
One more piece of information: we very rarely take snapshots, but I can well imagine that automated migrations of VMs could trigger similar behavior on qcow2 images.<br>
<br>
Last point about the versions we use: yes, they are old; yes, we are planning to upgrade, but we don't know when.<br>
<br>
Regards,<font color="#888888"><br>
<br>
-- <br>
Nicolas ECARNOT<br>
______________________________<wbr>_________________<br>
Users mailing list<br>
<a href="mailto:Users@ovirt.org" target="_blank">Users@ovirt.org</a><br>
<a href="http://lists.ovirt.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://lists.ovirt.org/mailman<wbr>/listinfo/users</a><br>
</font></blockquote></div><br></div></div></div>