[ovirt-users] One RHEV Virtual Machine does not Automatically Resume following Compellent SAN Controller Failover

Simone Tiraboschi stirabos at redhat.com
Wed Nov 25 11:15:45 UTC 2015


Adding Nir, who knows this far better than I do.

On Mon, Nov 23, 2015 at 8:37 PM, Duckworth, Douglas C <duckd at tulane.edu>
wrote:

> Hello --
>
> Not sure if y'all can help with this issue we've been seeing with RHEV...
>
> On 11/13/2015, during a code upgrade of the Compellent SAN at our Disaster
> Recovery site, we failed over to the secondary SAN controller.  Most virtual
> machines in our DR cluster resumed automatically after pausing, except VM
> "BADVM" on host "BADHOST."
>
> In engine.log you can see that BADVM was put into the "VM_PAUSED_EIO" state
> at 10:47:57:
>
> "VM BADVM has paused due to storage I/O problem."
>
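> For reference, here is a quick sketch for tallying those pause/resume
> audit messages per VM from engine.log (the pause text is verbatim from the
> line above; the resume text "has recovered from paused back to up" is what
> we believe the engine logs, so treat that pattern as an assumption):
>
>   import re
>   import sys
>   from collections import Counter
>
>   # Audit messages written by the engine; the pause text matches the
>   # excerpt above, the resume text is assumed.
>   PAUSE = re.compile(r"VM (\S+) has paused due to storage I/O problem")
>   RESUME = re.compile(r"VM (\S+) has recovered from paused back to up")
>
>   paused, resumed = Counter(), Counter()
>   with open(sys.argv[1]) as log:  # e.g. /var/log/ovirt-engine/engine.log
>       for line in log:
>           for pattern, counter in ((PAUSE, paused), (RESUME, resumed)):
>               match = pattern.search(line)
>               if match:
>                   counter[match.group(1)] += 1
>
>   for vm in paused:
>       print(vm, "paused:", paused[vm], "resumed:", resumed[vm])
>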
> On this Red Hat Enterprise Virtualization Hypervisor 6.6
> (20150512.0.el6ev) host, two other VMs paused but then automatically
> resumed without system administrator intervention...
>
> In our DR cluster, 22 VMs also resumed automatically...
>
> None of these guest VMs is engaged in heavy I/O, as they are DR-site VMs
> that are not currently doing anything.
>
> We sent this information to Dell.  Their response:
>
> "The root cause may reside within your virtualization solution, not the
> parent OS (RHEV-Hypervisor disc) or Storage (Dell Compellent.)"
>
> We are performing this failover again on Sunday, November 29th, so we
> would like to know how to mitigate this issue, given that we otherwise
> have to manually resume any paused VMs that don't resume on their own.
>
> Before we initiated the SAN controller failover, all iSCSI paths to the
> targets were present on host tulhv2p03.
>
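> (To verify that quickly, we list the logged-in sessions with iscsiadm;
> a minimal sketch -- note that iscsiadm exits non-zero when there are no
> active sessions, which check_output surfaces as an exception:)
>
>   import subprocess
>
>   # 'iscsiadm -m session' prints one line per logged-in session
>   # (portal and target IQN); it exits non-zero if none are active.
>   out = subprocess.check_output(["iscsiadm", "-m", "session"]).decode()
>   sessions = [line for line in out.splitlines() if line.strip()]
>   print(len(sessions), "active iSCSI sessions")
>   for session in sessions:
>       print(" ", session)
>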
> The VM's log on the host, /var/log/libvirt/qemu/badhost.log, shows that a
> storage error was reported:
>
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
>
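> A small sketch to count those block I/O error lines per guest across all
> the qemu logs on a host (paths as laid out on this hypervisor):
>
>   import glob
>   import os
>
>   # Tally 'block I/O error' occurrences per guest log under
>   # /var/log/libvirt/qemu/ (one log per defined domain).
>   for path in sorted(glob.glob("/var/log/libvirt/qemu/*.log")):
>       with open(path, errors="replace") as log:
>           hits = sum(1 for line in log if "block I/O error" in line)
>       if hits:
>           print(os.path.basename(path), hits)
>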
> All disks used by this guest VM are provided by a single storage domain,
> COM_3TB4_DR, with serial "270."  In syslog we do see that all paths for
> that storage domain failed:
>
> Nov 13 16:47:40 multipathd: 36000d310005caf000000000000000270: remaining
> active paths: 0
>
> Though these paths recovered later:
>
> Nov 13 16:59:17 multipathd: 36000d310005caf000000000000000270: sdbg -
> tur checker reports path is up
> Nov 13 16:59:17 multipathd: 36000d310005caf000000000000000270: remaining
> active paths: 8
>
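> To watch the active path count per map without tailing syslog, we parse
> 'multipath -ll' output; a rough sketch (it assumes user_friendly_names is
> off, so each map header line starts with the WWID -- the output format
> varies between versions, so adjust the matching as needed):
>
>   import re
>   import subprocess
>
>   # Map header lines start with the WWID (e.g. 36000d310...); the path
>   # lines below it carry the state, e.g. 'active ready running'.
>   out = subprocess.check_output(["multipath", "-ll"]).decode()
>   counts, current = {}, None
>   for line in out.splitlines():
>       header = re.match(r"^(3[0-9a-f]{32})\s", line)
>       if header:
>           current = header.group(1)
>           counts[current] = 0
>       elif current and "active ready" in line:
>           counts[current] += 1
>
>   for wwid, active in counts.items():
>       print(wwid, active, "active paths")
>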
> Does anyone have an idea why the VM would fail to resume automatically
> if the iSCSI paths used by its storage domain recovered?
>
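> In the meantime, as a stopgap for the manual recovery, paused domains can
> be resumed on the host itself via libvirt; a sketch using the libvirt
> Python bindings (caveat: VDSM owns these domains on RHEV hosts, so
> resuming behind its back is a last resort -- the supported path is "Run"
> from the engine):
>
>   import libvirt
>
>   # Resume qemu domains that are paused specifically due to an I/O error.
>   conn = libvirt.open("qemu:///system")
>   for dom in conn.listAllDomains():
>       state, reason = dom.state()
>       if (state == libvirt.VIR_DOMAIN_PAUSED
>               and reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR):
>           print("resuming", dom.name())
>           dom.resume()
>   conn.close()
>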
> Thanks
> Doug
>
> --
> Thanks
>
> Douglas Charles Duckworth
> Unix Administrator
> Tulane University
> Technology Services
> 1555 Poydras Ave
> NOLA -- 70112
>
> E: duckd at tulane.edu
> O: 504-988-9341
> F: 504-988-8505
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>