[ovirt-users] One RHEV Virtual Machine does not Automatically Resume following Compellent SAN Controller Failover

Mon May 30 07:43:18 UTC 2016

We see exactly the same behavior, and it does not seem to be vendor dependent.

- Equallogic controller failover -> VMs get paused; some are unpaused
again, but most are not
- Nexenta ZFS iSCSI with RSF1 HA -> same
- FreeBSD ctld iscsi-target + Heartbeat -> same
- CentOS + iscsi-target + Heartbeat -> same

Multipath settings are, where available, modified to match the best
practice supplied by the vendor. On open source solutions we started
with known-working multipath/iSCSI settings, and by now we have tested
nearly every possible setting, without much success.

To me it looks like oVirt/RHEV is way too sensitive to iSCSI
interruptions, and it feels like gambling what the engine might do to
your VM (or not).

On 11/23/2015 at 8:37 PM, Duckworth, Douglas C wrote:
> Hello --
> 
> Not sure if y'all can help with this issue we've been seeing with RHEV...
> 
> On 11/13/2015, during Code Upgrade of Compellent SAN at our Disaster
> Recovery Site, we Failed Over to Secondary SAN Controller.  Most Virtual
> Machines in our DR Cluster Resumed automatically after Pausing except VM
> "BADVM" on Host "BADHOST."
> 
> In Engine.log you can see that BADVM was sent into "VM_PAUSED_EIO" state
> at 10:47:57:
> 
> "VM BADVM has paused due to storage I/O problem."
> 
> On this Red Hat Enterprise Virtualization Hypervisor 6.6
> (20150512.0.el6ev) Host, two other VMs paused but then automatically
> resumed without System Administrator intervention...
> 
> In our DR Cluster, 22 VMs also resumed automatically...
> 
> None of these Guest VMs are engaged in high I/O as these are DR site VMs
> not currently doing anything.
> 
> We sent this information to Dell.  Their response:
> 
> "The root cause may reside within your virtualization solution, not the
> parent OS (RHEV-Hypervisor disc) or Storage (Dell Compellent.)"
> 
> We are doing this Failover again on Sunday November 29th so we would
> like to know how to mitigate this issue, given we have to manually
> resume paused VMs that don't resume automatically.
> 
> Before we initiated SAN Controller Failover, all iSCSI paths to Targets
> were present on Host tulhv2p03.
> 
> VM logs on Host show in /var/log/libvirt/qemu/badhost.log that Storage
> error was reported:
> 
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
> 
> All disks used by this Guest VM are provided by single Storage Domain
> COM_3TB4_DR with serial "270."  In syslog we do see that all paths for
> that Storage Domain Failed:
> 
> Nov 13 16:47:40 multipathd: 36000d310005caf000000000000000270: remaining
> active paths: 0
> 
> Though these recovered later:
> 
> Nov 13 16:59:17 multipathd: 36000d310005caf000000000000000270: sdbg -
> tur checker reports path is up
> Nov 13 16:59:17 multipathd: 36000d310005caf000000000000000270: remaining
> active paths: 8
> 
> Does anyone have an idea of why the VM would fail to automatically
> resume if the iSCSI paths used by its Storage Domain recovered?
> 
> Thanks
> Doug
>
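
When comparing hosts or storage backends, it can help to measure how long
each multipath device actually had zero active paths. The following is a
minimal sketch, not an oVirt or multipath-tools utility: it assumes syslog
lines shaped like the multipathd messages quoted above (unwrapped onto one
line each), and the function name outage_windows and the year argument are
made up for illustration, since syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Matches lines such as (hostname field optional):
#   Nov 13 16:47:40 multipathd: 36000d31...270: remaining active paths: 0
PATHS_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+.*?multipathd(?:\[\d+\])?:\s+"
    r"(?P<wwid>\S+):\s+remaining\s+active\s+paths:\s+(?P<n>\d+)"
)


def outage_windows(lines, year):
    """Return (wwid, down_at, up_at, seconds) for each all-paths-down window.

    A window opens at the first "remaining active paths: 0" line for a WWID
    and closes at the next line for that WWID reporting at least one path.
    """
    down_since = {}  # wwid -> datetime when active paths first hit 0
    windows = []
    for line in lines:
        m = PATHS_RE.match(line)
        if not m:
            continue  # not a "remaining active paths" message
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        wwid, n = m["wwid"], int(m["n"])
        if n == 0:
            down_since.setdefault(wwid, ts)
        elif wwid in down_since:
            start = down_since.pop(wwid)
            windows.append((wwid, start, ts, (ts - start).total_seconds()))
    return windows


# The three multipathd lines from the quoted message, joined back onto
# single lines (the mail client had wrapped them).
SAMPLE = [
    "Nov 13 16:47:40 multipathd: 36000d310005caf000000000000000270: remaining active paths: 0",
    "Nov 13 16:59:17 multipathd: 36000d310005caf000000000000000270: sdbg - tur checker reports path is up",
    "Nov 13 16:59:17 multipathd: 36000d310005caf000000000000000270: remaining active paths: 8",
]

if __name__ == "__main__":
    for wwid, down_at, up_at, secs in outage_windows(SAMPLE, 2015):
        print(f"{wwid}: no active paths from {down_at} to {up_at} ({secs:.0f}s)")
```

For the quoted excerpt this gives a single window from 16:47:40 to
16:59:17, i.e. 11 minutes 37 seconds with no active path. Comparing such
windows with the pause times in engine.log, for VMs that resumed and VMs
that did not, may help narrow down where the behavior diverges.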