[ovirt-users] One RHEV Virtual Machine does not Automatically Resume following Compellent SAN Controller Failover
InterNetX - Juergen Gotteswinter
jg at internetx.com
Mon May 30 03:43:18 EDT 2016
We see exactly the same, and it does not seem to be vendor dependent.
- Equallogic controller failover -> VMs get paused and are sometimes
unpaused, but most don't resume
- Nexenta ZFS iSCSI with RSF1 HA -> same
- FreeBSD ctld iscsi-target + Heartbeat -> same
- CentOS + iscsi-target + Heartbeat -> same
Multipath settings are, where available, modified to match the best
practice supplied by the vendor. On the open source solutions we started
with known working multipath/iSCSI settings, and by now nearly every
possible setting has been tested, without much success.
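To give an idea of what that tuning looks like, here is a minimal sketch
of the kind of multipath.conf defaults we experimented with (values are
illustrative only, not a recommendation for any particular array):

    defaults {
        polling_interval    5           # seconds between path checks
        path_checker        tur         # TEST UNIT READY on each path
        failback            immediate   # fail back as soon as a path returns
        # no_path_retry is the knob that matters most here: a retry count
        # (or "queue") keeps guest I/O queued while all paths are down
        # during a controller failover; 0 or "fail" returns EIO to the
        # guest immediately, which is what pauses the VM
        no_path_retry       12
    }

Keep in mind that VDSM manages its own multipath.conf on oVirt/RHEV
hosts, so a change like this has to be reconciled with whatever the
hypervisor ships, otherwise it may be overwritten.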
To me it looks like oVirt/RHEV is way too sensitive to iSCSI
interruptions, and it feels like gambling what the engine might do to
your VM (or not).
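For what it's worth, a rough sketch of finding and resuming paused
guests manually from the host side (not a supported procedure; on
RHEV-H virsh may require authentication, and the clean way is to resume
the VM through the engine):

    # list paused domains, show why libvirt thinks they are paused,
    # then resume them
    for dom in $(virsh list --state-paused --name); do
        virsh domstate "$dom" --reason
        virsh resume "$dom"
    done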
On 11/23/2015 at 8:37 PM, Duckworth, Douglas C wrote:
> Hello --
>
> Not sure if y'all can help with this issue we've been seeing with RHEV...
>
> On 11/13/2015, during a Code Upgrade of the Compellent SAN at our Disaster
> Recovery Site, we failed over to the Secondary SAN Controller. Most Virtual
> Machines in our DR Cluster resumed automatically after pausing, except VM
> "BADVM" on Host "BADHOST."
>
> In Engine.log you can see that BADVM was sent into "VM_PAUSED_EIO" state
> at 10:47:57:
>
> "VM BADVM has paused due to storage I/O problem."
>
> On this Red Hat Enterprise Virtualization Hypervisor 6.6
> (20150512.0.el6ev) Host, two other VMs paused but then automatically
> resumed without System Administrator intervention...
>
> In our DR Cluster, 22 VMs also resumed automatically...
>
> None of these Guest VMs are engaged in high I/O as these are DR site VMs
> not currently doing anything.
>
> We sent this information to Dell. Their response:
>
> "The root cause may reside within your virtualization solution, not the
> parent OS (RHEV-Hypervisor disc) or Storage (Dell Compellent.)"
>
> We are doing this Failover again on Sunday November 29th so we would
> like to know how to mitigate this issue, given we have to manually
> resume paused VMs that don't resume automatically.
>
> Before we initiated SAN Controller Failover, all iSCSI paths to Targets
> were present on Host tulhv2p03.
>
> VM logs on the Host, in /var/log/libvirt/qemu/badhost.log, show that a
> Storage error was reported:
>
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
>
> All disks used by this Guest VM are provided by a single Storage Domain,
> COM_3TB4_DR, with serial "270." In syslog we do see that all paths for
> that Storage Domain failed:
>
> Nov 13 16:47:40 multipathd: 36000d310005caf000000000000000270: remaining
> active paths: 0
>
> Though these recovered later:
>
> Nov 13 16:59:17 multipathd: 36000d310005caf000000000000000270: sdbg -
> tur checker reports path is up
> Nov 13 16:59:17 multipathd: 36000d310005caf000000000000000270: remaining
> active paths: 8
>
> Does anyone have an idea of why the VM would fail to automatically
> resume if the iSCSI paths used by its Storage Domain recovered?
>
> Thanks
> Doug
>