oVirt Instability with Dell Compellent via iSCSI/Multipath

Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN.
Networking: Dell 10 Gb/s switches

I've been struggling with oVirt completely falling apart due to storage related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this.

2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld
2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld
2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld
2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld
2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld
2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld
2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld
2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. vds: blade4c2.ism.ld
2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld

My troubleshooting steps so far:
1. Tailing engine.log for "in problem" and "recovered from problem"
2. Shutting down all the VMs.
3. Shutting down all but one node.
4. Bringing up one node at a time to see what the log reports.

When only one node is active everything is fine. When a second node comes up, I begin to see the log output shown above. I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out whether it's oVirt or something that I'm doing wrong.

We're close to giving up on oVirt completely because of this.

P.S.

I've tested via bare metal and Proxmox with the Compellent. Not at the same scale, but it seems to work fine there.
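For anyone following the same troubleshooting approach, a minimal way to watch engine.log for these monitoring flaps (assuming the default engine log location on the engine host) is:

tail -F /var/log/ovirt-engine/engine.log | grep -E "in problem|recovered from problem"

# Count "in problem" events per host in an existing log:
grep "in problem. vds:" /var/log/ovirt-engine/engine.log | awk '{print $NF}' | sort | uniq -c | sort -rn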

Hello Chris, On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator wrote:
Engine: oVirt Engine Version: 3.5.2-1.el7.centos Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos Remote storage: Dell Compellent SC8000 Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN. Networking: Dell 10 Gb/s switches
I've been struggling with oVirt completely falling apart due to storage related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this.
2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld 2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld 2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld 2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld 2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld 2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld 2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld 2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld
My troubleshooting steps so far:
1. Tailing engine.log for "in problem" and "recovered from problem" 2. Shutting down all the VMs. 3. Shutting down all but one node. 4. Bringing up one node at a time to see what the log reports.
vdsm.log on the node side will help here too.
When only one node is active everything is fine. When a second node comes up, I begin to see the log output as shown above. I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out if it's oVirt or something that I'm doing wrong.
We're close to giving up on oVirt completely because of this.
P.S.
I've tested via bare metal and Proxmox with the Compellent. Not at the same scale but it seems to work fine there.
Do you mind sharing your vdsm.log from the oVirt Node machine?
To get to a console on oVirt Node, press F2 in the TUI. Files: /var/log/vdsm/vdsm*.log
# rpm -qa | grep -i vdsm might help too.
Thanks!
--
Cheers
Douglas

----- Original Message -----
Hello Chris,
On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator wrote:
Engine: oVirt Engine Version: 3.5.2-1.el7.centos Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos Remote storage: Dell Compellent SC8000 Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN. Networking: Dell 10 Gb/s switches
I've been struggling with oVirt completely falling apart due to storage related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this.
2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld 2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld 2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld 2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld 2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld 2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld 2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld 2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld
My troubleshooting steps so far:
1. Tailing engine.log for "in problem" and "recovered from problem" 2. Shutting down all the VMs. 3. Shutting down all but one node. 4. Bringing up one node at a time to see what the log reports.
vdsm.log on the node side will help here too.
When only one node is active everything is fine. When a second node comes up, I begin to see the log output as shown above. I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out if it's oVirt or something that I'm doing wrong.
We're close to giving up on oVirt completely because of this.
P.S.
I've tested via bare metal and Proxmox with the Compellent. Not at the same scale but it seems to work fine there.
Do you mind sharing your vdsm.log from the oVirt Node machine?
To get to a console on oVirt Node, press F2 in the TUI. Files: /var/log/vdsm/vdsm*.log
# rpm -qa | grep -i vdsm might help too.
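As a side note, a simple way to bundle the files Douglas asks for into a single attachment (the archive name below is just an example) could be:

# Collect the vdsm logs plus the installed vdsm package list for a bug report.
rpm -qa | grep -i vdsm > /tmp/vdsm-packages.txt
tar czf /tmp/vdsm-logs-$(hostname -s).tar.gz /var/log/vdsm/vdsm*.log /tmp/vdsm-packages.txt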
Hey Chris, please open a bug [1] for this, then we can track it and we can help to identify the issue. I'm not seeing anything suspicious from the logs above. But adding the logs Douglas mentions above to the new bug will help us to get more insight. Greetings fabian --- [1] https://bugzilla.redhat.com/enter_bug.cgi?product=oVirt&component=ovirt-node

vdsm.log on the node side will help here too.
https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log contains only the messages from when a host became unresponsive due to storage issues onward.
# rpm -qa | grep -i vdsm might help too.
vdsm-cli-4.16.14-0.el7.noarch vdsm-reg-4.16.14-0.el7.noarch ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch vdsm-python-zombiereaper-4.16.14-0.el7.noarch vdsm-xmlrpc-4.16.14-0.el7.noarch vdsm-yajsonrpc-4.16.14-0.el7.noarch vdsm-4.16.14-0.el7.x86_64 vdsm-gluster-4.16.14-0.el7.noarch vdsm-hook-ethtool-options-4.16.14-0.el7.noarch vdsm-python-4.16.14-0.el7.noarch vdsm-jsonrpc-4.16.14-0.el7.noarch
Hey Chris,
please open a bug [1] for this, then we can track it and we can help to identify the issue.
I will do so.

----- Original Message -----
From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones@bookit.com> To: users@ovirt.org Sent: Thursday, May 21, 2015 12:49:50 AM Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
vdsm.log on the node side will help here too.
https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log contains only the messages from when a host became unresponsive due to storage issues onward.
According to the log, you have a real issue accessing storage from the host:

[nsoffer@thin untitled (master)]$ repostat vdsm.log
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
    delay      avg: 0.000856 min: 0.000000 max: 0.001168
    last check avg: 11.510000 min: 0.300000 max: 64.100000
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
    delay      avg: 0.008358 min: 0.000000 max: 0.040269
    last check avg: 11.863333 min: 0.300000 max: 63.400000
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
    delay      avg: 0.007793 min: 0.000819 max: 0.041316
    last check avg: 11.466667 min: 0.000000 max: 70.200000
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
    delay      avg: 0.000493 min: 0.000374 max: 0.000698
    last check avg: 4.860000 min: 0.200000 max: 9.900000
domain: b050c455-5ab1-4107-b055-bfcc811195fc
    delay      avg: 0.002080 min: 0.000000 max: 0.040142
    last check avg: 11.830000 min: 0.000000 max: 63.700000
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
    delay      avg: 0.004798 min: 0.000000 max: 0.041006
    last check avg: 18.423333 min: 1.400000 max: 102.900000
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
    delay      avg: 0.001002 min: 0.000000 max: 0.001199
    last check avg: 11.560000 min: 0.300000 max: 61.700000
domain: 20153412-f77a-4944-b252-ff06a78a1d64
    delay      avg: 0.003748 min: 0.000000 max: 0.040903
    last check avg: 12.180000 min: 0.000000 max: 67.200000
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
    delay      avg: 0.000963 min: 0.000000 max: 0.001209
    last check avg: 10.993333 min: 0.000000 max: 64.300000
domain: 0137183b-ea40-49b1-b617-256f47367280
    delay      avg: 0.000881 min: 0.000000 max: 0.001227
    last check avg: 11.086667 min: 0.100000 max: 63.200000

Note the high last check maximum values (e.g. 102 seconds).

Vdsm has a monitor thread for each domain, doing a read from one of the storage domain's special disks every 10 seconds. When we see a high last check value, it means that the monitor thread is stuck reading from the disk.

This is an indicator that VMs may have trouble accessing these storage domains, and engine handles this by making the host non-operational or, if all hosts cannot access the domain, by making the domain inactive.

One of the known issues that can be related is bad multipath configuration. Some storage servers have a bad built-in configuration embedded in multipath, in particular "no_path_retry queue" or "no_path_retry 60". These settings mean that when the SCSI layer fails and multipath does not have any active path, it will queue I/O forever (queue) or retry many times (e.g. 60) before failing the I/O request.

This can lead to a stuck process, doing a read or write that never fails or takes many minutes to fail. Vdsm is not designed to handle such delays - a stuck thread may block other unrelated threads.

Vdsm includes special configuration for your storage vendor (COMPELNT), but maybe it does not match the product (Compellent Vol). See https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/multip...

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    no_path_retry fail
}

Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the built-in setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration instead of the one vdsm configures:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}

To verify that your devices match this, you can check the device's vendor and product strings in the output of "multipath -ll". I would like to see the output of this command.

Another platform issue is the bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for I/O to complete on one path before failing the I/O request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path.

Multipath is trying to set this value to 5 seconds, but this value reverts to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in RHEL/CentOS 7.2: https://bugzilla.redhat.com/1139038

This issue together with "no_path_retry queue" is a very bad mix for ovirt.

You can fix this timeout by setting:

# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5

And restarting the iscsid service.

With these tweaks, the issue may be resolved.

I hope it helps.
Nir
# rpm -qa | grep -i vdsm might help too.
vdsm-cli-4.16.14-0.el7.noarch vdsm-reg-4.16.14-0.el7.noarch ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch vdsm-python-zombiereaper-4.16.14-0.el7.noarch vdsm-xmlrpc-4.16.14-0.el7.noarch vdsm-yajsonrpc-4.16.14-0.el7.noarch vdsm-4.16.14-0.el7.x86_64 vdsm-gluster-4.16.14-0.el7.noarch vdsm-hook-ethtool-options-4.16.14-0.el7.noarch vdsm-python-4.16.14-0.el7.noarch vdsm-jsonrpc-4.16.14-0.el7.noarch
Hey Chris,
please open a bug [1] for this, then we can track it and we can help to identify the issue.
I will do so.

On 05/20/2015 07:10 PM, Nir Soffer wrote:
----- Original Message -----
From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones@bookit.com> To: users@ovirt.org Sent: Thursday, May 21, 2015 12:49:50 AM Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
vdsm.log on the node side will help here too.
https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log contains only the messages from when a host became unresponsive due to storage issues onward.
According to the log, you have a real issue accessing storage from the host:
[nsoffer@thin untitled (master)]$ repostat vdsm.log
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
    delay      avg: 0.000856 min: 0.000000 max: 0.001168
    last check avg: 11.510000 min: 0.300000 max: 64.100000
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
    delay      avg: 0.008358 min: 0.000000 max: 0.040269
    last check avg: 11.863333 min: 0.300000 max: 63.400000
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
    delay      avg: 0.007793 min: 0.000819 max: 0.041316
    last check avg: 11.466667 min: 0.000000 max: 70.200000
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
    delay      avg: 0.000493 min: 0.000374 max: 0.000698
    last check avg: 4.860000 min: 0.200000 max: 9.900000
domain: b050c455-5ab1-4107-b055-bfcc811195fc
    delay      avg: 0.002080 min: 0.000000 max: 0.040142
    last check avg: 11.830000 min: 0.000000 max: 63.700000
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
    delay      avg: 0.004798 min: 0.000000 max: 0.041006
    last check avg: 18.423333 min: 1.400000 max: 102.900000
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
    delay      avg: 0.001002 min: 0.000000 max: 0.001199
    last check avg: 11.560000 min: 0.300000 max: 61.700000
domain: 20153412-f77a-4944-b252-ff06a78a1d64
    delay      avg: 0.003748 min: 0.000000 max: 0.040903
    last check avg: 12.180000 min: 0.000000 max: 67.200000
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
    delay      avg: 0.000963 min: 0.000000 max: 0.001209
    last check avg: 10.993333 min: 0.000000 max: 64.300000
domain: 0137183b-ea40-49b1-b617-256f47367280
    delay      avg: 0.000881 min: 0.000000 max: 0.001227
    last check avg: 11.086667 min: 0.100000 max: 63.200000
Note the high last check maximum value (e.g. 102 seconds).
Vdsm has a monitor thread for each domain, doing a read from one of the storage domain's special disks every 10 seconds. When we see a high last check value, it means that the monitor thread is stuck reading from the disk.
This is an indicator that VMs may have trouble accessing these storage domains, and engine handles this by making the host non-operational or, if all hosts cannot access the domain, by making the domain inactive.
One of the known issues that can be related is bad multipath configuration. Some storage servers have a bad built-in configuration embedded in multipath, in particular "no_path_retry queue" or "no_path_retry 60". These settings mean that when the SCSI layer fails and multipath does not have any active path, it will queue I/O forever (queue) or retry many times (e.g. 60) before failing the I/O request.
This can lead to a stuck process, doing a read or write that never fails or takes many minutes to fail. Vdsm is not designed to handle such delays - a stuck thread may block other unrelated threads.
Vdsm includes special configuration for your storage vendor (COMPELNT), but maybe it does not match the product (Compellent Vol). See https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/multip...
device {
    vendor "COMPELNT"
    product "Compellent Vol"
    no_path_retry fail
}
Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the built-in setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration instead of the one vdsm configures:
device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}
To verify that your devices match this, you can check the device's vendor and product strings in the output of "multipath -ll". I would like to see the output of this command.
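As an aside (not part of the original reply): one way to try such a device stanza and confirm multipath picks it up, assuming the standard /etc/multipath.conf location and the multipathd interactive CLI, is roughly the following. Note the later discussion in this thread about vdsm overwriting multipath.conf and the "# RHEV PRIVATE" marker.

# After adding the device { ... } stanza to the devices { } section of
# /etc/multipath.conf, ask multipathd to re-read its configuration.
multipathd -k"reconfigure"

# Check the vendor/product strings and the effective settings per LUN.
multipath -ll | grep -i compelnt
multipathd -k"show config" | grep -A 12 COMPELNT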
Another platform issue is the bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for I/O to complete on one path before failing the I/O request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path.
Multipath is trying to set this value to 5 seconds, but this value reverts to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in RHEL/CentOS 7.2: https://bugzilla.redhat.com/1139038
This issue together with "no_path_retry queue" is a very bad mix for ovirt.
You can fix this timeout by setting:
# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5
And restarting the iscsid service.
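A possible way to apply and verify that change on a CentOS 7 host (the verification command is only illustrative; existing sessions may additionally need a re-login to pick up node settings):

# Set the replacement timeout suggested above.
sed -i 's/^node\.session\.timeo\.replacement_timeout *=.*/node.session.timeo.replacement_timeout = 5/' /etc/iscsi/iscsid.conf
systemctl restart iscsid

# Look for "Recovery Timeout" in the session details.
iscsiadm -m session -P 3 | grep -i timeout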
Chris, as you are using ovirt-node, after Nir's suggestions please also execute the command below to save the settings changes across reboots:
# persist /etc/iscsi/iscsid.conf
With these tweaks, the issue may be resolved.
I hope it helps.
Nir
# rpm -qa | grep -i vdsm might help too.
vdsm-cli-4.16.14-0.el7.noarch vdsm-reg-4.16.14-0.el7.noarch ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch vdsm-python-zombiereaper-4.16.14-0.el7.noarch vdsm-xmlrpc-4.16.14-0.el7.noarch vdsm-yajsonrpc-4.16.14-0.el7.noarch vdsm-4.16.14-0.el7.x86_64 vdsm-gluster-4.16.14-0.el7.noarch vdsm-hook-ethtool-options-4.16.14-0.el7.noarch vdsm-python-4.16.14-0.el7.noarch vdsm-jsonrpc-4.16.14-0.el7.noarch
Hey Chris,
please open a bug [1] for this, then we can track it and we can help to identify the issue.
I will do so.
-- Cheers Douglas

Chris, as you are using ovirt-node, after Nir's suggestions please also execute the command below to save the settings changes across reboots:
# persist /etc/iscsi/iscsid.conf
Thanks. I will do so, but first I have to resolve not being able to update multipath.conf as described in my previous email.

Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the built-in setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration instead of the one vdsm configures:
device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}
I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal, but when I tried updating it, oVirt instantly overwrites it. To be clear, yes I know changes do not survive reboots and yes I know about persist, but it changes it while running. Live! Persist won't help there.

I also tried building a CentOS 7 "thick client" where I set up CentOS 7 first, added the oVirt repo, then let the engine provision it. Same problem with multipath.conf being overwritten with the default oVirt setup.

So I tried to be slick about it. I made multipath.conf immutable. That prevented the engine from being able to activate the node. It would fail on a vds command that gets the node's capabilities; part of what it does is read and then overwrite multipath.conf.

How do I safely update multipath.conf?
To verify that your devices match this, you can check the devices vendor and procut strings in the output of "multipath -ll". I would like to see the output of this command.
multipath -ll (default setup) can be seen here. http://paste.linux-help.org/view/430c7538
Another platform issue is the bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for I/O to complete on one path before failing the I/O request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path.
Multipath is trying to set this value to 5 seconds, but this value reverts to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in RHEL/CentOS 7.2: https://bugzilla.redhat.com/1139038
This issue together with "no_path_retry queue" is a very bad mix for ovirt.
You can fix this timeout by setting:
# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5
I'll see if that's possible with persist. Will this change survive node upgrades? Thanks for the reply and the suggestions.

On 05/21/2015 02:47 AM, Chris Jones - BookIt.com Systems Administrator wrote:
I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal but when I tried updating it, oVirt instantly overwrites it. To be clear, yes I know changes do not survive reboots and yes I know about persist, but it changes it while running. Live! Persist won't help there.
I also tried building a CentOS 7 "thick client" where I set up CentOS 7 first, added the oVirt repo, then let the engine provision it. Same problem with multipath.conf being overwritten with the default oVirt setup.
So I tried to be slick about it. I made the multipath.conf immutable. That prevented the engine from being able to activate the node. It would fail on a vds command that gets the nodes capabilities and part of what it does is reads then overwrites multipath.conf.
How do I safely update multipath.conf?

Somehow the multipath.conf that oVirt generates forces my HDD RAID controller disks to change from /dev/sdb* and /dev/sdc*. So I had to blacklist these.

I was able to persist it by adding "# RHEV PRIVATE" right below the "# RHEV REVISION 1.1"

Hope this helps

Met vriendelijke groet, With kind regards,

Jorick Astrego
Netbulae Virtualization Experts
Tel: 053 20 30 270   info@netbulae.eu   Staalsteden 4-3A   KvK 08198180
Fax: 053 20 30 271   www.netbulae.eu    7547 TA Enschede   BTW NL821234584B01

On 21.05.2015 02:48, Chris Jones - BookIt.com Systems Administrator wrote:
Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the built-in setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration instead of the one vdsm configures:
device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}
I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal
I have this issue also. I am thinking about opening a BZ ;)
but when I tried updating it, oVirt
instantly overwrites it. To be clear, yes I know changes do not survive reboots and yes I know about persist, but it changes it while running. Live! Persist won't help there.
I also tried building a CentOS 7 "thick client" where I set up CentOS 7 first, added the oVirt repo, then let the engine provision it. Same problem with multipath.conf being overwritten with the default oVirt setup.
So I tried to be slick about it. I made the multipath.conf immutable. That prevented the engine from being able to activate the node. It would fail on a vds command that gets the nodes capabilities and part of what it does is reads then overwrites multipath.conf.
How do I safely update multipath.conf?
In the second line of your multipath.conf, add:

# RHEV PRIVATE

Then host deploy will ignore it and never change it.
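Putting the pieces from this thread together, the top of a hand-maintained /etc/multipath.conf on these hosts would look roughly like this; the revision line must match whatever vdsm actually wrote on the host, and the device stanza is the one Nir suggested earlier:

# RHEV REVISION 1.1
# RHEV PRIVATE

devices {
    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy "multibus"
        path_checker "tur"
        features "0"
        hardware_handler "0"
        prio "const"
        failback "immediate"
        rr_weight "uniform"
        no_path_retry fail
    }
}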
To verify that your devices match this, you can check the devices vendor and procut strings in the output of "multipath -ll". I would like to see the output of this command.
multipath -ll (default setup) can be seen here. http://paste.linux-help.org/view/430c7538
Another platform issue is the bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for I/O to complete on one path before failing the I/O request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path.
Multipath is trying to set this value to 5 seconds, but this value reverts to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in RHEL/CentOS 7.2: https://bugzilla.redhat.com/1139038
This issue together with "no_path_retry queue" is a very bad mix for ovirt.
You can fix this timeout by setting:
# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5
I'll see if that's possible with persist. Will this change survive node upgrades?
Thanks for the reply and the suggestions.
-- Daniel Helgenberger m box bewegtbild GmbH P: +49/30/2408781-22 F: +49/30/2408781-10 ACKERSTR. 19 D-10115 BERLIN www.m-box.de www.monkeymen.tv Geschäftsführer: Martin Retschitzegger / Michaela Göllner Handeslregister: Amtsgericht Charlottenburg / HRB 112767

I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.

I'm still seeing the domain "in problem" and "recovered from problem" warnings in engine.log, though. They were happening only when hosts were activating and when I was mass launching many VMs. Is this normal?

2015-05-21 15:31:32,264 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c2.ism.ld
2015-05-21 15:31:47,468 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c2.ism.ld

Here's the vdsm log from a node the engine was warning about: https://www.dropbox.com/s/yaubaxax1w499f1/vdsm2.log.gz?dl=1. It's trimmed to just before and after it happened.

What is that repostat command from your previous email, Nir ("repostat vdsm.log")? I don't see it on the engine or the node. Is it used to parse the log? Where can I find it?

Thanks again.

On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator wrote:
I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.
I take it back. This did not solve the issue. I tried batch starting the VMs and half the nodes went down due to the same storage issues. VDSM Logs again. https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1

----- Original Message -----
On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator wrote:
I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.
I take it back. This did not solve the issue. I tried batch starting the VMs and half the nodes went down due to the same storage issues. VDSM
Is there maybe some IO problem on the iSCSI target side? IIUIC the problem is some timeout, which could indicate that the target is overloaded. But maybe I'm getting something wrong ... - fabian
Logs again. https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1
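On Fabian's question about the target side: a generic way to see whether the latency shows up at the block-device level on the hosts while the problem is reproduced (sysstat's iostat assumed installed; this is only an illustration, not something from the thread) is:

# Sustained high await on the dm-* multipath devices under load points at
# the target or the fabric rather than at vdsm itself.
iostat -xmt 2 | grep -E "Device|^dm-"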

Is there maybe some IO problem on the iSCSI target side? IIUIC the problem is some timeout, which could indicate that the target is overloaded.
Maybe. I need to check with Dell. I did manage to get it to be a little more stable with this config.

defaults {
    polling_interval 10
    path_selector "round-robin 0"
    path_grouping_policy multibus
    getuid_callout "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    path_checker readsector0
    rr_min_io_rq 100
    max_fds 8192
    rr_weight priorities
    failback immediate
    no_path_retry fail
    user_friendly_names no
}

devices {
    device {
        vendor COMPELNT
        product "Compellent Vol"
        path_checker tur
        no_path_retry fail
    }
}

I referenced it from http://en.community.dell.com/techcenter/enterprise-solutions/w/oracle_soluti.... I modified it a bit since that is Red Hat 5 specific and there have been some changes.

It's not crashing anymore but I'm still seeing storage warnings in engine.log. I'm going to be enabling jumbo frames and talking with Dell to figure out if it's something on the Compellent side. I'll update here once I find something out.

Thanks again for all the help.
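Since jumbo frames come up as the next step: one quick end-to-end MTU check, with placeholder interface name and portal address, is:

# Placeholders: eth2 = storage NIC, 10.0.0.10 = a Compellent iSCSI portal IP.
ip link set dev eth2 mtu 9000
# 8972 = 9000 minus 28 bytes of IP/ICMP headers; -M do forbids fragmentation,
# so the ping only succeeds if every hop really passes 9000-byte frames.
ping -M do -s 8972 -c 3 10.0.0.10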

----- Original Message -----
From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones@bookit.com> To: users@ovirt.org Sent: Friday, May 22, 2015 8:55:37 PM Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Is there maybe some IO problem on the iSCSI target side? IIUIC the problem is some timeout, which could indicate that the target is overloaded.
Maybe. I need to check with Dell. I did manage to get it to be a little more stable with this config.
defaults {
    polling_interval 10
    path_selector "round-robin 0"
    path_grouping_policy multibus
    getuid_callout "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    path_checker readsector0
    rr_min_io_rq 100
    max_fds 8192
    rr_weight priorities
    failback immediate
    no_path_retry fail
    user_friendly_names no
You should keep the defaults section unchanged, and add specific settings under the device section.
}
devices {
    device {
        vendor COMPELNT
        product "Compellent Vol"
        path_checker tur
        no_path_retry fail
This is most likely missing some settings. You are *not* getting the settings from the "defaults" section above. For example, since you did not specify "failback immediate" here, failback for this device defaults to whatever default multipath chooses, not the value set in "defaults" above.
    }
}
I referenced it from http://en.community.dell.com/techcenter/enterprise-solutions/w/oracle_soluti.... I modified it a bit since that is Red Hat 5 specific and there have been some changes.
It's not crashing anymore but I'm still seeing storage warnings in engine.log. I'm going to be enabling jumbo frames and talking with Dell to figure out if it's something on the Compellent side. I'll update here once I find something out.
Let's continue this on Bugzilla. See also this patch: https://gerrit.ovirt.org/41244
Nir
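To illustrate Nir's point above about device-section settings not inheriting from "defaults": a merged stanza built only from values already quoted in this thread (whether these are the right values for an SC8000 is exactly the open question) would look something like:

devices {
    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy multibus
        path_checker tur
        failback immediate
        rr_weight priorities
        rr_min_io_rq 100
        no_path_retry fail
    }
}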


----- Original Message -----
From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones@bookit.com> To: users@ovirt.org Sent: Friday, May 22, 2015 12:32:01 AM Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator wrote:
I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.
I take it back. This did not solve the issue. I tried batch starting the VMs and half the nodes went down due to the same storage issues. VDSM Logs again. https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1
It is possible that the multipath configuration I suggested is not optimized correctly for your server, or that it is too old (last updated in 2013). Or you have some issues in the network or the storage server. I would continue with the storage vendor.
Nir

----- Original Message -----
From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones@bookit.com> To: users@ovirt.org Sent: Thursday, May 21, 2015 10:49:23 PM Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.
I'm still seeing the domain "in problem" and "recovered" from problem warnings in engine.log, though. They were happening only when hosts were activating and when I was mass launching many VMs. Is this normal?
2015-05-21 15:31:32,264 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c2.ism.ld
2015-05-21 15:31:47,468 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c2.ism.ld
Here's the vdsm log from a node the engine was warning about https://www.dropbox.com/s/yaubaxax1w499f1/vdsm2.log.gz?dl=1. It's trimmed to just before and after it happened.
What is that repostat command from your previous email, Nir ("repostat vdsm.log")? I don't see it on the engine or the node. Is it used to parse the log? Where can I find it?
It is available here: https://gerrit.ovirt.org/38749
Nir
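As a stopgap until that script is fetched, something like the following pulls the lastCheck values out of vdsm.log; it assumes the repoStats results are logged in the usual 'lastCheck': 'N.N' form, so treat it as a rough sketch rather than a replacement for repostat:

# Print the ten worst lastCheck values recorded in the log.
grep -o "'lastCheck': '[0-9.]*'" /var/log/vdsm/vdsm.log | grep -o '[0-9.]*' | sort -rn | head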

Hi Chris,

I have an oVirt + Dell Compellent setup similar to yours (previous model, not SC8000) and sometimes I faced issues similar to yours.

From my experience I can advise you to:

A) Check links between SAN and servers, all paths, all configuration, cabling. Everything should be set up correctly (all redundant paths green, server mappings, etc.) BEFORE installing oVirt. We had a running KVM environment before "upgrading" it to oVirt 3.5.1.

B) Also check that fencing is working both manually and automatically (connections to iDRAC etc). This is a kind of pre-requisite to have HA working.

C) I also noticed that when something is not going well on one of the shared storages, this brings down the whole cluster (VMs run, but a lot of headaches ensue). First of all, note that oVirt tries to stabilize the situation itself for as long as ~15 minutes or more. It is slow in re-fencing etc. Sometimes it enters a loop and you have to locate the problematic storage. You want to check that multipath on every server is working correctly.

If you are having problems with just two nodes, I guess something is not really OK at the configuration level. I have 2 clusters, 12 hosts and several (lots of) shared storage domains working, and usually when something goes wrong it is because of a human error (like when I deleted the LUN on the SAN before destroying the storage on the oVirt interface). On the other hand, I have the overall impression that the system is not forgiving at all and that it is far from being rock solid.

Cheers
AG
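On point B above (testing fencing by hand): a generic check that a host's iDRAC answers IPMI from the network the engine uses, with placeholder address and credentials, is:

# Placeholders: 10.0.1.21 = an iDRAC IP, root/calvin = example credentials.
# A correct "chassis power status" reply is a good sign that oVirt power
# management (which drives the iDRAC over IPMI) can fence this host.
ipmitool -I lanplus -H 10.0.1.21 -U root -P calvin chassis power status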

Sorry for the delay on this. I am in the process of reproducing the error to get the logs.
On 05/19/2015 07:31 PM, Douglas Schilling Landgraf wrote:
Hello Chris,
On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator wrote:
Engine: oVirt Engine Version: 3.5.2-1.el7.centos Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos Remote storage: Dell Compellent SC8000 Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN. Networking: Dell 10 Gb/s switches
I've been struggling with oVirt completely falling apart due to storage related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this.
2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld 2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld 2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld 2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld 2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld 2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld 2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld 2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld
My troubleshooting steps so far:
1. Tailing engine.log for "in problem" and "recovered from problem" 2. Shutting down all the VMs. 3. Shutting down all but one node. 4. Bringing up one node at a time to see what the log reports.
vdsm.log on the node side will help here too.
When only one node is active everything is fine. When a second node comes up, I begin to see the log output as shown above. I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out if it's oVirt or something that I'm doing wrong.
We're close to giving up on oVirt completely because of this.
P.S.
I've tested via bare metal and Proxmox with the Compellent. Not at the same scale but it seems to work fine there.
Do you mind sharing your vdsm.log from the oVirt Node machine?
To get to a console on oVirt Node, press F2 in the TUI. Files: /var/log/vdsm/vdsm*.log
# rpm -qa | grep -i vdsm might help too.
Thanks!

Since this thread shows up at the top of the search "oVirt compellent", I should mention that this has been solved. The problem was a bad disk in the Compellent's tier 2 storage. The multipath.conf and iscsi.conf advice is still valid, though, and made oVirt more resilient when the Compellent was struggling.
participants (7)
- Andrea Ghelardi
- Chris Jones - BookIt.com Systems Administrator
- Daniel Helgenberger
- Douglas Schilling Landgraf
- Fabian Deutsch
- Jorick Astrego
- Nir Soffer