oVirt Instability with Dell Compellent via iSCSI/Multipath

Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN.
Networking: Dell 10 Gb/s switches

I've been struggling with oVirt completely falling apart due to storage related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this.

2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld
2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld
2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld
2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld
2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld
2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld
2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld
2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. vds: blade4c2.ism.ld
2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld

My troubleshooting steps so far:
1. Tailing engine.log for "in problem" and "recovered from problem"
2. Shutting down all the VMs.
3. Shutting down all but one node.
4. Bringing up one node at a time to see what the log reports.

When only one node is active everything is fine. When a second node comes up, I begin to see the log output shown above. I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out whether it's oVirt or something that I'm doing wrong.

We're close to giving up on oVirt completely because of this.

P.S.

I've tested via bare metal and Proxmox with the Compellent. Not at the same scale, but it seems to work fine there.
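For anyone following the same troubleshooting approach, a minimal way to watch engine.log for these monitoring flaps (assuming the default engine log location on the engine host) is:

tail -F /var/log/ovirt-engine/engine.log | grep -E "in problem|recovered from problem"

# Count "in problem" events per host in an existing log:
grep "in problem. vds:" /var/log/ovirt-engine/engine.log | awk '{print $NF}' | sort | uniq -c | sort -rn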

Hello Chris, On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator wrote:
Engine: oVirt Engine Version: 3.5.2-1.el7.centos Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos Remote storage: Dell Compellent SC8000 Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN. Networking: Dell 10 Gb/s switches
I've been struggling with oVirt completely falling apart due to storage related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this.
2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld 2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld 2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld 2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld 2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld 2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld 2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld 2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld
My troubleshooting steps so far:
1. Tailing engine.log for "in problem" and "recovered from problem" 2. Shutting down all the VMs. 3. Shutting down all but one node. 4. Bringing up one node at a time to see what the log reports.
vdsm.log on the node side will help here too.
When only one node is active everything is fine. When a second node comes up, I begin to see the log output as shown above. I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out if it's oVirt or something that I'm doing wrong.
We're close to giving up on oVirt completely because of this.
P.S.
I've tested via bare metal and Proxmox with the Compellent. Not at the same scale but it seems to work fine there.
Do you mind sharing your vdsm.log from the oVirt Node machine?
To get to a console on oVirt Node, press F2 in the TUI. Files: /var/log/vdsm/vdsm*.log
# rpm -qa | grep -i vdsm might help too.
Thanks!
--
Cheers
Douglas

----- Original Message -----
Hello Chris,
On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator wrote:
Engine: oVirt Engine Version: 3.5.2-1.el7.centos Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos Remote storage: Dell Compellent SC8000 Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN. Networking: Dell 10 Gb/s switches
I've been struggling with oVirt completely falling apart due to storage related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this.
2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld 2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld 2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld 2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld 2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld 2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld 2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld 2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld
My troubleshooting steps so far:
1. Tailing engine.log for "in problem" and "recovered from problem" 2. Shutting down all the VMs. 3. Shutting down all but one node. 4. Bringing up one node at a time to see what the log reports.
vdsm.log on the node side will help here too.
When only one node is active everything is fine. When a second node comes up, I begin to see the log output as shown above. I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out if it's oVirt or something that I'm doing wrong.
We're close to giving up on oVirt completely because of this.
P.S.
I've tested via bare metal and Proxmox with the Compellent. Not at the same scale but it seems to work fine there.
Do you mind sharing your vdsm.log from the oVirt Node machine?
To get to a console on oVirt Node, press F2 in the TUI. Files: /var/log/vdsm/vdsm*.log
# rpm -qa | grep -i vdsm might help too.
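As a side note, a simple way to bundle the files Douglas asks for into a single attachment (the archive name below is just an example) could be:

# Collect the vdsm logs plus the installed vdsm package list for a bug report.
rpm -qa | grep -i vdsm > /tmp/vdsm-packages.txt
tar czf /tmp/vdsm-logs-$(hostname -s).tar.gz /var/log/vdsm/vdsm*.log /tmp/vdsm-packages.txt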
Hey Chris, please open a bug [1] for this, then we can track it and we can help to identify the issue. I'm not seeing anything suspicious from the logs above. But adding the logs Douglas mentions above to the new bug will help us to get more insight. Greetings fabian --- [1] https://bugzilla.redhat.com/enter_bug.cgi?product=oVirt&component=ovirt-node

vdsm.log on the node side will help here too.
https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log contains only the messages from when a host became unresponsive due to storage issues onward.
# rpm -qa | grep -i vdsm might help too.
vdsm-cli-4.16.14-0.el7.noarch vdsm-reg-4.16.14-0.el7.noarch ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch vdsm-python-zombiereaper-4.16.14-0.el7.noarch vdsm-xmlrpc-4.16.14-0.el7.noarch vdsm-yajsonrpc-4.16.14-0.el7.noarch vdsm-4.16.14-0.el7.x86_64 vdsm-gluster-4.16.14-0.el7.noarch vdsm-hook-ethtool-options-4.16.14-0.el7.noarch vdsm-python-4.16.14-0.el7.noarch vdsm-jsonrpc-4.16.14-0.el7.noarch
Hey Chris,
please open a bug [1] for this, then we can track it and we can help to identify the issue.
I will do so.

----- Original Message -----
From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones@bookit.com> To: users@ovirt.org Sent: Thursday, May 21, 2015 12:49:50 AM Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
vdsm.log on the node side will help here too.
https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log contains only the messages from when a host became unresponsive due to storage issues onward.
According to the log, you have a real issue accessing storage from the host:

[nsoffer@thin untitled (master)]$ repostat vdsm.log
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
    delay      avg: 0.000856 min: 0.000000 max: 0.001168
    last check avg: 11.510000 min: 0.300000 max: 64.100000
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
    delay      avg: 0.008358 min: 0.000000 max: 0.040269
    last check avg: 11.863333 min: 0.300000 max: 63.400000
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
    delay      avg: 0.007793 min: 0.000819 max: 0.041316
    last check avg: 11.466667 min: 0.000000 max: 70.200000
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
    delay      avg: 0.000493 min: 0.000374 max: 0.000698
    last check avg: 4.860000 min: 0.200000 max: 9.900000
domain: b050c455-5ab1-4107-b055-bfcc811195fc
    delay      avg: 0.002080 min: 0.000000 max: 0.040142
    last check avg: 11.830000 min: 0.000000 max: 63.700000
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
    delay      avg: 0.004798 min: 0.000000 max: 0.041006
    last check avg: 18.423333 min: 1.400000 max: 102.900000
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
    delay      avg: 0.001002 min: 0.000000 max: 0.001199
    last check avg: 11.560000 min: 0.300000 max: 61.700000
domain: 20153412-f77a-4944-b252-ff06a78a1d64
    delay      avg: 0.003748 min: 0.000000 max: 0.040903
    last check avg: 12.180000 min: 0.000000 max: 67.200000
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
    delay      avg: 0.000963 min: 0.000000 max: 0.001209
    last check avg: 10.993333 min: 0.000000 max: 64.300000
domain: 0137183b-ea40-49b1-b617-256f47367280
    delay      avg: 0.000881 min: 0.000000 max: 0.001227
    last check avg: 11.086667 min: 0.100000 max: 63.200000

Note the high last check maximum values (e.g. 102 seconds).

Vdsm has a monitor thread for each domain, doing a read from one of the storage domain's special disks every 10 seconds. When we see a high last check value, it means that the monitor thread is stuck reading from the disk.

This is an indicator that VMs may have trouble accessing these storage domains, and engine handles this by making the host non-operational or, if all hosts cannot access the domain, by making the domain inactive.

One of the known issues that can be related is bad multipath configuration. Some storage servers have a bad built-in configuration embedded in multipath, in particular "no_path_retry queue" or "no_path_retry 60". These settings mean that when the SCSI layer fails and multipath does not have any active path, it will queue I/O forever (queue) or retry many times (e.g. 60) before failing the I/O request.

This can lead to a stuck process, doing a read or write that never fails or takes many minutes to fail. Vdsm is not designed to handle such delays - a stuck thread may block other unrelated threads.

Vdsm includes special configuration for your storage vendor (COMPELNT), but maybe it does not match the product (Compellent Vol). See https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/multip...

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    no_path_retry fail
}

Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the built-in setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration instead of the one vdsm configures:

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}

To verify that your devices match this, you can check the device's vendor and product strings in the output of "multipath -ll". I would like to see the output of this command.

Another platform issue is the bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for I/O to complete on one path before failing the I/O request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path.

Multipath is trying to set this value to 5 seconds, but this value reverts to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in RHEL/CentOS 7.2: https://bugzilla.redhat.com/1139038

This issue together with "no_path_retry queue" is a very bad mix for ovirt.

You can fix this timeout by setting:

# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5

And restarting the iscsid service.

With these tweaks, the issue may be resolved.

I hope it helps.
Nir
# rpm -qa | grep -i vdsm might help too.
vdsm-cli-4.16.14-0.el7.noarch vdsm-reg-4.16.14-0.el7.noarch ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch vdsm-python-zombiereaper-4.16.14-0.el7.noarch vdsm-xmlrpc-4.16.14-0.el7.noarch vdsm-yajsonrpc-4.16.14-0.el7.noarch vdsm-4.16.14-0.el7.x86_64 vdsm-gluster-4.16.14-0.el7.noarch vdsm-hook-ethtool-options-4.16.14-0.el7.noarch vdsm-python-4.16.14-0.el7.noarch vdsm-jsonrpc-4.16.14-0.el7.noarch
Hey Chris,
please open a bug [1] for this, then we can track it and we can help to identify the issue.
I will do so.

On 05/20/2015 07:10 PM, Nir Soffer wrote:
----- Original Message -----
From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones@bookit.com> To: users@ovirt.org Sent: Thursday, May 21, 2015 12:49:50 AM Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
vdsm.log on the node side will help here too.
https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log contains only the messages from when a host became unresponsive due to storage issues onward.
According to the log, you have a real issue accessing storage from the host:
[nsoffer@thin untitled (master)]$ repostat vdsm.log
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
    delay      avg: 0.000856 min: 0.000000 max: 0.001168
    last check avg: 11.510000 min: 0.300000 max: 64.100000
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
    delay      avg: 0.008358 min: 0.000000 max: 0.040269
    last check avg: 11.863333 min: 0.300000 max: 63.400000
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
    delay      avg: 0.007793 min: 0.000819 max: 0.041316
    last check avg: 11.466667 min: 0.000000 max: 70.200000
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
    delay      avg: 0.000493 min: 0.000374 max: 0.000698
    last check avg: 4.860000 min: 0.200000 max: 9.900000
domain: b050c455-5ab1-4107-b055-bfcc811195fc
    delay      avg: 0.002080 min: 0.000000 max: 0.040142
    last check avg: 11.830000 min: 0.000000 max: 63.700000
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
    delay      avg: 0.004798 min: 0.000000 max: 0.041006
    last check avg: 18.423333 min: 1.400000 max: 102.900000
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
    delay      avg: 0.001002 min: 0.000000 max: 0.001199
    last check avg: 11.560000 min: 0.300000 max: 61.700000
domain: 20153412-f77a-4944-b252-ff06a78a1d64
    delay      avg: 0.003748 min: 0.000000 max: 0.040903
    last check avg: 12.180000 min: 0.000000 max: 67.200000
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
    delay      avg: 0.000963 min: 0.000000 max: 0.001209
    last check avg: 10.993333 min: 0.000000 max: 64.300000
domain: 0137183b-ea40-49b1-b617-256f47367280
    delay      avg: 0.000881 min: 0.000000 max: 0.001227
    last check avg: 11.086667 min: 0.100000 max: 63.200000
Note the high last check maximum value (e.g. 102 seconds).
Vdsm has a monitor thread for each domain, doing a read from one of the storage domain's special disks every 10 seconds. When we see a high last check value, it means that the monitor thread is stuck reading from the disk.
This is an indicator that VMs may have trouble accessing these storage domains, and engine handles this by making the host non-operational or, if all hosts cannot access the domain, by making the domain inactive.
One of the known issues that can be related is bad multipath configuration. Some storage servers have a bad built-in configuration embedded in multipath, in particular "no_path_retry queue" or "no_path_retry 60". These settings mean that when the SCSI layer fails and multipath does not have any active path, it will queue I/O forever (queue) or retry many times (e.g. 60) before failing the I/O request.
This can lead to a stuck process, doing a read or write that never fails or takes many minutes to fail. Vdsm is not designed to handle such delays - a stuck thread may block other unrelated threads.
Vdsm includes special configuration for your storage vendor (COMPELNT), but maybe it does not match the product (Compellent Vol). See https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/multip...
device {
    vendor "COMPELNT"
    product "Compellent Vol"
    no_path_retry fail
}
Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the built-in setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration instead of the one vdsm configures:
device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}
To verify that your devices match this, you can check the device's vendor and product strings in the output of "multipath -ll". I would like to see the output of this command.
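As an aside (not part of the original reply): one way to try such a device stanza and confirm multipath picks it up, assuming the standard /etc/multipath.conf location and the multipathd interactive CLI, is roughly the following. Note the later discussion in this thread about vdsm overwriting multipath.conf and the "# RHEV PRIVATE" marker.

# After adding the device { ... } stanza to the devices { } section of
# /etc/multipath.conf, ask multipathd to re-read its configuration.
multipathd -k"reconfigure"

# Check the vendor/product strings and the effective settings per LUN.
multipath -ll | grep -i compelnt
multipathd -k"show config" | grep -A 12 COMPELNT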
Another platform issue is the bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for I/O to complete on one path before failing the I/O request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path.
Multipath is trying to set this value to 5 seconds, but this value reverts to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in RHEL/CentOS 7.2: https://bugzilla.redhat.com/1139038
This issue together with "no_path_retry queue" is a very bad mix for ovirt.
You can fix this timeout by setting:
# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5
And restarting the iscsid service.
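A possible way to apply and verify that change on a CentOS 7 host (the verification command is only illustrative; existing sessions may additionally need a re-login to pick up node settings):

# Set the replacement timeout suggested above.
sed -i 's/^node\.session\.timeo\.replacement_timeout *=.*/node.session.timeo.replacement_timeout = 5/' /etc/iscsi/iscsid.conf
systemctl restart iscsid

# Look for "Recovery Timeout" in the session details.
iscsiadm -m session -P 3 | grep -i timeout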
Chris, as you are using ovirt-node, after Nir's suggestions please also execute the command below to save the settings changes across reboots:
# persist /etc/iscsi/iscsid.conf
With these tweaks, the issue may be resolved.
I hope it helps.
Nir
# rpm -qa | grep -i vdsm might help too.
vdsm-cli-4.16.14-0.el7.noarch vdsm-reg-4.16.14-0.el7.noarch ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch vdsm-python-zombiereaper-4.16.14-0.el7.noarch vdsm-xmlrpc-4.16.14-0.el7.noarch vdsm-yajsonrpc-4.16.14-0.el7.noarch vdsm-4.16.14-0.el7.x86_64 vdsm-gluster-4.16.14-0.el7.noarch vdsm-hook-ethtool-options-4.16.14-0.el7.noarch vdsm-python-4.16.14-0.el7.noarch vdsm-jsonrpc-4.16.14-0.el7.noarch
Hey Chris,
please open a bug [1] for this, then we can track it and we can help to identify the issue.
I will do so.
-- Cheers Douglas

Chris, as you are using ovirt-node, after Nir's suggestions please also execute the command below to save the settings changes across reboots:
# persist /etc/iscsi/iscsid.conf
Thanks. I will do so, but first I have to resolve not being able to update multipath.conf as described in my previous email.

Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the built-in setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration instead of the one vdsm configures:
device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}
I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal, but when I tried updating it, oVirt instantly overwrites it. To be clear, yes I know changes do not survive reboots and yes I know about persist, but it changes it while running. Live! Persist won't help there.

I also tried building a CentOS 7 "thick client" where I set up CentOS 7 first, added the oVirt repo, then let the engine provision it. Same problem with multipath.conf being overwritten with the default oVirt setup.

So I tried to be slick about it. I made multipath.conf immutable. That prevented the engine from being able to activate the node. It would fail on a vds command that gets the node's capabilities; part of what it does is read and then overwrite multipath.conf.

How do I safely update multipath.conf?
To verify that your devices match this, you can check the devices vendor and procut strings in the output of "multipath -ll". I would like to see the output of this command.
multipath -ll (default setup) can be seen here. http://paste.linux-help.org/view/430c7538
Another platform issue is the bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for I/O to complete on one path before failing the I/O request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path.
Multipath is trying to set this value to 5 seconds, but this value reverts to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in RHEL/CentOS 7.2: https://bugzilla.redhat.com/1139038
This issue together with "no_path_retry queue" is a very bad mix for ovirt.
You can fix this timeout by setting:
# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5
I'll see if that's possible with persist. Will this change survive node upgrades? Thanks for the reply and the suggestions.

On 05/21/2015 02:47 AM, Chris Jones - BookIt.com Systems Administrator wrote:
I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal but when I tried updating it, oVirt instantly overwrites it. To be clear, yes I know changes do not survive reboots and yes I know about persist, but it changes it while running. Live! Persist won't help there.
I also tried building a CentOS 7 "thick client" where I set up CentOS 7 first, added the oVirt repo, then let the engine provision it. Same problem with multipath.conf being overwritten with the default oVirt setup.
So I tried to be slick about it. I made the multipath.conf immutable. That prevented the engine from being able to activate the node. It would fail on a vds command that gets the nodes capabilities and part of what it does is reads then overwrites multipath.conf.
How do I safely update multipath.conf?

Somehow the multipath.conf that oVirt generates forces my HDD RAID controller disks to change from /dev/sdb* and /dev/sdc*. So I had to blacklist these.

I was able to persist it by adding "# RHEV PRIVATE" right below the "# RHEV REVISION 1.1"

Hope this helps

Met vriendelijke groet, With kind regards,

Jorick Astrego
Netbulae Virtualization Experts
Tel: 053 20 30 270   info@netbulae.eu   Staalsteden 4-3A   KvK 08198180
Fax: 053 20 30 271   www.netbulae.eu    7547 TA Enschede   BTW NL821234584B01

On 21.05.2015 02:48, Chris Jones - BookIt.com Systems Administrator wrote:
Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the built-in setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration instead of the one vdsm configures:
device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}
I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal
I have this issue also. I am thinking about opening a BZ ;)
but when I tried updating it, oVirt
instantly overwrites it. To be clear, yes I know changes do not survive reboots and yes I know about persist, but it changes it while running. Live! Persist won't help there.
I also tried building a CentOS 7 "thick client" where I set up CentOS 7 first, added the oVirt repo, then let the engine provision it. Same problem with multipath.conf being overwritten with the default oVirt setup.
So I tried to be slick about it. I made the multipath.conf immutable. That prevented the engine from being able to activate the node. It would fail on a vds command that gets the nodes capabilities and part of what it does is reads then overwrites multipath.conf.
How do I safely update multipath.conf?
In the second line of your multipath.conf, add:

# RHEV PRIVATE

Then host deploy will ignore it and never change it.
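Putting the pieces from this thread together, the top of a hand-maintained /etc/multipath.conf on these hosts would look roughly like this; the revision line must match whatever vdsm actually wrote on the host, and the device stanza is the one Nir suggested earlier:

# RHEV REVISION 1.1
# RHEV PRIVATE

devices {
    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy "multibus"
        path_checker "tur"
        features "0"
        hardware_handler "0"
        prio "const"
        failback "immediate"
        rr_weight "uniform"
        no_path_retry fail
    }
}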
To verify that your devices match this, you can check the devices vendor and procut strings in the output of "multipath -ll". I would like to see the output of this command.
multipath -ll (default setup) can be seen here. http://paste.linux-help.org/view/430c7538
Another platform issue is the bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for I/O to complete on one path before failing the I/O request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path.
Multipath is trying to set this value to 5 seconds, but this value reverts to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in RHEL/CentOS 7.2: https://bugzilla.redhat.com/1139038
This issue together with "no_path_retry queue" is a very bad mix for ovirt.
You can fix this timeout by setting:
# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5
I'll see if that's possible with persist. Will this change survive node upgrades?
Thanks for the reply and the suggestions.
-- Daniel Helgenberger m box bewegtbild GmbH P: +49/30/2408781-22 F: +49/30/2408781-10 ACKERSTR. 19 D-10115 BERLIN www.m-box.de www.monkeymen.tv Geschäftsführer: Martin Retschitzegger / Michaela Göllner Handeslregister: Amtsgericht Charlottenburg / HRB 112767

I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.

I'm still seeing the domain "in problem" and "recovered from problem" warnings in engine.log, though. They were happening only when hosts were activating and when I was mass launching many VMs. Is this normal?

2015-05-21 15:31:32,264 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c2.ism.ld
2015-05-21 15:31:47,468 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c2.ism.ld

Here's the vdsm log from a node the engine was warning about: https://www.dropbox.com/s/yaubaxax1w499f1/vdsm2.log.gz?dl=1. It's trimmed to just before and after it happened.

What is that repostat command from your previous email, Nir ("repostat vdsm.log")? I don't see it on the engine or the node. Is it used to parse the log? Where can I find it?

Thanks again.

On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator wrote:
I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.
I take it back. This did not solve the issue. I tried batch starting the VMs and half the nodes went down due to the same storage issues. VDSM Logs again. https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1

----- Original Message -----
On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator wrote:
I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.
I take it back. This did not solve the issue. I tried batch starting the VMs and half the nodes went down due to the same storage issues. VDSM
Is there maybe some IO problem on the iSCSI target side? IIUIC the problem is some timeout, which could indicate that the target is overloaded. But maybe I'm getting something wrong ... - fabian
Logs again. https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1
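On Fabian's question about the target side: a generic way to see whether the latency shows up at the block-device level on the hosts while the problem is reproduced (sysstat's iostat assumed installed; this is only an illustration, not something from the thread) is:

# Sustained high await on the dm-* multipath devices under load points at
# the target or the fabric rather than at vdsm itself.
iostat -xmt 2 | grep -E "Device|^dm-"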

Is there maybe some IO problem on the iSCSI target side? IIUIC the problem is some timeout, which could indicate that the target is overloaded.
Maybe. I need to check with Dell. I did manage to get it to be a little more stable with this config.

defaults {
    polling_interval 10
    path_selector "round-robin 0"
    path_grouping_policy multibus
    getuid_callout "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    path_checker readsector0
    rr_min_io_rq 100
    max_fds 8192
    rr_weight priorities
    failback immediate
    no_path_retry fail
    user_friendly_names no
}

devices {
    device {
        vendor COMPELNT
        product "Compellent Vol"
        path_checker tur
        no_path_retry fail
    }
}

I referenced it from http://en.community.dell.com/techcenter/enterprise-solutions/w/oracle_soluti.... I modified it a bit since that is Red Hat 5 specific and there have been some changes.

It's not crashing anymore but I'm still seeing storage warnings in engine.log. I'm going to be enabling jumbo frames and talking with Dell to figure out if it's something on the Compellent side. I'll update here once I find something out.

Thanks again for all the help.
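Since jumbo frames come up as the next step: one quick end-to-end MTU check, with placeholder interface name and portal address, is:

# Placeholders: eth2 = storage NIC, 10.0.0.10 = a Compellent iSCSI portal IP.
ip link set dev eth2 mtu 9000
# 8972 = 9000 minus 28 bytes of IP/ICMP headers; -M do forbids fragmentation,
# so the ping only succeeds if every hop really passes 9000-byte frames.
ping -M do -s 8972 -c 3 10.0.0.10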

----- Original Message -----
From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones@bookit.com> To: users@ovirt.org Sent: Friday, May 22, 2015 8:55:37 PM Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Is there maybe some IO problem on the iSCSI target side? IIUIC the problem is some timeout, which could indicate that the target is overloaded.
Maybe. I need to check with Dell. I did manage to get it to be a little more stable with this config.
defaults {
    polling_interval 10
    path_selector "round-robin 0"
    path_grouping_policy multibus
    getuid_callout "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    path_checker readsector0
    rr_min_io_rq 100
    max_fds 8192
    rr_weight priorities
    failback immediate
    no_path_retry fail
    user_friendly_names no
You should keep the defaults section unchanged, and add specific settings under the device section.
}
devices {
    device {
        vendor COMPELNT
        product "Compellent Vol"
        path_checker tur
        no_path_retry fail
This is most likely missing some settings. You are *not* getting the settings from the "defaults" section above. For example, since you did not specify "failback immediate" here, failback for this device defaults to whatever default multipath chooses, not the value set in "defaults" above.
    }
}
I referenced it from http://en.community.dell.com/techcenter/enterprise-solutions/w/oracle_soluti.... I modified it a bit since that is Red Hat 5 specific and there have been some changes.
It's not crashing anymore but I'm still seeing storage warnings in engine.log. I'm going to be enabling jumbo frames and talking with Dell to figure out if it's something on the Compellent side. I'll update here once I find something out.
Let's continue this on Bugzilla. See also this patch: https://gerrit.ovirt.org/41244
Nir
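To illustrate Nir's point above about device-section settings not inheriting from "defaults": a merged stanza built only from values already quoted in this thread (whether these are the right values for an SC8000 is exactly the open question) would look something like:

devices {
    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy multibus
        path_checker tur
        failback immediate
        rr_weight priorities
        rr_min_io_rq 100
        no_path_retry fail
    }
}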


----- Original Message -----
From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones@bookit.com> To: users@ovirt.org Sent: Friday, May 22, 2015 12:32:01 AM Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator wrote:
I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.
I take it back. This did not solve the issue. I tried batch starting the VMs and half the nodes went down due to the same storage issues. VDSM Logs again. https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1
It is possible that the multipath configuration I suggested is not optimized correctly for your server, or that it is too old (last updated in 2013). Or you have some issues in the network or the storage server. I would continue with the storage vendor.
Nir

----- Original Message -----
From: "Chris Jones - BookIt.com Systems Administrator" <chris.jones@bookit.com> To: users@ovirt.org Sent: Thursday, May 21, 2015 10:49:23 PM Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.
I'm still seeing the domain "in problem" and "recovered" from problem warnings in engine.log, though. They were happening only when hosts were activating and when I was mass launching many VMs. Is this normal?
2015-05-21 15:31:32,264 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c2.ism.ld
2015-05-21 15:31:47,468 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c2.ism.ld
Here's the vdsm log from a node the engine was warning about https://www.dropbox.com/s/yaubaxax1w499f1/vdsm2.log.gz?dl=1. It's trimmed to just before and after it happened.
What is that repostat command from your previous email, Nir ("repostat vdsm.log")? I don't see it on the engine or the node. Is it used to parse the log? Where can I find it?
It is available here: https://gerrit.ovirt.org/38749
Nir
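As a stopgap until that script is fetched, something like the following pulls the lastCheck values out of vdsm.log; it assumes the repoStats results are logged in the usual 'lastCheck': 'N.N' form, so treat it as a rough sketch rather than a replacement for repostat:

# Print the ten worst lastCheck values recorded in the log.
grep -o "'lastCheck': '[0-9.]*'" /var/log/vdsm/vdsm.log | grep -o '[0-9.]*' | sort -rn | head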

Hi Chris,

I have an oVirt + Dell Compellent setup similar to yours (previous model, not SC8000) and sometimes I faced issues similar to yours.

From my experience I can advise you to:

A) Check links between SAN and servers, all paths, all configuration, cabling. Everything should be set up correctly (all redundant paths green, server mappings, etc.) BEFORE installing oVirt. We had a running KVM environment before "upgrading" it to oVirt 3.5.1.

B) Also check that fencing is working both manually and automatically (connections to iDRAC etc). This is a kind of pre-requisite to have HA working.

C) I also noticed that when something is not going well on one of the shared storages, this brings down the whole cluster (VMs run, but a lot of headaches ensue). First of all, note that oVirt tries to stabilize the situation itself for as long as ~15 minutes or more. It is slow in re-fencing etc. Sometimes it enters a loop and you have to locate the problematic storage. You want to check that multipath on every server is working correctly.

If you are having problems with just two nodes, I guess something is not really OK at the configuration level. I have 2 clusters, 12 hosts and several (lots of) shared storage domains working, and usually when something goes wrong it is because of a human error (like when I deleted the LUN on the SAN before destroying the storage on the oVirt interface). On the other hand, I have the overall impression that the system is not forgiving at all and that it is far from being rock solid.

Cheers
AG
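On point B above (testing fencing by hand): a generic check that a host's iDRAC answers IPMI from the network the engine uses, with placeholder address and credentials, is:

# Placeholders: 10.0.1.21 = an iDRAC IP, root/calvin = example credentials.
# A correct "chassis power status" reply is a good sign that oVirt power
# management (which drives the iDRAC over IPMI) can fence this host.
ipmitool -I lanplus -H 10.0.1.21 -U root -P calvin chassis power status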

Sorry for the delay on this. I am in the process of reproducing the error to get the logs.
On 05/19/2015 07:31 PM, Douglas Schilling Landgraf wrote:
Hello Chris,
On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator wrote:
Engine: oVirt Engine Version: 3.5.2-1.el7.centos Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos Remote storage: Dell Compellent SC8000 Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN. Networking: Dell 10 Gb/s switches
I've been struggling with oVirt completely falling apart due to storage related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this.
2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld 2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld 2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld 2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld 2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld 2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld 2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld 2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld 2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. vds: blade4c2.ism.ld 2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld
My troubleshooting steps so far:
1. Tailing engine.log for "in problem" and "recovered from problem" 2. Shutting down all the VMs. 3. Shutting down all but one node. 4. Bringing up one node at a time to see what the log reports.
vdsm.log on the node side will help here too.
When only one node is active everything is fine. When a second node comes up, I begin to see the log output as shown above. I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out if it's oVirt or something that I'm doing wrong.
We're close to giving up on oVirt completely because of this.
P.S.
I've tested via bare metal and Proxmox with the Compellent. Not at the same scale but it seems to work fine there.
Do you mind sharing your vdsm.log from the oVirt Node machine?
To get to a console on oVirt Node, press F2 in the TUI. Files: /var/log/vdsm/vdsm*.log
# rpm -qa | grep -i vdsm might help too.
Thanks!

Since this thread shows up at the top of the search "oVirt compellent", I should mention that this has been solved. The problem was a bad disk in the Compellent's tier 2 storage. The multipath.conf and iscsi.conf advice is still valid, though, and made oVirt more resilient when the Compellent was struggling.
participants (7)
- Andrea Ghelardi
- Chris Jones - BookIt.com Systems Administrator
- Daniel Helgenberger
- Douglas Schilling Landgraf
- Fabian Deutsch
- Jorick Astrego
- Nir Soffer