Hello Chris,
On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator wrote:
> Engine: oVirt Engine Version: 3.5.2-1.el7.centos
> Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
> Remote storage: Dell Compellent SC8000
> Storage setup: 2 nics connected to the Compellent. Several domains
> backed by LUNs. Several VM disks use direct LUNs.
> Networking: Dell 10 Gb/s switches
>
> I've been struggling with oVirt completely falling apart due to storage
> related issues. By falling apart I mean most to all of the nodes
> suddenly losing contact with the storage domains. This results in an
> endless loop of the VMs on the failed nodes trying to be migrated and
> remigrated as the nodes flap between responsive and unresponsive. During
> these times, engine.log looks like this.
>
> 2015-05-19 03:09:42,443 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-50) domain
> c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds:
> blade6c1.ism.ld
> 2015-05-19 03:09:42,560 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-38) domain
> 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds:
> blade2c1.ism.ld
> 2015-05-19 03:09:45,497 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-24) domain
> 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds:
> blade3c2.ism.ld
> 2015-05-19 03:09:51,713 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-46) domain
> b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds:
> blade4c2.ism.ld
> 2015-05-19 03:09:57,647 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-13) Domain
> c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
> vds: blade6c1.ism.ld
> 2015-05-19 03:09:57,782 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-6) domain
> 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds:
> blade2c1.ism.ld
> 2015-05-19 03:09:57,783 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-6) Domain
> 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem.
> vds: blade2c1.ism.ld
> 2015-05-19 03:10:00,639 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-31) Domain
> c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
> vds: blade4c1.ism.ld
> 2015-05-19 03:10:00,703 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-17) domain
> 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds:
> blade1c1.ism.ld
> 2015-05-19 03:10:00,712 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-4) Domain
> 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem.
> vds: blade3c2.ism.ld
> 2015-05-19 03:10:06,931 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-48) Domain
> 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem.
> vds: blade4c2.ism.ld
> 2015-05-19 03:10:06,931 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-48) Domain
> 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from
> problem. No active host in the DC is reporting it as problematic, so
> clearing the domain recovery timer.
> 2015-05-19 03:10:06,932 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-48) Domain
> b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem.
> vds: blade4c2.ism.ld
> 2015-05-19 03:10:06,933 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-48) Domain
> b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from
> problem. No active host in the DC is reporting it as problematic, so
> clearing the domain recovery timer.
> 2015-05-19 03:10:09,929 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData]
> (org.ovirt.thread.pool-8-thread-16) domain
> b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds:
> blade3c1.ism.ld
>
>
> My troubleshooting steps so far:
>
> 1. Tailing engine.log for "in problem" and "recovered from problem"
> 2. Shutting down all the VMs.
> 3. Shutting down all but one node.
> 4. Bringing up one node at a time to see what the log reports.
The vdsm.log on the node side will help here too.
> When only one node is active everything is fine. When a second node
> comes up, I begin to see the log output as shown above. I've been
> struggling with this for over a month. I'm sure others have used oVirt
> with a Compellent and encountered (and worked around) similar problems.
> I'm looking for some help in figuring out if it's oVirt or something
> that I'm doing wrong.
>
> We're close to giving up on oVirt completely because of this.
>
> P.S.
>
> I've tested via bare metal and Proxmox with the Compellent. Not at the
> same scale but it seems to work fine there.
Would you mind sharing the vdsm.log from the oVirt Node machine?
To get to the console on oVirt Node, press F2 in the TUI.
Files: /var/log/vdsm/vdsm*.log
# rpm -qa | grep -i vdsm
might help too.
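
While gathering those, the "in problem" / "recovered from problem" tailing from step 1 could also be condensed into per-host, per-domain counts, which may make it easier to see whether particular nodes or domains flap more than others. The following is only a rough sketch, not part of oVirt: the `summarize` helper and the regular expression are made up for illustration, based solely on the IrsProxyData lines quoted above, and it assumes each entry sits on a single line in the real engine.log (the mail client wrapped them here).

```python
import re
import sys
from collections import Counter

# Pattern derived from the IrsProxyData messages quoted in this thread.
# Assumption: in the real engine.log each entry is on one line.
EVENT = re.compile(
    r"[Dd]omain (?P<uuid>[0-9a-f-]{36}):(?P<name>\S+) "
    r"(?P<state>in problem|recovered from problem)\. vds: (?P<host>\S+)"
)


def summarize(lines):
    """Count 'in problem' and 'recovered from problem' events per (host, domain, state)."""
    counts = Counter()
    for line in lines:
        match = EVENT.search(line)
        if match:
            counts[(match["host"], match["name"], match["state"])] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python flapcount.py /path/to/engine.log
    with open(sys.argv[1], errors="replace") as log:
        for (host, name, state), n in sorted(summarize(log).items()):
            print(f"{n:6d}  {host}  {name}  {state}")
```

The "has recovered from problem. No active host in the DC..." summary lines are skipped on purpose, since they do not name a host. On a default engine install the log is usually /var/log/ovirt-engine/engine.log, but please check the path on your setup.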
Hey Chris,
please open a bug [1] for this so that we can track it and help
identify the issue.
I'm not seeing anything suspicious in the logs above, but adding the
logs Douglas mentions to the new bug will give us more insight.
Greetings
fabian
---
[1]