[Users] Storage unresponsive after sanlock


Trey Dockendorf

27 Jan 2014

5:12 p.m.

I set up my first oVirt instance since 3.0 a few days ago and it went very well. I left the single-host cluster running with one VM over the weekend. Today I came back and the primary data storage is marked as unresponsive. The logs are full of entries [1] that look very similar to those in a knowledge base article on Red Hat's website [2].

This setup uses NFS over RDMA, and so far the ib interfaces report no errors (via `ibcheckerrs -v <LID> 1`). Based on a doc on the oVirt site [3], it seems this could be due to response problems. The storage system is a new purchase and not yet in production, so any advice on how to track down the cause would be very helpful. Please let me know what additional information would be useful, as it's been about a year since I've been active in the oVirt community.

Thanks - Trey

[1]: http://pastebin.com/yRpSLKxJ
[2]: https://access.redhat.com/site/solutions/400463
[3]: http://www.ovirt.org/SANLock


Maor Lipchuk

28 Jan

9:45 a.m.

Hi Trey,

Can you please also attach the engine and VDSM logs?

Thanks, Maor

Trey Dockendorf

29 Jan

1:50 a.m.

See attached. The event seems to have begun around 06:00:00 on 2014-01-26. I was unable to get the single-node cluster back online, so I provisioned another node and added it to the cluster, where it became the SPM. Adding the second node worked, but I had to power cycle the node that hung because sanlock was in a zombie state.

This is my first attempt at production use of NFS over RDMA and I'd like to rule that out as the cause. Since the issue I've changed 'nfs_mount_options' in /etc/vdsm/vdsm.conf to 'soft,nosharecache,rdma,port=20049'. The options during the crash were only 'rdma,port=20049'. I am also forcing NFSv3 by setting 'Nfsvers=3' in /etc/nfsmount.conf; that setting was in place during the crash and still is.

Thanks - Trey
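Editor's sketch: the mount option change described in the message above can be made explicit by parsing the option string and comparing it with the options that were in effect during the hang. This is only an illustration. The `[irs]` section name, the sample file content, and the helper `parse_mount_options` are assumptions, not taken from the thread; only the two option strings come from the message.

```python
import configparser

# Hypothetical vdsm.conf excerpt. The [irs] section name is an assumption;
# the option string is the one quoted in the message above.
SAMPLE_VDSM_CONF = """
[irs]
nfs_mount_options = soft,nosharecache,rdma,port=20049
"""


def parse_mount_options(conf_text, section="irs", key="nfs_mount_options"):
    """Return the NFS mount options as a dict; bare flags map to True."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    raw = parser.get(section, key, fallback="")
    options = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        options[name] = value if sep else True
    return options


# Options reported as active during the hang: 'rdma,port=20049'.
before = {"rdma": True, "port": "20049"}
after = parse_mount_options(SAMPLE_VDSM_CONF)

# Options that were added after the incident.
added = sorted(set(after) - set(before))
print(added)  # ['nosharecache', 'soft']
```

Listing the difference this way makes it easier to see which variable changed between the failing and the current configuration when trying to reproduce the problem.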

Maor Lipchuk

11:33 a.m.

The VDSM log seems to be from the 26th, but from the engine logs it seems the incident occurred on the 24th, so I can't really see what happened in VDSM at that time.

From the engine logs it seems that at around 2014-01-24 16:59 the master storage domain was in maintenance and there was then an attempt to activate it, but VDSM threw an exception saying it could not find the master domain, with the arguments spUUID=5849b030-626e-47cb-ad90-3ce782d831b3, msdUUID=7c49750d-7eae-4cd2-9b63-1dc71f357b88.

This could happen for various reasons, for example a failure to connect to the storage (see https://bugzilla.redhat.com/782864).

Since you mentioned that things worked once you added a second node, it seems the origin of the problem is in the host itself.

What are the differences between the two hosts (VDSM version, OS version)? Has the first host worked successfully in another DC? Have you tried to reinstall it?

Regards, Maor

Trey Dockendorf

5:16 p.m.

On Wed, Jan 29, 2014 at 4:33 AM, Maor Lipchuk <mlipchuk@redhat.com> wrote:

> The VDSM log seems to be from the 26th, but from the engine logs it seems the incident occurred on the 24th, so I can't really see what happened in VDSM at that time.
>
> From the engine logs it seems that at around 2014-01-24 16:59 the master storage domain was in maintenance and there was then an attempt to activate it, but VDSM threw an exception saying it could not find the master domain, with the arguments spUUID=5849b030-626e-47cb-ad90-3ce782d831b3, msdUUID=7c49750d-7eae-4cd2-9b63-1dc71f357b88.
>
> This could happen for various reasons, for example a failure to connect to the storage (see https://bugzilla.redhat.com/782864).

Some errors on my part that occurred before the sanlock issue were having all the NFS exports share the same "fsid" and initial failures to correctly pass custom NFS options to VDSM. The sanlock issue was not present as late as 18:00 on 2014-01-24, as I was still working in the web interface at that time and saw no issues.

> Since you mentioned that things worked once you added a second node, it seems the origin of the problem is in the host itself.
>
> What are the differences between the two hosts (VDSM version, OS version)?

There should be no differences. They are identical hardware, provisioned and configured using Puppet.

* vdsm-4.13.3-2.el6.x86_64
* OS is CentOS 6.5 - 2.6.32-431.3.1.el6.x86_64

> Has the first host worked successfully in another DC?

I only have the default DC defined. Would it be worth setting up another DC for the sake of troubleshooting?

> Have you tried to reinstall it?

Not yet. The install process is automated, as is the configuration, so whatever issues I'm running into SHOULD be present after a re-install. If there is a possibility that a fresh install could somehow fix this, I can re-provision.

I just noticed that the 2nd host (vm02) added to the default cluster has become Non Operational, and the VM on that host failed to migrate to the 1st host (vm01), which became SPM and is marked as "Up". The logs on vm02 are full of sanlock messages. What concerns me is that the VM I have running for testing is non-responsive and vm01 shows messages such as "Time out during operation: cannot acquire state change lock".

I can't yet pinpoint when the failure occurred. To avoid sending 3 days' worth of logs from 3 hosts, I'll reset everything and try to reproduce this with some monitoring in place to get an approximate timestamp for the failure.

Thanks - Trey
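Editor's sketch: the duplicate "fsid" mistake mentioned in the message above is easy to check for mechanically. The snippet below scans text in the usual /etc/exports layout (path first, then client specifications with parenthesised options) and reports any fsid value used by more than one export path. The sample export lines and the helper name `duplicate_fsids` are hypothetical, and the parser deliberately ignores quoted paths and backslash line continuations.

```python
import re
from collections import defaultdict

# Matches fsid=<value> inside an export option list.
FSID_RE = re.compile(r"fsid=([^,)\s]+)")


def duplicate_fsids(exports_text):
    """Map each fsid used by more than one export path to those paths."""
    paths_by_fsid = defaultdict(set)
    for line in exports_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        path = line.split()[0]
        for fsid in FSID_RE.findall(line):
            paths_by_fsid[fsid].add(path)
    return {
        fsid: sorted(paths)
        for fsid, paths in paths_by_fsid.items()
        if len(paths) > 1
    }


# Hypothetical exports file in which two exports share fsid=0.
SAMPLE_EXPORTS = """
# hypothetical exports
/exports/data    10.0.0.0/24(rw,fsid=0,no_root_squash)
/exports/iso     10.0.0.0/24(rw,fsid=0,no_root_squash)
/exports/export  10.0.0.0/24(rw,fsid=2)
"""

print(duplicate_fsids(SAMPLE_EXPORTS))
# {'0': ['/exports/data', '/exports/iso']}
```

An empty result means every export path has its own fsid value.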

Trey Dockendorf

10:17 p.m.

On Wed, Jan 29, 2014 at 4:33 AM, Maor Lipchuk <mlipchuk@redhat.com> wrote:

> The VDSM log seems to be from the 26th, but from the engine logs it seems the incident occurred on the 24th, so I can't really see what happened in VDSM at that time.
>
> From the engine logs it seems that at around 2014-01-24 16:59 the master storage domain was in maintenance and there was then an attempt to activate it, but VDSM threw an exception saying it could not find the master domain, with the arguments spUUID=5849b030-626e-47cb-ad90-3ce782d831b3, msdUUID=7c49750d-7eae-4cd2-9b63-1dc71f357b88.

The actual error was higher in the logs after I tried activating this host. Puppet had removed the unmanaged /etc/sudoers.d/50_vdsm file, and that prevented vdsm from executing any mount commands. The issues with vm02 are likely all due to that mistake on my part. My apologies.

- Trey
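Editor's sketch: since the root cause for vm02 was a file that configuration management had removed, a small pre-flight check run before activating a host can catch the same mistake early. The helper `missing_paths` and the idea of a required-files list are illustrative assumptions; only the path /etc/sudoers.d/50_vdsm comes from the message above.

```python
import os


def missing_paths(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


# Path taken from the message above; extend the list as needed.
REQUIRED_FILES = ["/etc/sudoers.d/50_vdsm"]

if __name__ == "__main__":
    for path in missing_paths(REQUIRED_FILES):
        print("missing:", path)
```

The check prints nothing when every listed file is present, so it can be run from a monitoring job or after each Puppet run.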
