[Users] Storage unresponsive after sanlock

Trey Dockendorf treydock at gmail.com
Wed Jan 29 21:17:56 UTC 2014


On Wed, Jan 29, 2014 at 4:33 AM, Maor Lipchuk <mlipchuk at redhat.com> wrote:
> The VDSM log seems to be from the 26th, and from the engine logs it seems
> that the incident occurred on the 24th, so I can't really see what
> happened in VDSM at that time.
>
> From the engine logs it seems that at around 2014-01-24 16:59 the master
> storage domain was in maintenance and then there was an attempt to
> activate it, but VDSM threw an exception that it cannot find master
> domain with the arguments of
> spUUID=5849b030-626e-47cb-ad90-3ce782d831b3,
> msdUUID=7c49750d-7eae-4cd2-9b63-1dc71f357b88'
>

The actual error was higher in the logs, after I tried activating this
host. Puppet had removed the unmanaged /etc/sudoers.d/50_vdsm file, which
prevented vdsm from executing any mount commands.  The issues with vm02
are likely all due to that mistake on my part.  My apologies.
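
For anyone who hits the same thing, a check along these lines might catch
a missing file before vdsm starts failing mounts. It is only a sketch: the
helper name `missing_paths` and the `REQUIRED_FILES` list are made up for
illustration, and the only path taken from this thread is
/etc/sudoers.d/50_vdsm.

```python
import os

# Files expected to exist on the host. Only the sudoers drop-in below comes
# from this thread; anything else you add is your own assumption.
REQUIRED_FILES = ["/etc/sudoers.d/50_vdsm"]


def missing_paths(paths):
    """Return the entries of `paths` that are not existing regular files."""
    return [p for p in paths if not os.path.isfile(p)]


if __name__ == "__main__":
    # Print one line per missing file; print nothing when all are present.
    for path in missing_paths(REQUIRED_FILES):
        print("missing: %s" % path)
```

Run on the host (from cron or after a Puppet run), it prints nothing when
the file is present.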

- Trey

> This could happen for various reasons, for example a failure in
> connecting the storage (for example see https://bugzilla.redhat.com/782864)
>
> Since you mentioned that it worked once you added a second node,
> it seems the origin of the problem is in the host itself.
>
> What are the differences between the two hosts (VDSM version, OS version)?
> Does the first host work in another DC?
> Have you tried to reinstall it?
>
> Regards,
> Maor
>
> On 01/29/2014 02:50 AM, Trey Dockendorf wrote:
>> See attached.  The event seems to have begun around 06:00:00 on
>> 2014-01-26.  I was unable to get the single node cluster back online
>> so I provisioned another node to add to the cluster, which became the
>> SPM.  Adding the second node worked and I had to power cycle the node
>> that hung as sanlock was in a zombie state.  This is my first attempt
>> at production use of NFS over RDMA and I'd like to rule out that being
>> the cause.  Since the issue I've changed the 'nfs_mount_options' in
>> /etc/vdsm/vdsm.conf to 'soft,nosharecache,rdma,port=20049'.  The
>> options during the crash were only 'rdma,port=20049'.  I am also
>> forcing NFSv3 by setting 'Nfsvers=3' in /etc/nfsmount.conf, which is
>> still in place and was in place during the crash.
>>
>> Thanks
>> - Trey
>>
>> On Tue, Jan 28, 2014 at 2:45 AM, Maor Lipchuk <mlipchuk at redhat.com> wrote:
>>> Hi Trey,
>>>
>>> Can you please also attach the engine/vdsm logs?
>>>
>>> Thanks,
>>> Maor
>>>
>>> On 01/27/2014 06:12 PM, Trey Dockendorf wrote:
>>>> I set up my first oVirt instance since 3.0 a few days ago and it went
>>>> very well, and I left the single host cluster running with 1 VM over
>>>> the weekend.  Today I came back and the primary data storage is marked
>>>> as unresponsive.  The logs are full of entries [1] that look very
>>>> similar to a knowledge base article on RHEL's website [2].
>>>>
>>>> This setup is using NFS over RDMA and so far the ib interfaces report
>>>> no errors (via `ibcheckerrs -v <LID> 1`).  Based on a doc on the oVirt
>>>> site [3], it seems this could be due to response problems.  The storage
>>>> system is a new purchase and not yet in production so if there's any
>>>> advice on how to track down the cause that would be very helpful.
>>>> Please let me know what additional information would be helpful as
>>>> it's been about a year since I've been active in the oVirt community.
>>>>
>>>> Thanks
>>>> - Trey
>>>>
>>>> [1]: http://pastebin.com/yRpSLKxJ
>>>>
>>>> [2]: https://access.redhat.com/site/solutions/400463
>>>>
>>>> [3]: http://www.ovirt.org/SANLock
>>>
>>
>


