[ovirt-users] VMs freezing during heals
Jorick Astrego
j.astrego at netbulae.eu
Sat Apr 4 13:57:32 UTC 2015
On 04/03/2015 10:04 PM, Alastair Neil wrote:
> Any follow-up on this?
>
> Are there known issues using a replica 3 gluster datastore with LVM
> thin-provisioned bricks?
>
> On 20 March 2015 at 15:22, Alastair Neil <ajneil.tech at gmail.com> wrote:
>
> CentOS 6.6
>
>
> vdsm-4.16.10-8.gitc937927.el6
> glusterfs-3.6.2-1.el6
> 2.6.32-504.8.1.el6.x86_64
>
>
> Moved to 3.6 specifically to get the snapshotting feature, hence
> my desire to migrate to thinly provisioned LVM bricks.
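For readers following along: a thinly provisioned brick of the kind being discussed is normally built on an LVM thin pool, which is also what the gluster volume snapshot feature requires. A minimal sketch, assuming an existing volume group; the vg_bricks, thinpool and brick1 names below are only illustrative:

    lvcreate -L 200G -T vg_bricks/thinpool            # create a thin pool inside the volume group
    lvcreate -V 50G -T vg_bricks/thinpool -n brick1   # thin LV for the brick; space is allocated on demand
    mkfs.xfs -i size=512 /dev/vg_bricks/brick1        # XFS with 512-byte inodes, as commonly used for gluster bricks
    mkdir -p /bricks/brick1
    mount /dev/vg_bricks/brick1 /bricks/brick1        # the brick directory then lives on this mount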
>
Well, on the glusterfs mailing list there have been discussions:
> 3.6.2 is a major release and introduces some new cluster-wide features.
> Additionally, it is not stable yet.
>
>
> On 20 March 2015 at 14:57, Darrell Budic <budic at onholyground.com> wrote:
>
> What version of gluster are you running on these?
>
> I’ve seen high load during heals bounce my hosted engine
> around due to overall system load, but never pause anything
> else. CentOS 7 combo storage/host systems, gluster 3.5.2.
>
>
>> On Mar 20, 2015, at 9:57 AM, Alastair Neil <ajneil.tech at gmail.com> wrote:
>>
>> Pranith
>>
>> I have run a pretty straightforward test. I created a two-brick
>> 50 GB replica volume with normal LVM bricks, and installed two
>> servers, one CentOS 6.6 and one CentOS 7.0. I kicked off bonnie++
>> on both to generate some file system activity and then made the
>> volume replica 3. I saw no issues on the servers.
>>
>> It's not clear whether this is a sufficiently rigorous test; the
>> volume I have had issues with is a 3 TB volume with about 2 TB used.
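For reference, the test described above amounts to something like the following; host names and brick paths here are placeholders, not the ones actually used:

    gluster volume create testvol replica 2 gl1:/bricks/test/brick gl2:/bricks/test/brick
    gluster volume start testvol
    # mount testvol on the two test servers and generate load, e.g. bonnie++ -d /mnt/testvol -u root
    gluster volume add-brick testvol replica 3 gl3:/bricks/test/brick   # converting to replica 3 triggers a heal to the new brick
    gluster volume heal testvol info                                    # watch which files are still being healed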
>>
>> -Alastair
>>
>>
>> On 19 March 2015 at 12:30, Alastair Neil <ajneil.tech at gmail.com> wrote:
>>
>> I don't think I have the resources to test it
>> meaningfully. I have about 50 VMs on my primary storage
>> domain. I might be able to set up a small 50 GB volume
>> and provision 2 or 3 VMs running test loads, but I'm not
>> sure it would be comparable. I'll give it a try and let
>> you know if I see similar behaviour.
>>
>> On 19 March 2015 at 11:34, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>
>> Without thinly provisioned LVM.
>>
>> Pranith
>>
>> On 03/19/2015 08:01 PM, Alastair Neil wrote:
>>> Do you mean raw partitions as bricks, or simply
>>> without thin-provisioned LVM?
>>>
>>>
>>>
>>> On 19 March 2015 at 00:32, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>
>>> Could you let me know if you see this problem
>>> without LVM as well?
>>>
>>> Pranith
>>>
>>> On 03/18/2015 08:25 PM, Alastair Neil wrote:
>>>> I am in the process of replacing the bricks
>>>> with thinly provisioned LVs, yes.
>>>>
>>>>
>>>>
>>>> On 18 March 2015 at 09:35, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>>>>
>>>> Hi,
>>>> Are you using a thin-LVM-based backend
>>>> on which the bricks are created?
>>>>
>>>> Pranith
>>>>
>>>> On 03/18/2015 02:05 AM, Alastair Neil wrote:
>>>>> I have an oVirt cluster with 6 VM hosts and
>>>>> 4 gluster nodes. There are two
>>>>> virtualisation clusters, one with two
>>>>> Nehalem nodes and one with four
>>>>> Sandy Bridge nodes. My master storage
>>>>> domain is GlusterFS, backed by a replica
>>>>> 3 gluster volume from 3 of the gluster
>>>>> nodes. The engine is a hosted engine
>>>>> 3.5.1 on 3 of the Sandy Bridge nodes, with
>>>>> storage provided by NFS from a different
>>>>> gluster volume. All the hosts are CentOS
>>>>> 6.6.
>>>>>
>>>>> vdsm-4.16.10-8.gitc937927.el6
>>>>> glusterfs-3.6.2-1.el6
>>>>> 2.6.32-504.8.1.el6.x86_64
>>>>>
>>>>>
>>>>> Problems happen when I try to add a new
>>>>> brick or replace a brick: eventually the
>>>>> self-heal will kill the VMs. In the VMs'
>>>>> logs I see kernel hung task messages.
>>>>>
>>>>> Mar 12 23:05:16 static1 kernel: INFO: task nginx:1736 blocked for more than 120 seconds.
>>>>> Mar 12 23:05:16 static1 kernel: Not tainted 2.6.32-504.3.3.el6.x86_64 #1
>>>>> Mar 12 23:05:16 static1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> Mar 12 23:05:16 static1 kernel: nginx D 0000000000000001 0 1736 1735 0x00000080
>>>>> Mar 12 23:05:16 static1 kernel: ffff8800778b17a8 0000000000000082 0000000000000000 00000000000126c0
>>>>> Mar 12 23:05:16 static1 kernel: ffff88007e5c6500 ffff880037170080 0006ce5c85bd9185 ffff88007e5c64d0
>>>>> Mar 12 23:05:16 static1 kernel: ffff88007a614ae0 00000001722b64ba ffff88007a615098 ffff8800778b1fd8
>>>>> Mar 12 23:05:16 static1 kernel: Call Trace:
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff8152a885>] schedule_timeout+0x215/0x2e0
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff8152a503>] wait_for_common+0x123/0x180
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa0210a76>] ? _xfs_buf_read+0x46/0x60 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa02063c7>] ? xfs_trans_read_buf+0x197/0x410 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff8152a61d>] wait_for_completion+0x1d/0x20
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa020ff5b>] xfs_buf_iowait+0x9b/0x100 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa02063c7>] ? xfs_trans_read_buf+0x197/0x410 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa0210a76>] _xfs_buf_read+0x46/0x60 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa0210b3b>] xfs_buf_read+0xab/0x100 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa02063c7>] xfs_trans_read_buf+0x197/0x410 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa01ee6a4>] xfs_imap_to_bp+0x54/0x130 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa01f077b>] xfs_iread+0x7b/0x1b0 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff811ab77e>] ? inode_init_always+0x11e/0x1c0
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa01eb5ee>] xfs_iget+0x27e/0x6e0 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa01eae1d>] ? xfs_iunlock+0x5d/0xd0 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa0209366>] xfs_lookup+0xc6/0x110 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa0216024>] xfs_vn_lookup+0x54/0xa0 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff8119dc65>] do_lookup+0x1a5/0x230
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff8119e8f4>] __link_path_walk+0x7a4/0x1000
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff811738e7>] ? cache_grow+0x217/0x320
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff8119f40a>] path_walk+0x6a/0xe0
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff8119f61b>] filename_lookup+0x6b/0xc0
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff811a0747>] user_path_at+0x57/0xa0
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa0204e74>] ? _xfs_trans_commit+0x214/0x2a0 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffffa01eae3e>] ? xfs_iunlock+0x7e/0xd0 [xfs]
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff81193bc0>] vfs_fstatat+0x50/0xa0
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff811aaf5d>] ? touch_atime+0x14d/0x1a0
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff81193d3b>] vfs_stat+0x1b/0x20
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff81193d64>] sys_newstat+0x24/0x50
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff810e5a7e>] ? __audit_syscall_exit+0x25e/0x290
>>>>> Mar 12 23:05:16 static1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
>>>>>
>>>>>
>>>>>
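For context, the replace operation that kicks off these heals is along these lines (volume and brick names are placeholders), and the heal commands below only watch progress, they are not a fix:

    gluster volume replace-brick datavol gl2:/bricks/data/brick gl4:/bricks/data/brick commit force
    gluster volume heal datavol info                 # files still pending heal on each brick
    gluster volume heal datavol info split-brain     # anything the self-heal cannot resolve on its own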
>>>>> I am wondering if my volume settings are
>>>>> causing this. Can anyone with more
>>>>> knowledge take a look and let me know:
>>>>>
>>>>> network.remote-dio: on
>>>>> performance.stat-prefetch: off
>>>>> performance.io-cache: off
>>>>> performance.read-ahead: off
>>>>> performance.quick-read: off
>>>>> nfs.export-volumes: on
>>>>> network.ping-timeout: 20
>>>>> cluster.self-heal-readdir-size: 64KB
>>>>> cluster.quorum-type: auto
>>>>> cluster.data-self-heal-algorithm: diff
>>>>> cluster.self-heal-window-size: 8
>>>>> cluster.heal-timeout: 500
>>>>> cluster.self-heal-daemon: on
>>>>> cluster.entry-self-heal: on
>>>>> cluster.data-self-heal: on
>>>>> cluster.metadata-self-heal: on
>>>>> cluster.readdir-optimize: on
>>>>> cluster.background-self-heal-count: 20
>>>>> cluster.rebalance-stats: on
>>>>> cluster.min-free-disk: 5%
>>>>> cluster.eager-lock: enable
>>>>> storage.owner-uid: 36
>>>>> storage.owner-gid: 36
>>>>> auth.allow:*
>>>>> user.cifs: disable
>>>>> cluster.server-quorum-ratio: 51%
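Of the options above, the ones that most directly control how aggressive the self-heal is are cluster.self-heal-window-size, cluster.background-self-heal-count and cluster.data-self-heal-algorithm. A hedged example of experimenting with them; the values are illustrations rather than tested recommendations for this setup, and "datavol" is a placeholder:

    gluster volume set datavol cluster.self-heal-window-size 2         # smaller per-file heal window
    gluster volume set datavol cluster.background-self-heal-count 4    # fewer heals running in the background at once
    gluster volume set datavol cluster.data-self-heal-algorithm diff   # only re-sync changed blocks (already set above)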
>>>>>
>>>>>
>>>>> Many Thanks, Alastair
>>>>>
>>>>>
>>>>>
With kind regards,
Jorick Astrego
Netbulae Virtualization Experts
----------------
Tel: 053 20 30 270 info at netbulae.eu Staalsteden 4-3A KvK 08198180
Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01
----------------