[ovirt-users] VMs freezing during heals

Wed Mar 18 13:35:10 UTC 2015

Hi,

Are the bricks created on a thin-LVM (thin-provisioned LVM) backend?
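
If it helps, here is a rough sketch of one way to check, assuming the bricks sit on LVM logical volumes. It parses the output of `lvs` and lists the volumes whose segment type is "thin". The helper names (parse_lvs, thin_volumes, list_thin_volumes) are made up for illustration, and the script assumes Python 2.7 or later and root privileges for `lvs`.

```python
import subprocess

# lvs report fields: volume group, logical volume, segment type, thin pool.
LVS_CMD = ["lvs", "--noheadings", "--separator", ",",
           "-o", "vg_name,lv_name,segtype,pool_lv"]


def parse_lvs(output):
    """Turn comma-separated lvs output into (vg, lv, segtype, pool) tuples."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(",")]
        # Pad in case the trailing pool_lv field is absent.
        fields += [""] * (4 - len(fields))
        rows.append(tuple(fields[:4]))
    return rows


def thin_volumes(rows):
    """Keep only thin-provisioned logical volumes (segtype 'thin')."""
    return [(vg, lv, pool) for vg, lv, segtype, pool in rows
            if segtype == "thin"]


def list_thin_volumes():
    """Run lvs on this host and return its thin-provisioned volumes."""
    out = subprocess.check_output(LVS_CMD).decode()
    return thin_volumes(parse_lvs(out))
```

Calling list_thin_volumes() on each gluster node and comparing the result with the devices mounted at the brick paths should show whether any brick is thin-provisioned. An empty list only means no thin volumes were found on that host.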

Pranith
On 03/18/2015 02:05 AM, Alastair Neil wrote:
> I have an oVirt cluster with 6 VM hosts and 4 gluster nodes. There are
> two virtualisation clusters, one with two Nehalem nodes and one with
> four Sandy Bridge nodes. My master storage domain is a GlusterFS domain
> backed by a replica 3 gluster volume from 3 of the gluster nodes. The
> engine is a hosted engine 3.5.1 on 3 of the Sandy Bridge nodes, with
> storage provided by NFS from a different gluster volume. All the
> hosts are CentOS 6.6.
>
>     vdsm-4.16.10-8.gitc937927.el6
>     glusterfs-3.6.2-1.el6
>     2.6.32-504.8.1.el6.x86_64
>
>
> Problems happen when I try to add a new brick or replace a brick:
> eventually the self-heal kills the VMs. In the VMs' logs I see
> kernel hung task messages.
>
>     Mar 12 23:05:16 static1 kernel: INFO: task nginx:1736 blocked for more than 120 seconds.
>     Mar 12 23:05:16 static1 kernel:      Not tainted 2.6.32-504.3.3.el6.x86_64 #1
>     Mar 12 23:05:16 static1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>     Mar 12 23:05:16 static1 kernel: nginx         D 0000000000000001     0  1736   1735 0x00000080
>     Mar 12 23:05:16 static1 kernel: ffff8800778b17a8 0000000000000082 0000000000000000 00000000000126c0
>     Mar 12 23:05:16 static1 kernel: ffff88007e5c6500 ffff880037170080 0006ce5c85bd9185 ffff88007e5c64d0
>     Mar 12 23:05:16 static1 kernel: ffff88007a614ae0 00000001722b64ba ffff88007a615098 ffff8800778b1fd8
>     Mar 12 23:05:16 static1 kernel: Call Trace:
>     Mar 12 23:05:16 static1 kernel: [<ffffffff8152a885>] schedule_timeout+0x215/0x2e0
>     Mar 12 23:05:16 static1 kernel: [<ffffffff8152a503>] wait_for_common+0x123/0x180
>     Mar 12 23:05:16 static1 kernel: [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa0210a76>] ? _xfs_buf_read+0x46/0x60 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa02063c7>] ? xfs_trans_read_buf+0x197/0x410 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffff8152a61d>] wait_for_completion+0x1d/0x20
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa020ff5b>] xfs_buf_iowait+0x9b/0x100 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa02063c7>] ? xfs_trans_read_buf+0x197/0x410 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa0210a76>] _xfs_buf_read+0x46/0x60 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa0210b3b>] xfs_buf_read+0xab/0x100 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa02063c7>] xfs_trans_read_buf+0x197/0x410 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa01ee6a4>] xfs_imap_to_bp+0x54/0x130 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa01f077b>] xfs_iread+0x7b/0x1b0 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffff811ab77e>] ? inode_init_always+0x11e/0x1c0
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa01eb5ee>] xfs_iget+0x27e/0x6e0 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa01eae1d>] ? xfs_iunlock+0x5d/0xd0 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa0209366>] xfs_lookup+0xc6/0x110 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa0216024>] xfs_vn_lookup+0x54/0xa0 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffff8119dc65>] do_lookup+0x1a5/0x230
>     Mar 12 23:05:16 static1 kernel: [<ffffffff8119e8f4>] __link_path_walk+0x7a4/0x1000
>     Mar 12 23:05:16 static1 kernel: [<ffffffff811738e7>] ? cache_grow+0x217/0x320
>     Mar 12 23:05:16 static1 kernel: [<ffffffff8119f40a>] path_walk+0x6a/0xe0
>     Mar 12 23:05:16 static1 kernel: [<ffffffff8119f61b>] filename_lookup+0x6b/0xc0
>     Mar 12 23:05:16 static1 kernel: [<ffffffff811a0747>] user_path_at+0x57/0xa0
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa0204e74>] ? _xfs_trans_commit+0x214/0x2a0 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffffa01eae3e>] ? xfs_iunlock+0x7e/0xd0 [xfs]
>     Mar 12 23:05:16 static1 kernel: [<ffffffff81193bc0>] vfs_fstatat+0x50/0xa0
>     Mar 12 23:05:16 static1 kernel: [<ffffffff811aaf5d>] ? touch_atime+0x14d/0x1a0
>     Mar 12 23:05:16 static1 kernel: [<ffffffff81193d3b>] vfs_stat+0x1b/0x20
>     Mar 12 23:05:16 static1 kernel: [<ffffffff81193d64>] sys_newstat+0x24/0x50
>     Mar 12 23:05:16 static1 kernel: [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
>     Mar 12 23:05:16 static1 kernel: [<ffffffff810e5a7e>] ? __audit_syscall_exit+0x25e/0x290
>     Mar 12 23:05:16 static1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
>
>
>
> I am wondering if my volume settings are causing this.  Can anyone 
> with more knowledge take a look and let me know:
>
>     network.remote-dio: on
>     performance.stat-prefetch: off
>     performance.io-cache: off
>     performance.read-ahead: off
>     performance.quick-read: off
>     nfs.export-volumes: on
>     network.ping-timeout: 20
>     cluster.self-heal-readdir-size: 64KB
>     cluster.quorum-type: auto
>     cluster.data-self-heal-algorithm: diff
>     cluster.self-heal-window-size: 8
>     cluster.heal-timeout: 500
>     cluster.self-heal-daemon: on
>     cluster.entry-self-heal: on
>     cluster.data-self-heal: on
>     cluster.metadata-self-heal: on
>     cluster.readdir-optimize: on
>     cluster.background-self-heal-count: 20
>     cluster.rebalance-stats: on
>     cluster.min-free-disk: 5%
>     cluster.eager-lock: enable
>     storage.owner-uid: 36
>     storage.owner-gid: 36
>     auth.allow: *
>     user.cifs: disable
>     cluster.server-quorum-ratio: 51%
>
>
> Many Thanks,  Alastair