[Users] Nodes lose storage at random
Johan Kooijman
mail at johankooijman.com
Sat Feb 22 20:48:25 EST 2014
I reinstalled the hosts with stock CentOS 6.5 last night, all successfully.
Things were fine until roughly midnight GMT, when 2 out of 4 hosts started
showing the same errors.
Any more suggestions?
On Sat, Feb 22, 2014 at 8:57 PM, Nir Soffer <nsoffer at redhat.com> wrote:
> ----- Original Message -----
> > From: "Johan Kooijman" <mail at johankooijman.com>
> > To: "Nir Soffer" <nsoffer at redhat.com>
> > Cc: "users" <users at ovirt.org>
> > Sent: Wednesday, February 19, 2014 2:34:36 PM
> > Subject: Re: [Users] Nodes lose storage at random
> >
> > Messages: https://t-x.dignus.nl/messages.txt
> > Sanlock: https://t-x.dignus.nl/sanlock.log.txt
>
> We can see in /var/log/messages that sanlock failed to write to
> the ids lockspace [1], which after 80 seconds [2] caused vdsm to lose
> its host id lease. In this case, sanlock kills vdsm [3], which dies after 11
> retries [4]. Then vdsm is respawned [5]. This is expected.
>
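For anyone who wants to rebuild this timeline from their own
/var/log/messages, here is a minimal Python sketch. The sample lines are
footnotes [1]-[4] from this thread, re-joined onto one line each. The event
labels and the substrings used to match them are my own assumptions based on
those four lines only, and I am assuming the number after the timestamp is a
seconds counter (it advances by 49 between 10:47:46 and 10:48:35, which fits).

```python
import re

# Footnotes [1]-[4] from this thread, one syslog line each.
SAMPLE = (
    "Feb 18 10:47:46 hv5 sanlock[14753]: 2014-02-18 10:47:46+0000 1251833 "
    "[21345]: s2 delta_renew read rv -202 offset 0 "
    "/rhev/data-center/mnt/10.0.24.1:_santank_ovirt-data/"
    "e9f70496-f181-4c9b-9ecb-d7f780772b04/dom_md/ids\n"
    "Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+0000 1251882 "
    "[14753]: s2 check_our_lease failed 80\n"
    "Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+0000 1251882 "
    "[14753]: s2 kill 19317 sig 15 count 1\n"
    "Feb 18 10:48:45 hv5 sanlock[14753]: 2014-02-18 10:48:45+0000 1251892 "
    "[14753]: dead 19317 ci 3 count 11\n"
)

# sanlock[pid]: <date> <time> <seconds counter> [<id>]: <message>
LINE_RE = re.compile(
    r"sanlock\[\d+\]: \S+ \S+ (?P<mono>\d+) \[\d+\]: (?P<msg>.*)"
)

# Hypothetical labels; the substrings come only from the lines quoted above.
EVENTS = [
    ("delta_renew", "lease renewal I/O failed"),
    ("check_our_lease failed", "lease expired"),
    (" kill ", "sanlock signalled the lease holder"),
    ("dead ", "lease holder exited"),
]


def classify(msg):
    """Return a label for a sanlock message, or None if it is not of interest."""
    for needle, label in EVENTS:
        if needle in msg:
            return label
    return None


def timeline(text):
    """Return (counter, label, seconds since first matched event) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        label = classify(m.group("msg"))
        if label is None:
            continue
        mono = int(m.group("mono"))
        first = rows[0][0] if rows else mono
        rows.append((mono, label, mono - first))
    return rows


if __name__ == "__main__":
    for mono, label, offset in timeline(SAMPLE):
        print(f"+{offset:>3}s  {mono}  {label}")
```

Note that the offsets are measured from the first quoted failure, not from the
last successful renewal, so they are presumably not expected to add up to the
80 seconds mentioned in [2].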
> We don't know why sanlock failed to write to the storage, but in [6] the
> kernel tells us that the nfs server is not responding. Since the nfs server
> is accessible from other machines, it means you have some issue with this
> host.
>
> Later the machine reboots [7], and the nfs server is still not accessible.
> Then there are a lot of WARN_ON call traces [8] that look related to network
> code.
>
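To get a feel for how many of these traces there are and whether they all come
from the same place, something like the following Python sketch could be run
over /var/log/messages. The sample is the single WARNING header from footnote
[8], re-joined onto one line; the regular expression is an assumption based on
that one line and may need adjusting for other warning formats.

```python
import re
from collections import Counter

# WARNING header from footnote [8], as syslog would write it on one line.
SAMPLE = (
    "Feb 18 18:29:53 hv5 kernel: WARNING: at net/core/dev.c:1759 "
    "skb_gso_segment+0x1df/0x2b0() (Not tainted)\n"
)

# Captures the source location and the function name before the "+offset".
WARN_RE = re.compile(r"kernel: WARNING: at (?P<where>\S+) (?P<func>\w+)\+")


def count_warnings(text):
    """Count kernel WARNING headers, grouped by (source location, function)."""
    counts = Counter()
    for line in text.splitlines():
        m = WARN_RE.search(line)
        if m:
            counts[(m.group("where"), m.group("func"))] += 1
    return counts


if __name__ == "__main__":
    for (where, func), n in count_warnings(SAMPLE).most_common():
        print(n, where, func)
```

To use it on a real host, read the log file and pass its contents to
count_warnings() instead of SAMPLE.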
> We can see that you are not running the most recent kernel [7]. We
> experienced various nfs issues during the 6.5 beta.
>
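A quick way to check which kernel a host logged at boot is to pull the release
string out of the "Linux version" line, as in this Python sketch. The boot line
is the start of footnote [7]. The reference release 2.6.32-431.el6.x86_64 is
what I believe the CentOS 6.5 GA kernel to be, so treat it as an assumption and
substitute the output of `uname -r` from a fully updated host. The comparison
only looks at the leading numeric fields and is not a full RPM version compare.

```python
import re

# Start of the boot line from footnote [7].
BOOT_LINE = (
    "Feb 18 11:03:01 hv5 kernel: Linux version 2.6.32-358.18.1.el6.x86_64 ("
)

# Assumed reference release; replace with `uname -r` from an updated host.
REFERENCE = "2.6.32-431.el6.x86_64"


def kernel_release(line):
    """Extract the release string from a 'Linux version ...' log line."""
    m = re.search(r"Linux version (\S+)", line)
    return m.group(1) if m else None


def release_key(release):
    """Leading numeric fields only, e.g. '2.6.32-358.18.1.el6' -> (2, 6, 32, 358, 18, 1)."""
    nums = []
    for part in re.split(r"[.-]", release):
        if not part.isdigit():
            break
        nums.append(int(part))
    return tuple(nums)


def is_older(release, reference=REFERENCE):
    """True if 'release' sorts before 'reference' on its numeric fields."""
    return release_key(release) < release_key(reference)


if __name__ == "__main__":
    rel = kernel_release(BOOT_LINE)
    print(rel, "older than reference:", is_older(rel))
```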
> I would try to get help from kernel folks about this.
>
> [1] Feb 18 10:47:46 hv5 sanlock[14753]: 2014-02-18 10:47:46+0000 1251833
> [21345]: s2 delta_renew read rv -202 offset 0
> /rhev/data-center/mnt/10.0.24.1:_santank_ovirt-data/e9f70496-f181-4c9b-9ecb-d7f780772b04/dom_md/ids
>
> [2] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+0000 1251882
> [14753]: s2 check_our_lease failed 80
>
> [3] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+0000 1251882
> [14753]: s2 kill 19317 sig 15 count 1
>
> [4] Feb 18 10:48:45 hv5 sanlock[14753]: 2014-02-18 10:48:45+0000 1251892
> [14753]: dead 19317 ci 3 count 11
>
> [5] Feb 18 10:48:45 hv5 respawn: slave '/usr/share/vdsm/vdsm' died,
> respawning slave
>
> [6] Feb 18 10:57:36 hv5 kernel: nfs: server 10.0.24.1 not responding,
> timed out
>
> [7]
> Feb 18 11:03:01 hv5 kernel: imklog 5.8.10, log source = /proc/kmsg started.
> Feb 18 11:03:01 hv5 kernel: Linux version 2.6.32-358.18.1.el6.x86_64 (
> mockbuild at c6b10.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
> 4.4.7-3) (GCC) ) #1 SMP Wed Aug 28 17:19:38 UTC 2013
>
> [8]
> Feb 18 18:29:53 hv5 kernel: ------------[ cut here ]------------
> Feb 18 18:29:53 hv5 kernel: WARNING: at net/core/dev.c:1759
> skb_gso_segment+0x1df/0x2b0() (Not tainted)
> Feb 18 18:29:53 hv5 kernel: Hardware name: X9DRW
> Feb 18 18:29:53 hv5 kernel: igb: caps=(0x12114bb3, 0x0) len=1596
> data_len=0 ip_summed=0
> Feb 18 18:29:53 hv5 kernel: Modules linked in: ebt_arp nfs fscache
> auth_rpcgss nfs_acl bonding softdog ebtable_nat ebtables bnx2fc fcoe
> libfcoe libfc scsi_transport_fc scsi_tgt lockd sunrpc bridge ipt_REJECT
> nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_physdev
> ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack
> xt_multiport ip6table_filter ip6_tables ext4 jbd2 8021q garp stp llc
> sha256_generic cbc cryptoloop dm_crypt aesni_intel cryptd aes_x86_64
> aes_generic vhost_net macvtap macvlan tun kvm_intel kvm sg sb_edac
> edac_core iTCO_wdt iTCO_vendor_support ioatdma shpchp dm_snapshot squashfs
> ext2 mbcache dm_round_robin sd_mod crc_t10dif isci libsas
> scsi_transport_sas 3w_sas ahci ixgbe igb dca ptp pps_core dm_multipath
> dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i
> cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs
> libiscsi scsi_transport_iscsi [last unloaded: scsi_wait_scan]
> Feb 18 18:29:53 hv5 kernel: Pid: 5462, comm: vhost-5458 Not tainted
> 2.6.32-358.18.1.el6.x86_64 #1
> Feb 18 18:29:53 hv5 kernel: Call Trace:
> Feb 18 18:29:53 hv5 kernel: <IRQ> [<ffffffff8106e3e7>] ?
> warn_slowpath_common+0x87/0xc0
> Feb 18 18:29:53 hv5 kernel: [<ffffffff8106e4d6>] ?
> warn_slowpath_fmt+0x46/0x50
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa020bd62>] ?
> igb_get_drvinfo+0x82/0xe0 [igb]
> Feb 18 18:29:53 hv5 kernel: [<ffffffff81448e7f>] ?
> skb_gso_segment+0x1df/0x2b0
> Feb 18 18:29:53 hv5 kernel: [<ffffffff81449260>] ?
> dev_hard_start_xmit+0x1b0/0x530
> Feb 18 18:29:53 hv5 kernel: [<ffffffff8146773a>] ?
> sch_direct_xmit+0x15a/0x1c0
> Feb 18 18:29:53 hv5 kernel: [<ffffffff8144d0c0>] ?
> dev_queue_xmit+0x3b0/0x550
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa04af65c>] ?
> br_dev_queue_push_xmit+0x6c/0xa0 [bridge]
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa04af6e8>] ?
> br_forward_finish+0x58/0x60 [bridge]
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa04af79a>] ? __br_forward+0xaa/0xd0
> [bridge]
> Feb 18 18:29:53 hv5 kernel: [<ffffffff81474f34>] ? nf_hook_slow+0x74/0x110
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa04af81d>] ? br_forward+0x5d/0x70
> [bridge]
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa04b0609>] ?
> br_handle_frame_finish+0x179/0x2a0 [bridge]
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa04b08da>] ?
> br_handle_frame+0x1aa/0x250 [bridge]
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa0331690>] ? pit_timer_fn+0x0/0x80
> [kvm]
> Feb 18 18:29:53 hv5 kernel: [<ffffffff81448929>] ?
> __netif_receive_skb+0x529/0x750
> Feb 18 18:29:53 hv5 kernel: [<ffffffff81448bea>] ?
> process_backlog+0x9a/0x100
> Feb 18 18:29:53 hv5 kernel: [<ffffffff8144d453>] ?
> net_rx_action+0x103/0x2f0
> Feb 18 18:29:53 hv5 kernel: [<ffffffff810770b1>] ? __do_softirq+0xc1/0x1e0
> Feb 18 18:29:53 hv5 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
> Feb 18 18:29:53 hv5 kernel: <EOI> [<ffffffff8100de05>] ?
> do_softirq+0x65/0xa0
> Feb 18 18:29:53 hv5 kernel: [<ffffffff8144d8d8>] ? netif_rx_ni+0x28/0x30
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa02b7749>] ? tun_sendmsg+0x229/0x4ec
> [tun]
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa037bcf5>] ? handle_tx+0x275/0x5e0
> [vhost_net]
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa037c095>] ?
> handle_tx_kick+0x15/0x20 [vhost_net]
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa037955c>] ? vhost_worker+0xbc/0x140
> [vhost_net]
> Feb 18 18:29:53 hv5 kernel: [<ffffffffa03794a0>] ? vhost_worker+0x0/0x140
> [vhost_net]
> Feb 18 18:29:53 hv5 kernel: [<ffffffff81096a36>] ? kthread+0x96/0xa0
> Feb 18 18:29:53 hv5 kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
> Feb 18 18:29:53 hv5 kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
> Feb 18 18:29:53 hv5 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
> Feb 18 18:29:53 hv5 kernel: ---[ end trace 2ae4b3142333fe7d ]---
>
>
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman
T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E mail at johankooijman.com