[Users] Nodes lose storage at random

Nir Soffer nsoffer at redhat.com
Sat Feb 22 19:57:55 UTC 2014


----- Original Message -----
> From: "Johan Kooijman" <mail at johankooijman.com>
> To: "Nir Soffer" <nsoffer at redhat.com>
> Cc: "users" <users at ovirt.org>
> Sent: Wednesday, February 19, 2014 2:34:36 PM
> Subject: Re: [Users] Nodes lose storage at random
> 
> Messages: https://t-x.dignus.nl/messages.txt
> Sanlock: https://t-x.dignus.nl/sanlock.log.txt

We can see in /var/log/messages that sanlock failed to write to
the ids lockspace [1], which after 80 seconds [2] caused vdsm to lose
its host id lease. In this case, sanlock kills vdsm [3], which dies after 11
retries [4]. Then vdsm is respawned [5]. This is expected.
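
To follow the same sequence in a long /var/log/messages, a small filter can
pull out just the sanlock events described above. This is only a rough
sketch: the patterns are inferred from the four sample lines quoted in
[1]-[4], not from sanlock documentation, and the names sanlock_timeline and
SANLOCK_EVENTS are made up for this example.

```python
import re

# Patterns inferred from the sample lines [1]-[4] only (assumption);
# other sanlock versions may log these events differently.
SANLOCK_EVENTS = {
    "renewal_failed": re.compile(r"delta_renew \w+ rv (-?\d+)"),
    "lease_expired": re.compile(r"check_our_lease failed (\d+)"),
    "kill_sent": re.compile(r"kill (\d+) sig (\d+) count (\d+)"),
    "process_dead": re.compile(r"dead (\d+) ci \d+ count (\d+)"),
}

# Syslog prefix as it appears in the quoted lines: "Feb 18 10:48:35 hv5 sanlock[14753]: "
SYSLOG_PREFIX = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ sanlock\[\d+\]: ")


def sanlock_timeline(lines):
    """Return (timestamp, event_name, captured_fields) for each sanlock event line."""
    events = []
    for line in lines:
        prefix = SYSLOG_PREFIX.match(line)
        if not prefix:
            continue  # not a sanlock message
        for name, pattern in SANLOCK_EVENTS.items():
            found = pattern.search(line)
            if found:
                events.append((prefix.group(1), name, found.groups()))
                break
    return events


# Two of the lines quoted in this mail, used as sample input.
SAMPLE = [
    "Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+0000 1251882 [14753]: s2 check_our_lease failed 80",
    "Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+0000 1251882 [14753]: s2 kill 19317 sig 15 count 1",
    "Feb 18 10:48:45 hv5 respawn: slave '/usr/share/vdsm/vdsm' died, respawning slave",
]

if __name__ == "__main__":
    for timestamp, name, fields in sanlock_timeline(SAMPLE):
        print(timestamp, name, fields)
```

To run it on a real host, pass an open file instead of SAMPLE, for example
sanlock_timeline(open("/var/log/messages")).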

We don't know why sanlock failed to write to the storage, but in [6] the
kernel tells us that the NFS server is not responding. Since the NFS server
is accessible from other machines, this points to an issue with this host.

Later the machine reboots [7], and the NFS server is still not accessible. Then
you have a lot of WARN_ON call traces [8], which look related to network code.

We can see that you are not running the most recent kernel [7]. We experienced
various NFS issues during the 6.5 beta.
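
If you want to check this on several hosts, something like the following
could compare the release in the boot line [7] against a baseline. This is
only a sketch: the function names are made up for this example, and the
baseline 2.6.32-431.el6 is my assumption for the first 6.5 kernel build, so
replace it with whatever release you consider current.

```python
import re


def booted_release(line):
    """Extract the release string from a 'Linux version ...' boot log line."""
    found = re.search(r"Linux version (\S+)", line)
    return found.group(1) if found else None


def release_tuple(release):
    """Turn '2.6.32-358.18.1.el6.x86_64' into its leading numeric fields."""
    fields = []
    for part in re.split(r"[.-]", release):
        if not part.isdigit():
            break  # stop at the first non-numeric field, e.g. 'el6'
        fields.append(int(part))
    return tuple(fields)


def is_older_than(release, baseline):
    """Compare two release strings by their leading numeric fields."""
    return release_tuple(release) < release_tuple(baseline)


# Boot line as quoted in [7], shortened after the release string.
BOOT_LINE = "Feb 18 11:03:01 hv5 kernel: Linux version 2.6.32-358.18.1.el6.x86_64 (mockbuild ...)"

# Assumed baseline, not taken from this thread.
BASELINE = "2.6.32-431.el6"

if __name__ == "__main__":
    release = booted_release(BOOT_LINE)
    print(release, "older than", BASELINE, ":", is_older_than(release, BASELINE))
```

The comparison only looks at the numeric fields before the dist tag, which is
enough to order kernel builds within the same el6 series.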

I would try to get help from the kernel folks about this.

[1] Feb 18 10:47:46 hv5 sanlock[14753]: 2014-02-18 10:47:46+0000 1251833 [21345]: s2 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.0.24.1:_santank_ovirt-data/e9f70496-f181-4c9b-9ecb-d7f780772b04/dom_md/ids

[2] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+0000 1251882 [14753]: s2 check_our_lease failed 80

[3] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+0000 1251882 [14753]: s2 kill 19317 sig 15 count 1

[4] Feb 18 10:48:45 hv5 sanlock[14753]: 2014-02-18 10:48:45+0000 1251892 [14753]: dead 19317 ci 3 count 11

[5] Feb 18 10:48:45 hv5 respawn: slave '/usr/share/vdsm/vdsm' died, respawning slave

[6] Feb 18 10:57:36 hv5 kernel: nfs: server 10.0.24.1 not responding, timed out

[7]
Feb 18 11:03:01 hv5 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Feb 18 11:03:01 hv5 kernel: Linux version 2.6.32-358.18.1.el6.x86_64 (mockbuild at c6b10.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Wed Aug 28 17:19:38 UTC 2013

[8]
Feb 18 18:29:53 hv5 kernel: ------------[ cut here ]------------
Feb 18 18:29:53 hv5 kernel: WARNING: at net/core/dev.c:1759 skb_gso_segment+0x1df/0x2b0() (Not tainted)
Feb 18 18:29:53 hv5 kernel: Hardware name: X9DRW
Feb 18 18:29:53 hv5 kernel: igb: caps=(0x12114bb3, 0x0) len=1596 data_len=0 ip_summed=0
Feb 18 18:29:53 hv5 kernel: Modules linked in: ebt_arp nfs fscache auth_rpcgss nfs_acl bonding softdog ebtable_nat ebtables bnx2fc fcoe libfcoe libfc scsi_transport_fc scsi_tgt lockd sunrpc bridge ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_physdev ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack xt_multiport ip6table_filter ip6_tables ext4 jbd2 8021q garp stp llc sha256_generic cbc cryptoloop dm_crypt aesni_intel cryptd aes_x86_64 aes_generic vhost_net macvtap macvlan tun kvm_intel kvm sg sb_edac edac_core iTCO_wdt iTCO_vendor_support ioatdma shpchp dm_snapshot squashfs ext2 mbcache dm_round_robin sd_mod crc_t10dif isci libsas scsi_transport_sas 3w_sas ahci ixgbe igb dca ptp pps_core dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: scsi_wait_scan]
Feb 18 18:29:53 hv5 kernel: Pid: 5462, comm: vhost-5458 Not tainted 2.6.32-358.18.1.el6.x86_64 #1
Feb 18 18:29:53 hv5 kernel: Call Trace:
Feb 18 18:29:53 hv5 kernel: <IRQ>  [<ffffffff8106e3e7>] ? warn_slowpath_common+0x87/0xc0
Feb 18 18:29:53 hv5 kernel: [<ffffffff8106e4d6>] ? warn_slowpath_fmt+0x46/0x50
Feb 18 18:29:53 hv5 kernel: [<ffffffffa020bd62>] ? igb_get_drvinfo+0x82/0xe0 [igb]
Feb 18 18:29:53 hv5 kernel: [<ffffffff81448e7f>] ? skb_gso_segment+0x1df/0x2b0
Feb 18 18:29:53 hv5 kernel: [<ffffffff81449260>] ? dev_hard_start_xmit+0x1b0/0x530
Feb 18 18:29:53 hv5 kernel: [<ffffffff8146773a>] ? sch_direct_xmit+0x15a/0x1c0
Feb 18 18:29:53 hv5 kernel: [<ffffffff8144d0c0>] ? dev_queue_xmit+0x3b0/0x550
Feb 18 18:29:53 hv5 kernel: [<ffffffffa04af65c>] ? br_dev_queue_push_xmit+0x6c/0xa0 [bridge]
Feb 18 18:29:53 hv5 kernel: [<ffffffffa04af6e8>] ? br_forward_finish+0x58/0x60 [bridge]
Feb 18 18:29:53 hv5 kernel: [<ffffffffa04af79a>] ? __br_forward+0xaa/0xd0 [bridge]
Feb 18 18:29:53 hv5 kernel: [<ffffffff81474f34>] ? nf_hook_slow+0x74/0x110
Feb 18 18:29:53 hv5 kernel: [<ffffffffa04af81d>] ? br_forward+0x5d/0x70 [bridge]
Feb 18 18:29:53 hv5 kernel: [<ffffffffa04b0609>] ? br_handle_frame_finish+0x179/0x2a0 [bridge]
Feb 18 18:29:53 hv5 kernel: [<ffffffffa04b08da>] ? br_handle_frame+0x1aa/0x250 [bridge]
Feb 18 18:29:53 hv5 kernel: [<ffffffffa0331690>] ? pit_timer_fn+0x0/0x80 [kvm]
Feb 18 18:29:53 hv5 kernel: [<ffffffff81448929>] ? __netif_receive_skb+0x529/0x750
Feb 18 18:29:53 hv5 kernel: [<ffffffff81448bea>] ? process_backlog+0x9a/0x100
Feb 18 18:29:53 hv5 kernel: [<ffffffff8144d453>] ? net_rx_action+0x103/0x2f0
Feb 18 18:29:53 hv5 kernel: [<ffffffff810770b1>] ? __do_softirq+0xc1/0x1e0
Feb 18 18:29:53 hv5 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Feb 18 18:29:53 hv5 kernel: <EOI>  [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Feb 18 18:29:53 hv5 kernel: [<ffffffff8144d8d8>] ? netif_rx_ni+0x28/0x30
Feb 18 18:29:53 hv5 kernel: [<ffffffffa02b7749>] ? tun_sendmsg+0x229/0x4ec [tun]
Feb 18 18:29:53 hv5 kernel: [<ffffffffa037bcf5>] ? handle_tx+0x275/0x5e0 [vhost_net]
Feb 18 18:29:53 hv5 kernel: [<ffffffffa037c095>] ? handle_tx_kick+0x15/0x20 [vhost_net]
Feb 18 18:29:53 hv5 kernel: [<ffffffffa037955c>] ? vhost_worker+0xbc/0x140 [vhost_net]
Feb 18 18:29:53 hv5 kernel: [<ffffffffa03794a0>] ? vhost_worker+0x0/0x140 [vhost_net]
Feb 18 18:29:53 hv5 kernel: [<ffffffff81096a36>] ? kthread+0x96/0xa0
Feb 18 18:29:53 hv5 kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Feb 18 18:29:53 hv5 kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
Feb 18 18:29:53 hv5 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Feb 18 18:29:53 hv5 kernel: ---[ end trace 2ae4b3142333fe7d ]---



