Hi Jurrien,
I don't see anything in the logs on the nodes themselves. The only thing we
see is in the engine log: it loses connectivity to the host.
It is definitely CentOS 7.1/7.2 related. I downgraded the hosts to ovirt-iso
3.5, and that resolves the issue.
On Fri, Mar 18, 2016 at 9:01 AM, Bloemen, Jurriën <
Jurrien.Bloemen(a)dmc.amcnetworks.com> wrote:
Hi Johan,
Could you check if you see the following in your dmesg or messages log file?
[1123306.014288] ------------[ cut here ]------------
[1123306.014302] WARNING: at net/core/dev.c:2189
skb_warn_bad_offload+0xcd/0xda()
[1123306.014306] : caps=(0x0000000200004849, 0x0000000000000000) len=330
data_len=276 gso_size=276 gso_type=1 ip_summed=1
[1123306.014308] Modules linked in: vhost_net macvtap macvlan
ip6table_filter ip6_tables iptable_filter ip_tables ebt_arp ebtable_nat
ebtables tun scsi_transport_iscsi iTCO_wdt iTCO_vendor_support
dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel kvm
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cryptd pcspkr sb_edac
edac_core i2c_i801 lpc_ich mfd_core mei_me mei wmi ioatdma shpchp
ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter acpi_pad 8021q garp
mrp bridge stp llc bonding dm_multipath xfs libcrc32c sd_mod crc_t10dif
crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm
crc32c_intel igb drm ahci ixgbe i2c_algo_bit libahci libata mdio i2c_core
ptp megaraid_sas pps_core dca dm_mirror dm_region_hash dm_log dm_mod
[1123306.014360] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G W
-------------- 3.10.0-229.1.2.el7.x86_64 #1
[1123306.014362] Hardware name: Supermicro SYS-2028TP-HC1TR/X10DRT-PT,
BIOS 1.1 08/03/2015
[1123306.014364] ffff881fffc439a8 5326fb90ad1041ea ffff881fffc43960
ffffffff81604afa
[1123306.014371] ffff881fffc43998 ffffffff8106e34b ffff881fcebb0500
ffff881fce88c000
[1123306.014376] 0000000000000001 0000000000000001 ffff881fcebb0500
ffff881fffc43a00
[1123306.014381] Call Trace:
[1123306.014383] <IRQ> [<ffffffff81604afa>] dump_stack+0x19/0x1b
[1123306.014396] [<ffffffff8106e34b>] warn_slowpath_common+0x6b/0xb0
[1123306.014399] [<ffffffff8106e3ec>] warn_slowpath_fmt+0x5c/0x80
[1123306.014405] [<ffffffff812db093>] ? ___ratelimit+0x93/0x100
[1123306.014409] [<ffffffff816076c3>] skb_warn_bad_offload+0xcd/0xda
[1123306.014425] [<ffffffff814fdeb9>] __skb_gso_segment+0x79/0xb0
[1123306.014429] [<ffffffff814fe1c2>] dev_hard_start_xmit+0x1a2/0x580
[1123306.014438] [<ffffffffa0168790>] ? deliver_clone+0x50/0x50 [bridge]
[1123306.014443] [<ffffffff8151df1e>] sch_direct_xmit+0xee/0x1c0
[1123306.014447] [<ffffffff814fe798>] dev_queue_xmit+0x1f8/0x4a0
[1123306.014453] [<ffffffffa016880b>] br_dev_queue_push_xmit+0x7b/0xc0
[bridge]
[1123306.014458] [<ffffffffa0168a22>] br_forward_finish+0x22/0x60 [bridge]
[1123306.014464] [<ffffffffa0168ae0>] __br_forward+0x80/0xf0 [bridge]
[1123306.014469] [<ffffffffa0168ebb>] br_forward+0x8b/0xa0 [bridge]
[1123306.014476] [<ffffffffa0169e65>] br_handle_frame_finish+0x175/0x410
[bridge]
[1123306.014481] [<ffffffffa016a275>] br_handle_frame+0x175/0x260 [bridge]
[1123306.014485] [<ffffffff814fc112>] __netif_receive_skb_core+0x282/0x870
[1123306.014490] [<ffffffff8101b589>] ? read_tsc+0x9/0x10
[1123306.014493] [<ffffffff814fc718>] __netif_receive_skb+0x18/0x60
[1123306.014497] [<ffffffff814fc7a0>] netif_receive_skb+0x40/0xd0
[1123306.014500] [<ffffffff814fd2b0>] napi_gro_receive+0x80/0xb0
[1123306.014512] [<ffffffffa00cde2c>] ixgbe_clean_rx_irq+0x7ac/0xb30
[ixgbe]
[1123306.014519] [<ffffffffa00cf07b>] ixgbe_poll+0x4bb/0x930 [ixgbe]
[1123306.014524] [<ffffffff814fcb62>] net_rx_action+0x152/0x240
[1123306.014528] [<ffffffff81077bf7>] __do_softirq+0xf7/0x290
[1123306.014533] [<ffffffff8161635c>] call_softirq+0x1c/0x30
[1123306.014539] [<ffffffff81015de5>] do_softirq+0x55/0x90
[1123306.014543] [<ffffffff81077f95>] irq_exit+0x115/0x120
[1123306.014546] [<ffffffff81616ef8>] do_IRQ+0x58/0xf0
[1123306.014551] [<ffffffff8160c0ed>] common_interrupt+0x6d/0x6d
[1123306.014553] <EOI> [<ffffffff814aa6d2>] ?
cpuidle_enter_state+0x52/0xc0
[1123306.014561] [<ffffffff814aa6c8>] ? cpuidle_enter_state+0x48/0xc0
[1123306.014565] [<ffffffff814aa805>] cpuidle_idle_call+0xc5/0x200
[1123306.014569] [<ffffffff8101d21e>] arch_cpu_idle+0xe/0x30
[1123306.014574] [<ffffffff810c6945>] cpu_startup_entry+0xf5/0x290
[1123306.014580] [<ffffffff810423ca>] start_secondary+0x1ba/0x230
[1123306.014582] ---[ end trace 4d5a1bc838e1fcc0 ]---
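A quick way to check for this without reading the whole log is to count the warning lines. This is a minimal sketch, assuming a POSIX shell; the function name count_bad_offload is made up for illustration, and /var/log/messages is the usual CentOS location but may differ on your hosts:

```shell
#!/bin/sh
# Count kernel warnings raised by skb_warn_bad_offload.
# With file arguments, search those files; with none, search dmesg output.
# Only the "WARNING:" header line is matched, so each trace counts once
# (the call-trace line that also names the function is ignored).
count_bad_offload() {
    if [ "$#" -eq 0 ]; then
        dmesg | grep -c 'WARNING:.*skb_warn_bad_offload'
    else
        cat -- "$@" | grep -c 'WARNING:.*skb_warn_bad_offload'
    fi
}

# Example (hypothetical invocation):
#   count_bad_offload /var/log/messages
```

A result greater than zero means the warning was logged at least once.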
If so, then could you try the following:
ethtool -K <nic name> lro off
Do this for all the 10G Intel NICs and check whether the problem still exists.
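If there are several ports, the same command can be looped over every interface bound to the ixgbe driver. This is only a sketch: the function names and the SYSFS_NET variable are made up for illustration, and it assumes the driver symlink under /sys/class/net/<nic>/device/driver is present on your kernel:

```shell
#!/bin/sh
# Print the names of network interfaces bound to a kernel driver.
# $1: driver name (default: ixgbe).
# SYSFS_NET: hypothetical override of the sysfs path, useful for testing.
list_nics_by_driver() {
    driver=${1:-ixgbe}
    root=${SYSFS_NET:-/sys/class/net}
    for dev in "$root"/*; do
        # Virtual interfaces (lo, bridges, bonds) have no device/driver link.
        [ -L "$dev/device/driver" ] || continue
        # The symlink target ends in the driver name, e.g. .../drivers/ixgbe
        name=$(basename "$(readlink "$dev/device/driver")")
        if [ "$name" = "$driver" ]; then
            basename "$dev"
        fi
    done
    return 0
}

# Turn LRO off on every matching interface and show the resulting state.
# Needs root and the ethtool binary.
disable_lro_all() {
    list_nics_by_driver "${1:-ixgbe}" | while read -r nic; do
        ethtool -K "$nic" lro off
        ethtool -k "$nic" | grep large-receive-offload
    done
}

# Example (run as root):
#   disable_lro_all ixgbe
```

Settings made with ethtool -K do not survive a reboot, so if this helps, the change still has to be made persistent in the host's network configuration.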
Kind regards,
Jurriën Bloemen
On 17-03-16 09:49, Johan Kooijman wrote:
Hi all,
Since we upgraded to the latest oVirt Node running CentOS 7.2, we're seeing
that nodes become unavailable after a while. A node runs fine, with a couple
of VMs on it, until it becomes unresponsive. At that moment it doesn't
even respond to ICMP. It comes back by itself after a while, but oVirt
fences the machine before that time and restarts the VMs elsewhere.
Engine tells me this message:
VDSM host09 command failed: Message timeout which can be caused by
communication issues
Is anyone else experiencing these issues with ixgbe drivers? I'm running
on Intel X540-AT2 cards.
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman