Hi Jurrien,

I don't see anything in the logs on the nodes themselves. The only thing we see is in the engine log: it loses connectivity to the host.
It's definitely CentOS 7.1/7.2 related. I downgraded the hosts to ovirt-iso 3.5, and that resolves the issue.

On Fri, Mar 18, 2016 at 9:01 AM, Bloemen, Jurriën <Jurrien.Bloemen@dmc.amcnetworks.com> wrote:
Hi Johan,

Could you check whether you see the following in your dmesg output or messages log file?

[1123306.014288] ------------[ cut here ]------------
[1123306.014302] WARNING: at net/core/dev.c:2189 skb_warn_bad_offload+0xcd/0xda()
[1123306.014306] : caps=(0x0000000200004849, 0x0000000000000000) len=330 data_len=276 gso_size=276 gso_type=1 ip_summed=1
[1123306.014308] Modules linked in: vhost_net macvtap macvlan ip6table_filter ip6_tables iptable_filter ip_tables ebt_arp ebtable_nat ebtables tun scsi_transport_iscsi iTCO_wdt iTCO_vendor_support dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cryptd pcspkr sb_edac edac_core i2c_i801 lpc_ich mfd_core mei_me mei wmi ioatdma shpchp ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter acpi_pad 8021q garp mrp bridge stp llc bonding dm_multipath xfs libcrc32c sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm crc32c_intel igb drm ahci ixgbe i2c_algo_bit libahci libata mdio i2c_core ptp megaraid_sas pps_core dca dm_mirror dm_region_hash dm_log dm_mod
[1123306.014360] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G        W   --------------   3.10.0-229.1.2.el7.x86_64 #1
[1123306.014362] Hardware name: Supermicro SYS-2028TP-HC1TR/X10DRT-PT, BIOS 1.1 08/03/2015
[1123306.014364]  ffff881fffc439a8 5326fb90ad1041ea ffff881fffc43960 ffffffff81604afa
[1123306.014371]  ffff881fffc43998 ffffffff8106e34b ffff881fcebb0500 ffff881fce88c000
[1123306.014376]  0000000000000001 0000000000000001 ffff881fcebb0500 ffff881fffc43a00
[1123306.014381] Call Trace:
[1123306.014383]  <IRQ>  [<ffffffff81604afa>] dump_stack+0x19/0x1b
[1123306.014396]  [<ffffffff8106e34b>] warn_slowpath_common+0x6b/0xb0
[1123306.014399]  [<ffffffff8106e3ec>] warn_slowpath_fmt+0x5c/0x80
[1123306.014405]  [<ffffffff812db093>] ? ___ratelimit+0x93/0x100
[1123306.014409]  [<ffffffff816076c3>] skb_warn_bad_offload+0xcd/0xda
[1123306.014425]  [<ffffffff814fdeb9>] __skb_gso_segment+0x79/0xb0
[1123306.014429]  [<ffffffff814fe1c2>] dev_hard_start_xmit+0x1a2/0x580
[1123306.014438]  [<ffffffffa0168790>] ? deliver_clone+0x50/0x50 [bridge]
[1123306.014443]  [<ffffffff8151df1e>] sch_direct_xmit+0xee/0x1c0
[1123306.014447]  [<ffffffff814fe798>] dev_queue_xmit+0x1f8/0x4a0
[1123306.014453]  [<ffffffffa016880b>] br_dev_queue_push_xmit+0x7b/0xc0 [bridge]
[1123306.014458]  [<ffffffffa0168a22>] br_forward_finish+0x22/0x60 [bridge]
[1123306.014464]  [<ffffffffa0168ae0>] __br_forward+0x80/0xf0 [bridge]
[1123306.014469]  [<ffffffffa0168ebb>] br_forward+0x8b/0xa0 [bridge]
[1123306.014476]  [<ffffffffa0169e65>] br_handle_frame_finish+0x175/0x410 [bridge]
[1123306.014481]  [<ffffffffa016a275>] br_handle_frame+0x175/0x260 [bridge]
[1123306.014485]  [<ffffffff814fc112>] __netif_receive_skb_core+0x282/0x870
[1123306.014490]  [<ffffffff8101b589>] ? read_tsc+0x9/0x10
[1123306.014493]  [<ffffffff814fc718>] __netif_receive_skb+0x18/0x60
[1123306.014497]  [<ffffffff814fc7a0>] netif_receive_skb+0x40/0xd0
[1123306.014500]  [<ffffffff814fd2b0>] napi_gro_receive+0x80/0xb0
[1123306.014512]  [<ffffffffa00cde2c>] ixgbe_clean_rx_irq+0x7ac/0xb30 [ixgbe]
[1123306.014519]  [<ffffffffa00cf07b>] ixgbe_poll+0x4bb/0x930 [ixgbe]
[1123306.014524]  [<ffffffff814fcb62>] net_rx_action+0x152/0x240
[1123306.014528]  [<ffffffff81077bf7>] __do_softirq+0xf7/0x290
[1123306.014533]  [<ffffffff8161635c>] call_softirq+0x1c/0x30
[1123306.014539]  [<ffffffff81015de5>] do_softirq+0x55/0x90
[1123306.014543]  [<ffffffff81077f95>] irq_exit+0x115/0x120
[1123306.014546]  [<ffffffff81616ef8>] do_IRQ+0x58/0xf0
[1123306.014551]  [<ffffffff8160c0ed>] common_interrupt+0x6d/0x6d
[1123306.014553]  <EOI>  [<ffffffff814aa6d2>] ? cpuidle_enter_state+0x52/0xc0
[1123306.014561]  [<ffffffff814aa6c8>] ? cpuidle_enter_state+0x48/0xc0
[1123306.014565]  [<ffffffff814aa805>] cpuidle_idle_call+0xc5/0x200
[1123306.014569]  [<ffffffff8101d21e>] arch_cpu_idle+0xe/0x30
[1123306.014574]  [<ffffffff810c6945>] cpu_startup_entry+0xf5/0x290
[1123306.014580]  [<ffffffff810423ca>] start_secondary+0x1ba/0x230
[1123306.014582] ---[ end trace 4d5a1bc838e1fcc0 ]---
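
One way to search for this warning, as a rough sketch rather than a definitive check: the helper name count_bad_offload is made up for illustration, and /var/log/messages is assumed to be where the kernel messages end up on these hosts.

```shell
#!/bin/sh
# Count skb_warn_bad_offload kernel warnings in a log.
# Reads the file(s) given as arguments, or standard input when none are given.
# count_bad_offload is a hypothetical helper name, not an existing tool.
count_bad_offload() {
    # Match only the "WARNING: at ..." header line so each trace counts once;
    # the function name appears a second time inside the call trace itself.
    grep -c 'WARNING: .*skb_warn_bad_offload' "$@"
}

# Usage (the log path is an assumption):
#   dmesg | count_bad_offload
#   count_bad_offload /var/log/messages
```

A count above zero means the host has logged the same warning as in the trace above.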

If so, then could you try the following:

ethtool -K <nic name> lro off

Do this for all the 10G Intel NICs and check whether the problem still exists.
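
To avoid typing the command once per NIC, something like the following might work. It is only a sketch: it assumes the 10G Intel NICs are the ones bound to the ixgbe driver and that sysfs exposes the driver under /sys/class/net/<nic>/device/driver. The function names and the DRY_RUN switch are made up for illustration.

```shell
#!/bin/sh
# Print the names of network interfaces bound to a given driver.
#   $1 = driver name (default: ixgbe)
#   $2 = sysfs network directory (default: /sys/class/net)
# ifaces_by_driver is a hypothetical helper name.
ifaces_by_driver() {
    driver=${1:-ixgbe}
    netdir=${2:-/sys/class/net}
    for dev in "$netdir"/*; do
        # Physical NICs expose their driver as a symlink at device/driver;
        # virtual devices (lo, bridges, bonds) have none and are skipped.
        [ -L "$dev/device/driver" ] || continue
        target=$(readlink "$dev/device/driver")
        if [ "$(basename "$target")" = "$driver" ]; then
            basename "$dev"
        fi
    done
}

# Turn LRO off on every ixgbe interface.
# DRY_RUN is a hypothetical safety switch: with DRY_RUN=1 (the default here)
# the ethtool commands are only printed; with DRY_RUN=0 they are executed.
disable_lro() {
    for nic in $(ifaces_by_driver ixgbe "${1:-/sys/class/net}"); do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ethtool -K $nic lro off"
        else
            ethtool -K "$nic" lro off
        fi
    done
}
```

Calling disable_lro prints the commands it would run; DRY_RUN=0 disable_lro (as root) applies them. The setting made by ethtool -K only lasts until the next reboot or driver reload, so it would have to be reapplied or made persistent separately if it helps.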


Kind regards,

 

Jurriën Bloemen


On 17-03-16 09:49, Johan Kooijman wrote:
Hi all,

Since we upgraded to the latest oVirt node running CentOS 7.2, we're seeing nodes become unavailable after a while. A node runs fine, with a couple of VMs on it, until it becomes unresponsive. At that moment it doesn't even respond to ICMP. It comes back by itself after a while, but oVirt fences the machine before that happens and restarts the VMs elsewhere.

The engine gives me this message:

VDSM host09 command failed: Message timeout which can be caused by communication issues

Is anyone else experiencing these issues with ixgbe drivers? I'm running on Intel X540-AT2 cards.

--
Met vriendelijke groeten / With kind regards,
Johan Kooijman


_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users

--
Met vriendelijke groeten / With kind regards,
Johan Kooijman