
Hi all,

Since we upgraded to the latest oVirt Node running 7.2, we're seeing that nodes become unavailable after a while. A node runs fine, with a couple of VMs on it, until it becomes non-responsive; at that point it doesn't even respond to ICMP. It comes back by itself after a while, but oVirt fences the machine before then and restarts the VMs elsewhere.

The engine reports this message:

VDSM host09 command failed: Message timeout which can be caused by communication issues

Is anyone else experiencing these issues with ixgbe drivers? I'm running on Intel X540-AT2 cards.

--
Met vriendelijke groeten / With kind regards,
Johan Kooijman

On Thu, Mar 17, 2016 at 10:49 AM, Johan Kooijman <mail@johankooijman.com> wrote:
> Is anyone else experiencing these issues with ixgbe drivers? I'm running on Intel X540-AT2 cards.

We will need engine and vdsm logs to understand this issue. Can you file a bug and attach full logs?

Nir
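For reference, the engine and vdsm logs usually live in the default locations below on an oVirt setup; the paths and archive names are assumptions to verify on your installation before attaching them to the bug.

```shell
# On the engine machine (default oVirt log location):
tar czf engine-logs.tar.gz /var/log/ovirt-engine/engine.log*

# On the affected hypervisor (host09 in this thread):
tar czf vdsm-logs.tar.gz /var/log/vdsm/vdsm.log*
```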

Johan,

If there is a temporary networking issue and you still want the engine not to fence the host, you can increase the heartbeat interval in the engine configuration. That tells the engine to wait longer before assuming that the host is not responding. Please provide the logs so we can understand why there is a communication issue in the first place.

Thanks,
Piotr

On Thu, Mar 17, 2016 at 12:52 PM, Nir Soffer <nsoffer@redhat.com> wrote:
> We will need engine and vdsm logs to understand this issue.
> Can you file a bug and attach full logs?
> Nir
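Piotr's heartbeat suggestion would typically be applied with `engine-config` on the engine machine. The option name `vdsHeartbeatInSeconds` below is an assumption; confirm the exact key for your oVirt version before changing it.

```shell
# Assumption: the heartbeat setting is called vdsHeartbeatInSeconds on this
# engine version; verify the key name first:
engine-config --list | grep -i heartbeat

# Inspect the current value, then raise it (value is in seconds):
engine-config --get vdsHeartbeatInSeconds
engine-config -s vdsHeartbeatInSeconds=30

# engine-config changes only take effect after an engine restart:
systemctl restart ovirt-engine
```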

Hi,

Is this on CentOS/RHEL 7.2?

Log in as root and see if you can find any messages from ixgbe about "tx queue hung" in dmesg. I currently have an open support case for RHEL 7.2 and the ixgbe driver, where a driver issue causes the network adapter to reset continuously when there is network traffic.

Regards,
Siggi

On Thu, March 17, 2016 12:52, Nir Soffer wrote:
> We will need engine and vdsm logs to understand this issue.
> Can you file a bug and attach full logs?
> Nir

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
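Siggi's check can be run as below. The exact log wording varies by driver version ("Detected Tx Unit Hang" is what ixgbe commonly logs), so treat the grep patterns as best-effort assumptions rather than the definitive strings.

```shell
# Scan the kernel ring buffer for ixgbe transmit-queue hangs
# (case-insensitive; pattern is a best guess, adjust for your driver version):
dmesg | grep -iE 'ixgbe.*(tx.*(hung|hang)|reset)'

# On systemd hosts, the persistent kernel log can be searched the same way:
journalctl -k --no-pager | grep -i 'tx unit hang'
```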

I had the same issue, and I also have a support case open. They referenced https://bugzilla.redhat.com/show_bug.cgi?id=1288237, which is private; I didn't have any success getting that bugzilla changed to public. We couldn't keep waiting for the issue to be fixed, so we replaced the NICs with Broadcom/QLogic cards that we knew had no issues in other hosts.

On Thu, Mar 17, 2016 at 11:27 AM, Sigbjorn Lie <sigbjorn@nixtra.com> wrote:
> I currently have an open support case for RHEL 7.2 and the ixgbe driver, where a driver issue causes the network adapter to reset continuously when there is network traffic.

Hi Jeff,

Was the issue ever resolved? I don't have permissions to view the bugzilla.

On Thu, Mar 17, 2016 at 4:34 PM, Jeff Spahr <spahrj@gmail.com> wrote:
> We couldn't keep waiting for the issue to be fixed so we replaced the NICs with Broadcom/Qlogic that we knew had no issues in other hosts.

--
Met vriendelijke groeten / With kind regards,
Johan Kooijman

Hi Johan,

On 07/18/2016 09:53 AM, Johan Kooijman wrote:
> Was the issue ever resolved? Don't have permissions to view the bugzilla.

There are proposed patches in the bugzilla; I have requested more information about their upstream status. As soon as I have updates, I will reply here.

For now, if you have the hardware and want to test against our latest upstream build jobs, the links are below:

ovirt-node 3.6:
http://jenkins.ovirt.org/job/ovirt-node_ovirt-3.6_create-iso-el7_merged/

ovirt-node 4.0 (next):
http://jenkins.ovirt.org/job/ovirt-node-ng_ovirt-4.0-snapshot_build-artifacts-fc23-x86_64/

Thanks!

Hi Johan,

Could you check if you see the following in your dmesg or messages log file?

[1123306.014288] ------------[ cut here ]------------
[1123306.014302] WARNING: at net/core/dev.c:2189 skb_warn_bad_offload+0xcd/0xda()
[1123306.014306] : caps=(0x0000000200004849, 0x0000000000000000) len=330 data_len=276 gso_size=276 gso_type=1 ip_summed=1
[1123306.014308] Modules linked in: vhost_net macvtap macvlan ip6table_filter ip6_tables iptable_filter ip_tables ebt_arp ebtable_nat ebtables tun scsi_transport_iscsi iTCO_wdt iTCO_vendor_support dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cryptd pcspkr sb_edac edac_core i2c_i801 lpc_ich mfd_core mei_me mei wmi ioatdma shpchp ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter acpi_pad 8021q garp mrp bridge stp llc bonding dm_multipath xfs libcrc32c sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm crc32c_intel igb drm ahci ixgbe i2c_algo_bit libahci libata mdio i2c_core ptp megaraid_sas pps_core dca dm_mirror dm_region_hash dm_log dm_mod
[1123306.014360] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G W -------------- 3.10.0-229.1.2.el7.x86_64 #1
[1123306.014362] Hardware name: Supermicro SYS-2028TP-HC1TR/X10DRT-PT, BIOS 1.1 08/03/2015
[1123306.014364] ffff881fffc439a8 5326fb90ad1041ea ffff881fffc43960 ffffffff81604afa
[1123306.014371] ffff881fffc43998 ffffffff8106e34b ffff881fcebb0500 ffff881fce88c000
[1123306.014376] 0000000000000001 0000000000000001 ffff881fcebb0500 ffff881fffc43a00
[1123306.014381] Call Trace:
[1123306.014383] <IRQ> [<ffffffff81604afa>] dump_stack+0x19/0x1b
[1123306.014396] [<ffffffff8106e34b>] warn_slowpath_common+0x6b/0xb0
[1123306.014399] [<ffffffff8106e3ec>] warn_slowpath_fmt+0x5c/0x80
[1123306.014405] [<ffffffff812db093>] ? ___ratelimit+0x93/0x100
[1123306.014409] [<ffffffff816076c3>] skb_warn_bad_offload+0xcd/0xda
[1123306.014425] [<ffffffff814fdeb9>] __skb_gso_segment+0x79/0xb0
[1123306.014429] [<ffffffff814fe1c2>] dev_hard_start_xmit+0x1a2/0x580
[1123306.014438] [<ffffffffa0168790>] ? deliver_clone+0x50/0x50 [bridge]
[1123306.014443] [<ffffffff8151df1e>] sch_direct_xmit+0xee/0x1c0
[1123306.014447] [<ffffffff814fe798>] dev_queue_xmit+0x1f8/0x4a0
[1123306.014453] [<ffffffffa016880b>] br_dev_queue_push_xmit+0x7b/0xc0 [bridge]
[1123306.014458] [<ffffffffa0168a22>] br_forward_finish+0x22/0x60 [bridge]
[1123306.014464] [<ffffffffa0168ae0>] __br_forward+0x80/0xf0 [bridge]
[1123306.014469] [<ffffffffa0168ebb>] br_forward+0x8b/0xa0 [bridge]
[1123306.014476] [<ffffffffa0169e65>] br_handle_frame_finish+0x175/0x410 [bridge]
[1123306.014481] [<ffffffffa016a275>] br_handle_frame+0x175/0x260 [bridge]
[1123306.014485] [<ffffffff814fc112>] __netif_receive_skb_core+0x282/0x870
[1123306.014490] [<ffffffff8101b589>] ? read_tsc+0x9/0x10
[1123306.014493] [<ffffffff814fc718>] __netif_receive_skb+0x18/0x60
[1123306.014497] [<ffffffff814fc7a0>] netif_receive_skb+0x40/0xd0
[1123306.014500] [<ffffffff814fd2b0>] napi_gro_receive+0x80/0xb0
[1123306.014512] [<ffffffffa00cde2c>] ixgbe_clean_rx_irq+0x7ac/0xb30 [ixgbe]
[1123306.014519] [<ffffffffa00cf07b>] ixgbe_poll+0x4bb/0x930 [ixgbe]
[1123306.014524] [<ffffffff814fcb62>] net_rx_action+0x152/0x240
[1123306.014528] [<ffffffff81077bf7>] __do_softirq+0xf7/0x290
[1123306.014533] [<ffffffff8161635c>] call_softirq+0x1c/0x30
[1123306.014539] [<ffffffff81015de5>] do_softirq+0x55/0x90
[1123306.014543] [<ffffffff81077f95>] irq_exit+0x115/0x120
[1123306.014546] [<ffffffff81616ef8>] do_IRQ+0x58/0xf0
[1123306.014551] [<ffffffff8160c0ed>] common_interrupt+0x6d/0x6d
[1123306.014553] <EOI> [<ffffffff814aa6d2>] ? cpuidle_enter_state+0x52/0xc0
[1123306.014561] [<ffffffff814aa6c8>] ? cpuidle_enter_state+0x48/0xc0
[1123306.014565] [<ffffffff814aa805>] cpuidle_idle_call+0xc5/0x200
[1123306.014569] [<ffffffff8101d21e>] arch_cpu_idle+0xe/0x30
[1123306.014574] [<ffffffff810c6945>] cpu_startup_entry+0xf5/0x290
[1123306.014580] [<ffffffff810423ca>] start_secondary+0x1ba/0x230
[1123306.014582] ---[ end trace 4d5a1bc838e1fcc0 ]---

If so, could you try the following:

ethtool -K <nic name> lro off

Do this for all the 10G Intel NICs and check if the problem still exists.

Kind regards,
Jurriën Bloemen

On 17-03-16 09:49, Johan Kooijman wrote:
> Is anyone else experiencing these issues with ixgbe drivers? I'm running on Intel X540-AT2 cards.
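Jurriën's `ethtool -K ... lro off` workaround can be checked and, on RHEL/CentOS 7, made persistent via the ifcfg files. The interface name `em1` below is an example, and the `ETHTOOL_OPTS` convention is an assumption to verify against your distribution's network-scripts documentation.

```shell
# Check whether large-receive-offload (LRO) is currently on
# ("em1" is an example; substitute each 10G ixgbe interface):
ethtool -k em1 | grep large-receive-offload

# Disable LRO at runtime, as suggested:
ethtool -K em1 lro off

# To persist across reboots on EL7, the usual approach (assumption) is an
# ETHTOOL_OPTS line in /etc/sysconfig/network-scripts/ifcfg-em1:
#   ETHTOOL_OPTS="-K em1 lro off"
```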

Hi Jurrien, I don't see anything in logs on the nodes itself. The only thing we see in logs are in engine log - it looses connectivity to the host. Definitely CentOS 7.1/7.2 related. Downgraded the hosts to ovirt-iso 3.5, this resolves the issue. On Fri, Mar 18, 2016 at 9:01 AM, Bloemen, Jurriën < Jurrien.Bloemen@dmc.amcnetworks.com> wrote:
Hi Johan,
Could you check if you see the following in you dmesg or message log file?
[1123306.014288] ------------[ cut here ]------------
[1123306.014302] WARNING: at net/core/dev.c:2189 skb_warn_bad_offload+0xcd/0xda()
[1123306.014306] : caps=(0x0000000200004849, 0x0000000000000000) len=330 data_len=276 gso_size=276 gso_type=1 ip_summed=1
[1123306.014308] Modules linked in: vhost_net macvtap macvlan ip6table_filter ip6_tables iptable_filter ip_tables ebt_arp ebtable_nat ebtables tun scsi_transport_iscsi iTCO_wdt iTCO_vendor_support dm_service_time intel_powerclamp coretemp intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cryptd pcspkr sb_edac edac_core i2c_i801 lpc_ich mfd_core mei_me mei wmi ioatdma shpchp ipmi_devintf ipmi_si ipmi_msghandler acpi_power_meter acpi_pad 8021q garp mrp bridge stp llc bonding dm_multipath xfs libcrc32c sd_mod crc_t10dif crct10dif_common ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm crc32c_intel igb drm ahci ixgbe i2c_algo_bit libahci libata mdio i2c_core ptp megaraid_sas pps_core dca dm_mirror dm_region_hash dm_log dm_mod
[1123306.014360] CPU: 30 PID: 0 Comm: swapper/30 Tainted: G W -------------- 3.10.0-229.1.2.el7.x86_64 #1
[1123306.014362] Hardware name: Supermicro SYS-2028TP-HC1TR/X10DRT-PT, BIOS 1.1 08/03/2015
[1123306.014364] ffff881fffc439a8 5326fb90ad1041ea ffff881fffc43960 ffffffff81604afa
[1123306.014371] ffff881fffc43998 ffffffff8106e34b ffff881fcebb0500 ffff881fce88c000
[1123306.014376] 0000000000000001 0000000000000001 ffff881fcebb0500 ffff881fffc43a00
[1123306.014381] Call Trace:
[1123306.014383] <IRQ> [<ffffffff81604afa>] dump_stack+0x19/0x1b
[1123306.014396] [<ffffffff8106e34b>] warn_slowpath_common+0x6b/0xb0
[1123306.014399] [<ffffffff8106e3ec>] warn_slowpath_fmt+0x5c/0x80
[1123306.014405] [<ffffffff812db093>] ? ___ratelimit+0x93/0x100
[1123306.014409] [<ffffffff816076c3>] skb_warn_bad_offload+0xcd/0xda
[1123306.014425] [<ffffffff814fdeb9>] __skb_gso_segment+0x79/0xb0
[1123306.014429] [<ffffffff814fe1c2>] dev_hard_start_xmit+0x1a2/0x580
[1123306.014438] [<ffffffffa0168790>] ? deliver_clone+0x50/0x50 [bridge]
[1123306.014443] [<ffffffff8151df1e>] sch_direct_xmit+0xee/0x1c0
[1123306.014447] [<ffffffff814fe798>] dev_queue_xmit+0x1f8/0x4a0
[1123306.014453] [<ffffffffa016880b>] br_dev_queue_push_xmit+0x7b/0xc0 [bridge]
[1123306.014458] [<ffffffffa0168a22>] br_forward_finish+0x22/0x60 [bridge]
[1123306.014464] [<ffffffffa0168ae0>] __br_forward+0x80/0xf0 [bridge]
[1123306.014469] [<ffffffffa0168ebb>] br_forward+0x8b/0xa0 [bridge]
[1123306.014476] [<ffffffffa0169e65>] br_handle_frame_finish+0x175/0x410 [bridge]
[1123306.014481] [<ffffffffa016a275>] br_handle_frame+0x175/0x260 [bridge]
[1123306.014485] [<ffffffff814fc112>] __netif_receive_skb_core+0x282/0x870
[1123306.014490] [<ffffffff8101b589>] ? read_tsc+0x9/0x10
[1123306.014493] [<ffffffff814fc718>] __netif_receive_skb+0x18/0x60
[1123306.014497] [<ffffffff814fc7a0>] netif_receive_skb+0x40/0xd0
[1123306.014500] [<ffffffff814fd2b0>] napi_gro_receive+0x80/0xb0
[1123306.014512] [<ffffffffa00cde2c>] ixgbe_clean_rx_irq+0x7ac/0xb30 [ixgbe]
[1123306.014519] [<ffffffffa00cf07b>] ixgbe_poll+0x4bb/0x930 [ixgbe]
[1123306.014524] [<ffffffff814fcb62>] net_rx_action+0x152/0x240
[1123306.014528] [<ffffffff81077bf7>] __do_softirq+0xf7/0x290
[1123306.014533] [<ffffffff8161635c>] call_softirq+0x1c/0x30
[1123306.014539] [<ffffffff81015de5>] do_softirq+0x55/0x90
[1123306.014543] [<ffffffff81077f95>] irq_exit+0x115/0x120
[1123306.014546] [<ffffffff81616ef8>] do_IRQ+0x58/0xf0
[1123306.014551] [<ffffffff8160c0ed>] common_interrupt+0x6d/0x6d
[1123306.014553] <EOI> [<ffffffff814aa6d2>] ? cpuidle_enter_state+0x52/0xc0
[1123306.014561] [<ffffffff814aa6c8>] ? cpuidle_enter_state+0x48/0xc0
[1123306.014565] [<ffffffff814aa805>] cpuidle_idle_call+0xc5/0x200
[1123306.014569] [<ffffffff8101d21e>] arch_cpu_idle+0xe/0x30
[1123306.014574] [<ffffffff810c6945>] cpu_startup_entry+0xf5/0x290
[1123306.014580] [<ffffffff810423ca>] start_secondary+0x1ba/0x230
[1123306.014582] ---[ end trace 4d5a1bc838e1fcc0 ]---
If so, then could you try the following:
ethtool -K <nic name> lro off
Do this for all of the 10G Intel NICs and check whether the problem still exists.
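If you'd rather not hunt for the interface names by hand, a small sketch like this (assuming a Linux host with sysfs mounted; the loop only prints the commands so you can review them before running them as root) will emit the `ethtool -K ... lro off` line for every NIC bound to the ixgbe driver:

```shell
# Print "ethtool -K <nic> lro off" for every interface whose driver
# symlink in sysfs resolves to ixgbe. Review the output, then run the
# printed commands as root.
for dev in /sys/class/net/*/; do
    nic=$(basename "$dev")
    drv=$(basename "$(readlink -f "$dev/device/driver" 2>/dev/null)" 2>/dev/null)
    if [ "$drv" = "ixgbe" ]; then
        echo "ethtool -K $nic lro off"
    fi
done
```

Note that ethtool settings do not survive a reboot; on CentOS 7 they can be reapplied at ifup time, for example via ETHTOOL_OPTS in the interface's ifcfg file.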
Kind regards,
Jurriën Bloemen
On 17-03-16 09:49, Johan Kooijman wrote:
Hi all,
Since we upgraded to the latest ovirt node running 7.2, we're seeing that nodes become unavailable after a while. It's running fine, with a couple of VM's on it, until it becomes non-responsive. At that moment it doesn't even respond to ICMP. It'll come back by itself after a while, but oVirt fences the machine before that time and restarts VM's elsewhere.
Engine tells me this message:
VDSM host09 command failed: Message timeout which can be caused by communication issues
Is anyone else experiencing these issues with ixgbe drivers? I'm running on Intel X540-AT2 cards.
-- Met vriendelijke groeten / With kind regards, Johan Kooijman
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
-- Met vriendelijke groeten / With kind regards, Johan Kooijman
participants (7):
- Bloemen, Jurriën
- Douglas Schilling Landgraf
- Jeff Spahr
- Johan Kooijman
- Nir Soffer
- Piotr Kliczewski
- Sigbjorn Lie