
Hi,

Every Monday and Wednesday morning there are Gluster connectivity timeouts, but all checks of the network and network configs are OK.

Description of problem: the following entries were found in the engine.log after VMs became unresponsive and hosts were fenced. This has been causing problems since the beginning of September, and no amount of reading logs is helping. The issue occurs every Wednesday morning at exactly the same time.

WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-974045) [] domain 'bc482086-598b-46b1-9189-0146fa03447c:pltfm_data03' in problem 'PROBLEMATIC'. vds: 'bdtpltfmovt02'
WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-974069) [] domain 'bf807836-b64e-4913-ab41-cfe04ca9abab:pltfm_data01' in problem 'PROBLEMATIC'. vds: 'bdtpltfmovt02'
WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-974121) [] domain 'bc482086-598b-46b1-9189-0146fa03447c:pltfm_data03' in problem 'PROBLEMATIC'. vds: 'bdtpltfmovt03'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM '00a082de-c827-4c97-9846-ec32d1ddbfa6'(bdtfmnpproddb03) moved from 'Up' --> 'NotResponding'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM 'f5457f04-054e-4684-9702-40ed4a3e4bdb'(bdtk8shaproxy02) moved from 'Up' --> 'NotResponding'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM 'bea85b27-18e7-4936-9871-cdb987baebdd'(bdtdepjump) moved from 'Up' --> 'NotResponding'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM 'ba1d4fe2-97e7-491a-9485-8319281e7784'(bdtcmgmtnfs01) moved from 'Up' --> 'NotResponding'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM '9e253848-7153-43e8-8126-dba2d7f2d214'(bdtdepnfs01) moved from 'Up' --> 'NotResponding'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM '1e58d106-4b65-4296-8d11-2142abb7808e'(bdtionjump) moved from 'Up' --> 'NotResponding'

VDSM log from one of the Gluster peers:

[2020-10-05 03:03:25.038883] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-pltfm_data02-client-0: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2020-10-07 05:37:39.582138] C [rpc-clnt-ping.c:162:rpc_clnt_ping_timer_expired] 0-pltfm_data02-client-0: server x.x.x.x:49153 has not responded in the last 30 seconds, disconnecting.
[2020-10-07 05:37:39.583217] I [MSGID: 114018] [client.c:2288:client_rpc_notify] 0-pltfm_data02-client-0: disconnected from pltfm_data02-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2020-10-07 05:37:39.584213] E [rpc-clnt.c:346:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fe83aa96fbb] (--> /lib64/libgfrpc.so.0(+0xce11)[0x7fe83a85fe11] (--> /lib64/libgfrpc.so.0(+0xcf2e)[0x7fe83a85ff2e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fe83a861521] (--> /lib64/libgfrpc.so.0(+0xf0c8)[0x7fe83a8620c8] ))))) 0-pltfm_data02-client-0: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2020-10-07 05:37:09.003907 (xid=0x7e6a8c)

Current version: 4.3.4.3-1.el7. Although we are keen to upgrade, we need stability in this production environment before doing so.

This data center has 2 x 3-node clusters (Admin & Platform), each with a 3-replica Gluster configuration that is not managed by the self-hosted oVirt engine.

Any assistance is appreciated.

Regards
Shimme
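Not part of the original post: a minimal sketch of first-pass Gluster CLI checks for the ping-timeout disconnects, assuming shell access to one of the peers. The volume name pltfm_data02 is taken from the client log above; adjust as needed.

#!/usr/bin/env bash
# First-pass checks on one of the Gluster peers (volume name is an example).
VOL=pltfm_data02

# The "has not responded in the last 30 seconds" message is governed by the
# client-side ping timeout; confirm what the volume is actually running with.
gluster volume get "$VOL" network.ping-timeout

# Confirm every brick is online and note the brick ports (the log shows the
# client losing x.x.x.x:49153).
gluster volume status "$VOL"

# List which clients are connected to each brick, and from which IP/port.
gluster volume status "$VOL" clients

# Pending self-heals are a hint that bricks were genuinely unreachable
# rather than just slow to respond.
gluster volume heal "$VOL" info summary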

> Every Monday and Wednesday morning there are gluster connectivity timeouts but all checks of the network and network configs are ok.
Based on this I make the following conclusions:

1. The issue is recurring.
2. You most probably have a network issue.

Have you checked the following:

- Are there any ping timeouts between the FUSE clients and the Gluster nodes?
- Have you tried disabling fencing and checking the logs after the issue recurs?
- Are you sharing the backup and prod networks? Is it possible that some backup or other production load in your environment is "blacking out" your oVirt?
- Have you checked the Gluster cluster's logs for anything meaningful?

Best Regards,
Strahil Nikolov
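Not part of the original reply: a minimal sketch of how the first check (ping timeouts between the FUSE clients and the Gluster nodes) could be captured ahead of the next Monday/Wednesday window. The storage-network hostnames, log path, and interval are placeholders.

#!/usr/bin/env bash
# Simple ping logger to run on each host over the storage network.
PEERS="gluster1-strg gluster2-strg gluster3-strg"   # placeholder hostnames
LOG=/var/log/gluster-ping-check.log

while true; do
    for peer in $PEERS; do
        # 5 pings, 1s apart; record a timestamp plus the loss/rtt summary lines.
        result=$(ping -c 5 -i 1 -W 2 "$peer" | tail -2 | tr '\n' ' ')
        echo "$(date -Is) $peer $result" >> "$LOG"
    done
    sleep 30
done

Left running under tmux or a systemd unit on each host, the log gives a timestamped view of whether the storage network itself stalls at the moment the Gluster disconnects occur.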

Thanks Strahil.

I have found between 1 and 4 Gluster peer rpc-clnt-ping timer expired messages in the rhev-data-center-mnt-glusterSD-hostname-strg:_pltfm_data01.log on the storage network IP. Of the 6 hosts, only 1 does not have these timeouts.

Fencing has been disabled, but can you identify which logs are key to identifying the cause please?

It's a bonded (bond1) 10Gb ovirt-mgmt logical network and Prod VM VLAN interface, AND a bonded (bond2) 10Gb Gluster storage network. Dropped packets are seen incrementing in vdsm.log, but neither ethtool -S nor the kernel logs show dropped packets. I am wondering if they are being dropped because the ring buffers are too small.

Kind Regards
Shimme
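Not from the thread: a hedged sketch of how the ring-buffer theory could be checked. Ring buffers are a per-NIC setting, so they have to be read (and, if needed, raised) on bond2's member interfaces rather than on the bond itself; the ens1f0/ens1f1 names below are placeholders.

#!/usr/bin/env bash
# Inspect the storage bond's member NICs; replace the interface names with
# the real bond2 slaves listed by the bonding status file.
cat /proc/net/bonding/bond2          # shows slave interfaces and link state

for nic in ens1f0 ens1f1; do         # placeholder slave names
    echo "=== $nic ==="
    # Current vs. maximum supported ring sizes.
    ethtool -g "$nic"
    # NIC/driver counters; missed/no-buffer style counters point at
    # undersized rings, CRC errors at cabling or optics.
    ethtool -S "$nic" | grep -Ei 'drop|miss|no_buffer|fifo|crc'
    # Kernel-level per-interface drop counters.
    ip -s link show "$nic"
done

# If the rings are at their defaults, they can usually be raised towards the
# hardware maximum reported by "ethtool -g", e.g.:
# ethtool -G ens1f0 rx 4096 tx 4096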

I have seen many "checks" that are "OK"... Have you checked that backups are not run over the same network?

I would disable the power management (fencing), so I can find out what has happened to the systems.

Best Regards,
Strahil Nikolov
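Not from the thread: since the outages recur on the same weekdays at the same time, one way to act on this advice is to sweep the hosts and Gluster peers for scheduled jobs (backups, snapshots, scrubs) that fire in that window, and to pull throughput history around it if sysstat is already collecting. The sa07 file name below simply matches the 7 October incident in the logs and is otherwise an assumption.

#!/usr/bin/env bash
# Sweep for anything scheduled in the outage window; run on each host/peer.

# systemd timers with their last/next trigger times.
systemctl list-timers --all

# root and system crontabs.
crontab -l 2>/dev/null
cat /etc/crontab /etc/cron.d/* 2>/dev/null

# Throughput history around the incident (requires sysstat data collection),
# e.g. network device stats between 05:00 and 06:00 on the 7th:
# sar -n DEV -f /var/log/sa/sa07 -s 05:00:00 -e 06:00:00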
participants (3)
- Simon Scott
- simon@justconnect.ie
- Strahil Nikolov