Hi,
Every Monday and Wednesday morning there are Gluster connectivity timeouts, but all checks
of the network and the network configuration come back clean.
Description of problem:
The following entries were found in engine.log after VMs became unresponsive and hosts were
fenced. The problem has been recurring since the beginning of September and no amount of
reading logs is helping; it hits every Wednesday morning at exactly the same time.
WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy]
(EE-ManagedThreadFactory-engine-Thread-974045) [] domain
'bc482086-598b-46b1-9189-0146fa03447c:pltfm_data03' in problem
'PROBLEMATIC'. vds: 'bdtpltfmovt02'
WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy]
(EE-ManagedThreadFactory-engine-Thread-974069) [] domain
'bf807836-b64e-4913-ab41-cfe04ca9abab:pltfm_data01' in problem
'PROBLEMATIC'. vds: 'bdtpltfmovt02'
WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy]
(EE-ManagedThreadFactory-engine-Thread-974121) [] domain
'bc482086-598b-46b1-9189-0146fa03447c:pltfm_data03' in problem
'PROBLEMATIC'. vds: 'bdtpltfmovt03'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM
'00a082de-c827-4c97-9846-ec32d1ddbfa6'(bdtfmnpproddb03) moved from 'Up'
--> 'NotResponding'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM
'f5457f04-054e-4684-9702-40ed4a3e4bdb'(bdtk8shaproxy02) moved from 'Up'
--> 'NotResponding'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM
'bea85b27-18e7-4936-9871-cdb987baebdd'(bdtdepjump) moved from 'Up' -->
'NotResponding'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM
'ba1d4fe2-97e7-491a-9485-8319281e7784'(bdtcmgmtnfs01) moved from 'Up'
--> 'NotResponding'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM
'9e253848-7153-43e8-8126-dba2d7f2d214'(bdtdepnfs01) moved from 'Up' -->
'NotResponding'
INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(EE-ManagedThreadFactory-engineScheduled-Thread-78) [] VM
'1e58d106-4b65-4296-8d11-2142abb7808e'(bdtionjump) moved from 'Up' -->
'NotResponding'
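
Since the point-in-time network checks come back clean, one idea is to leave a continuous
reachability probe running on the hosts across a Monday/Wednesday morning so that even a
brief drop shows up with a timestamp. A minimal Python sketch of that idea follows; the peer
hostnames are placeholders, 24007 is the standard glusterd management port, and 49153 is the
brick port seen in the client log further down.

#!/usr/bin/env python3
"""Minimal TCP reachability probe for Gluster peers (sketch, placeholder hosts/ports)."""
import socket
import time
from datetime import datetime

PEERS = ["gluster-peer-1", "gluster-peer-2", "gluster-peer-3"]  # placeholder hostnames
PORTS = [24007, 49153]   # glusterd management port + brick port from the client log below
INTERVAL = 10            # seconds between probe rounds
TIMEOUT = 5              # per-connection timeout, seconds

def probe(host, port):
    """Return True if a TCP connection to host:port succeeds within TIMEOUT."""
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT):
            return True
    except OSError:
        return False

while True:
    stamp = datetime.now().isoformat(timespec="seconds")
    for host in PEERS:
        for port in PORTS:
            if not probe(host, port):
                print(f"{stamp} UNREACHABLE {host}:{port}", flush=True)
    time.sleep(INTERVAL)

Any UNREACHABLE lines timed just before the domains go PROBLEMATIC would point back at the
network after all; a clean run during an incident would shift suspicion to the Gluster/storage
side.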
VDSM Log from one of the Gluster Peers:
[2020-10-05 03:03:25.038883] W [MSGID: 114031]
[client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-pltfm_data02-client-0: remote operation
failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
connected]
[2020-10-07 05:37:39.582138] C [rpc-clnt-ping.c:162:rpc_clnt_ping_timer_expired]
0-pltfm_data02-client-0: server x.x.x.x:49153 has not responded in the last 30 seconds,
disconnecting.
[2020-10-07 05:37:39.583217] I [MSGID: 114018] [client.c:2288:client_rpc_notify]
0-pltfm_data02-client-0: disconnected from pltfm_data02-client-0. Client process will keep
trying to connect to glusterd until brick's port is available
[2020-10-07 05:37:39.584213] E [rpc-clnt.c:346:saved_frames_unwind] (-->
/lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fe83aa96fbb] (-->
/lib64/libgfrpc.so.0(+0xce11)[0x7fe83a85fe11] (-->
/lib64/libgfrpc.so.0(+0xcf2e)[0x7fe83a85ff2e] (-->
/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7fe83a861521] (-->
/lib64/libgfrpc.so.0(+0xf0c8)[0x7fe83a8620c8] ))))) 0-pltfm_data02-client-0: forced
unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2020-10-07 05:37:09.003907
(xid=0x7e6a8c)
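
Given how precisely the problem repeats on the same mornings, another idea is to line the
disconnect timestamps up against anything scheduled in those windows (cron jobs, backups,
geo-replication, scrub/heal runs, switch maintenance). Below is a rough Python sketch that
pulls the relevant timestamps out of a Gluster client log; the log path is only a placeholder
for whichever mount or brick log is being scanned.

#!/usr/bin/env python3
"""Extract disconnect-related timestamps from a Gluster client log (sketch, placeholder path)."""
import re
import sys

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "gluster-client.log"  # placeholder

# Gluster log lines start with "[YYYY-MM-DD HH:MM:SS.micros]"
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
KEYWORDS = ("ping_timer_expired", "has not responded", "disconnected from",
            "Transport endpoint is not connected")

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        if any(key in line for key in KEYWORDS):
            match = TIMESTAMP.match(line)
            when = match.group(1) if match else "unknown time"
            print(f"{when}  {line.strip()}")

Comparing that output against the crontabs and backup schedules on the hosts and peers should
show whether anything else fires at exactly the same time each week.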
Current version: 4.3.4.3-1.el7. Although we are keen to upgrade, we need this production
environment to be stable before doing so.
This data center has two 3-node clusters (Admin & Platform), each with a 3-replica Gluster
configuration that is not managed by the self-hosted oVirt engine.
Any assistance is appreciated.
Regards
Shimme