[ovirt-users] Host down/activation loop

Jamie Lawrence jlawrence at squaretrade.com
Wed Mar 21 23:39:38 UTC 2018


Hello,

I have an issue that feels sanlock-related, but I can't get it sorted out on our installation. This is 4.2.1 with a hosted engine. One of our hosts is stuck in a loop. It:

 - gets a VDSM GetStatsVDS timeout and is marked as down,
 - throws a warning about not being fenced (because fencing isn't enabled yet, because of this problem),
 - and is set back to Up about a minute later.

This repeats every 4 minutes and 20 seconds.
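
One way to confirm the period of the loop is to pull the timestamps of the "not responding" events out of engine.log and difference them. The following is a minimal sketch, not part of the original report: the function names are made up for illustration, and it assumes every log line starts with a timestamp in the same format as the excerpts further down (for example 2018-03-21 16:09:26,088-07). The timezone suffix is ignored, which is fine as long as all lines come from the same engine.

```python
from datetime import datetime

# Audit-log event name taken from the engine.log excerpt in this post.
MARKER = "VDS_HOST_NOT_RESPONDING"


def event_times(lines, marker=MARKER):
    """Return the timestamps of all lines that contain the marker."""
    times = []
    for line in lines:
        if marker in line:
            # The first 23 characters look like "2018-03-21 16:09:26,088".
            times.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f"))
    return times


def intervals(times):
    """Return the gaps, in seconds, between consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Feeding it the lines of the engine log (for example from open("engine.log")) and printing intervals(event_times(lines)) should show values near 260 seconds if the loop really repeats every 4 minutes and 20 seconds.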

The hosted engine is running on the host that is stuck like this, and the loop doesn't appear to get in the way of creating new VMs or other operations, but obviously I can't use fencing, which is a big part of the point of running oVirt in the first place.

I tried setting global maintenance and running hosted-engine --reinitialize-lockspace, which (a) took almost exactly 2 minutes to run, making me think something timed out, (b) exited with rc 0, and (c) didn't fix the problem.

Anyone have an idea of how to fix this?

-j



- - details - -

I still can't quite figure out how to interpret what sanlock says, but the -1s look wrong.

[sc5-ovirt-1]# sanlock client status
daemon bedae69e-03cc-49f8-88f4-9674a85a3185.sc5-ovirt-
p -1 helper
p -1 listener
p 122268 HostedEngine
p -1 status
s 1aabcd3a-3fd3-4902-b92e-17beaf8fe3fd:1:/rhev/data-center/mnt/glusterSD/172.16.0.151\:_sc5-images/1aabcd3a-3fd3-4902-b92e-17beaf8fe3fd/dom_md/ids:0
s b41eb20a-eafb-481b-9a50-a135cf42b15e:1:/rhev/data-center/mnt/glusterSD/sc5-gluster-10g-1\:_sc5-ovirt__engine/b41eb20a-eafb-481b-9a50-a135cf42b15e/dom_md/ids:0
r b41eb20a-eafb-481b-9a50-a135cf42b15e:8f0c9f7a-ae6a-476e-b6f3-a830dcb79e87:/rhev/data-center/mnt/glusterSD/172.16.0.153\:_sc5-ovirt__engine/b41eb20a-eafb-481b-9a50-a135cf42b15e/images/a9d01d59-f146-47e5-b514-d10f8867678e/8f0c9f7a-ae6a-476e-b6f3-a830dcb79e87.lease:0:5 p 122268
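
For reading this output, a small parser can help. The sketch below is an illustration only: it assumes the field layout described in the sanlock man page (lockspace lines as name:host_id:path:offset, resource lines as lockspace:resource:path:offset:lver followed by the owning pid), and the function names are invented here. As far as I know, the "p -1" entries are sanlock's own internal helper, listener and status slots rather than errors, but treat that as an assumption and not a diagnosis.

```python
import re

# Sample lines copied from the output above (one lockspace, one resource).
SAMPLE = r"""daemon bedae69e-03cc-49f8-88f4-9674a85a3185.sc5-ovirt-
p -1 helper
p -1 listener
p 122268 HostedEngine
p -1 status
s 1aabcd3a-3fd3-4902-b92e-17beaf8fe3fd:1:/rhev/data-center/mnt/glusterSD/172.16.0.151\:_sc5-images/1aabcd3a-3fd3-4902-b92e-17beaf8fe3fd/dom_md/ids:0
r b41eb20a-eafb-481b-9a50-a135cf42b15e:8f0c9f7a-ae6a-476e-b6f3-a830dcb79e87:/rhev/data-center/mnt/glusterSD/172.16.0.153\:_sc5-ovirt__engine/b41eb20a-eafb-481b-9a50-a135cf42b15e/images/a9d01d59-f146-47e5-b514-d10f8867678e/8f0c9f7a-ae6a-476e-b6f3-a830dcb79e87.lease:0:5 p 122268
"""


def split_unescaped(field, sep=":"):
    """Split on sep, keeping backslash-escaped separators as literal text."""
    parts = re.split(r"(?<!\\)" + re.escape(sep), field)
    return [p.replace("\\" + sep, sep) for p in parts]


def parse_status(text):
    """Group 'sanlock client status' lines by their one-letter prefix."""
    out = {"daemon": None, "clients": [], "lockspaces": [], "resources": []}
    for line in text.splitlines():
        kind, _, rest = line.strip().partition(" ")
        if kind == "daemon":
            out["daemon"] = rest
        elif kind == "p":
            # Client slot: pid (or -1) followed by a name.
            pid, _, name = rest.partition(" ")
            out["clients"].append((int(pid), name))
        elif kind == "s":
            name, host_id, path, offset = split_unescaped(rest)
            out["lockspaces"].append(
                {"name": name, "host_id": int(host_id),
                 "path": path, "offset": int(offset)})
        elif kind == "r":
            # Resource spec, then " p <pid>" naming the owning process.
            spec, _, owner = rest.partition(" p ")
            parts = split_unescaped(spec)
            out["resources"].append(
                {"lockspace": parts[0], "resource": parts[1],
                 "path": parts[2], "offset": int(parts[3]),
                 "lver": int(parts[4]) if len(parts) > 4 else None,
                 "pid": int(owner) if owner else None})
    return out
```

Calling parse_status(SAMPLE) makes it easier to see that the only client with a real pid is HostedEngine (122268), and that the one resource lease listed is owned by that same pid.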


engine.log:

2018-03-21 16:09:26,081-07 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM sc5-ovirt-1 command GetStatsVDS failed: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Command 'GetStatsVDSCommand(HostName = sc5-ovirt-1, VdsIdAndVdsVDSCommandParametersBase:{hostId='be3517e0-f79d-464c-8169-f786d13ac287', vds='Host[sc5-ovirt-1,be3517e0-f79d-464c-8169-f786d13ac287]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Failed getting vds stats, host='sc5-ovirt-1'(be3517e0-f79d-464c-8169-f786d13ac287): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Failure to refresh host 'sc5-ovirt-1' runtime info: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Failed to refresh VDS, network error, continuing, vds='sc5-ovirt-1'(be3517e0-f79d-464c-8169-f786d13ac287): VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-102682) [] Host 'sc5-ovirt-1' is not responding.
2018-03-21 16:09:26,088-07 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-102682) [] EVENT_ID: VDS_HOST_NOT_RESPONDING(9,027), Host sc5-ovirt-1 is not responding. Host cannot be fenced automatically because power management for the host is disabled.
2018-03-21 16:09:27,070-07 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to sc5-ovirt-1/10.181.26.129
2018-03-21 16:09:27,918-07 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [493fb316] START, GlusterServersListVDSCommand(HostName = sc5-gluster-2, VdsIdVDSCommandParametersBase:{hostId='797cbf42-6553-4a75-b8b1-93b2adbbc0db'}), log id: 6afccc01
2018-03-21 16:09:28,579-07 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [493fb316] FINISH, GlusterServersListVDSCommand, return: [192.168.122.1/24:CONNECTED, sc5-gluster-3:CONNECTED, sc5-gluster-10g-1:CONNECTED], log id: 6afccc01
2018-03-21 16:09:28,606-07 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [493fb316] START, GlusterVolumesListVDSCommand(HostName = sc5-gluster-2, GlusterVolumesListVDSParameters:{hostId='797cbf42-6553-4a75-b8b1-93b2adbbc0db'}), log id: 44e90100
2018-03-21 16:09:29,015-07 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler4) [493fb316] FINISH, GlusterVolumesListVDSCommand, return: {6fe949b5-894a-4843-b3e4-af81545574dc=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@140a4a60, bc29ba89-8fc0-494d-9fe5-bc7b34396b65=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@29637467}, log id: 44e90100
2018-03-21 16:09:29,686-07 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [] START, GetHardwareInfoVDSCommand(HostName = sc5-ovirt-1, VdsIdAndVdsVDSCommandParametersBase:{hostId='be3517e0-f79d-464c-8169-f786d13ac287', vds='Host[sc5-ovirt-1,be3517e0-f79d-464c-8169-f786d13ac287]'}), log id: 6b1cb74b
2018-03-21 16:09:29,692-07 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [] FINISH, GetHardwareInfoVDSCommand, log id: 6b1cb74b
2018-03-21 16:09:29,900-07 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [576fddcc] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: be3517e0-f79d-464c-8169-f786d13ac287 Type: VDS
2018-03-21 16:09:29,944-07 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [26c5f844] Running command: InitVdsOnUpCommand internal: true. Entities affected :  ID: c4e2ca40-1e72-11e8-beac-00163e0994d8 Type: StoragePool
2018-03-21 16:09:29,977-07 INFO  [org.ovirt.engine.core.bll.storage.pool.ConnectHostToStoragePoolServersCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [41e6da49] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: c4e2ca40-1e72-11e8-beac-00163e0994d8 Type: StoragePool
2018-03-21 16:09:30,002-07 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [41e6da49] START, ConnectStorageServerVDSCommand(HostName = sc5-ovirt-1, StorageServerConnectionManagementVDSParameters:{hostId='be3517e0-f79d-464c-8169-f786d13ac287', storagePoolId='c4e2ca40-1e72-11e8-beac-00163e0994d8', storageType='GLUSTERFS', connectionList='[StorageServerConnections:{id='0e2e93f1-3904-4d70-82aa-16bcc83ea314', connection='172.16.0.153:/sc5-ovirt_engine', iqn='null', vfsType='glusterfs', mountOptions='backup-volfile-servers=172.16.0.152:172.16.0.151', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='26c9dbd8-f550-4b7a-9f84-3e905f1a00db', connection='172.16.0.151:/sc5-images', iqn='null', vfsType='glusterfs', mountOptions='backup-volfile-servers=172.16.0.152:172.16.0.153', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: acd504a
2018-03-21 16:09:30,099-07 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [41e6da49] FINISH, ConnectStorageServerVDSCommand, return: {26c9dbd8-f550-4b7a-9f84-3e905f1a00db=0, 0e2e93f1-3904-4d70-82aa-16bcc83ea314=0}, log id: acd504a
2018-03-21 16:09:30,107-07 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [41e6da49] START, ConnectStorageServerVDSCommand(HostName = sc5-ovirt-1, StorageServerConnectionManagementVDSParameters:{hostId='be3517e0-f79d-464c-8169-f786d13ac287', storagePoolId='c4e2ca40-1e72-11e8-beac-00163e0994d8', storageType='NFS', connectionList='[StorageServerConnections:{id='2239cb49-a8bb-49ee-9a5a-90d72c4602d0', connection='sc5-archive-10g-1:/var/ovirt/ovirt_iso_new', iqn='null', vfsType='null', mountOptions='null', nfsVersion='AUTO', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 35528d0f

