Hi,
I have a hyperconverged ovirt 4.3.10 installation. One host is unresponsive. This has
happened in the past but most often resolved itself within a few minutes. This time it is
a different beast.
The host does not come back. Putting the host in maintenance is not possible because
"Host is unresponsive but has running virtual machines". Putting the host into
local maintenance on the local CLI does not help. Global maintenance does not have any
effect either. The host has been rebooted several times both locally and via the web
interface on the engine, which worked but did not resolve the issue.
Logs:
/var/log/ovirt-engine/engine.log
2022-06-07 06:40:17,606+02 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-34)
[26f286ba-58d0-44be-96e0-f874057fe603] EVENT_ID: GENERIC_ERROR_MESSAGE(14,001), Cannot
switch Host to Maintenance mode.
Host still has running VMs on it and is in Non Responsive state.
2022-06-07 06:40:17,606+02 WARN
[org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-34)
[26f286ba-58d0-44be-96e0-f874057fe603] Validation of action
'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons:
VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_VDS_IS_NOT_RESPONDING_WITH_VMS,$HostNotResponding
ovirt3.os-s.de
2022-06-07 06:41:23,974+02 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engineScheduled-Thread-33) [] EVENT_ID:
VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ovirt3.os-s.de command Get Host Capabilities
failed: Message timeout which can be caused by communication issues
2022-06-07 06:41:23,974+02 ERROR
[org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring]
(EE-ManagedThreadFactory-engineScheduled-Thread-33) [] Unable to RefreshCapabilities:
VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can
be caused by communication issues
2022-06-07 06:44:47,033+02 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engineScheduled-Thread-4) [] EVENT_ID:
VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ovirt3.os-s.de command Get Host Capabilities
failed: Message timeout which can be caused by communication issues
2022-06-07 06:44:47,034+02 ERROR
[org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring]
(EE-ManagedThreadFactory-engineScheduled-Thread-4) [] Unable to RefreshCapabilities:
VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can
be caused by communication issues
vdsmd status:
Jun 06 08:51:09 ovirt3.os-s.de vdsm[3567]: WARN Worker blocked: <Worker name=jsonrpc/1
running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0',
'method': u'Host.getStats', 'id':
u'ce7ab5d1-3034-4b5a-91c5-9446e17341a2'} at 0x7fe1646563d0> timeout=60,
duration=120.00 at 0x7fe1646562d0> task#=15 at 0x7fe165f47e50>, traceback:
File:
"/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File:
"/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File:
"/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File:
"/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 260, in run
ret = func(*args, **kwargs)
File:
"/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File:
"/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
task()
File:
"/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
self._callable()
File:
"/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
self._handler(self._ctx, self._req)
File:
"/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in
_serveRequest
response = self._handle_request(req, ctx)
File:
"/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in
_handle_request
res = method(**params)
File:
"/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in
_dynamicMethod
result = fn(*methodArgs)
File: "<string>", line 2, in
getStats
File:
"/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
ret = func(*args, **kwargs)
File:
"/usr/lib/python2.7/site-packages/vdsm/API.py", line 1435, in getStats
multipath=True)}
File:
"/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 79, in get_stats
ret['haStats'] = _getHaInfo()
File:
"/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo
stats = instance.get_all_stats(timeout=5)
File:
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line
94, in get_all_stats
stats = broker.get_stats_from_storage()
File:
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
line 146, in get_stats_from_storage
result = self._proxy.get_stats()
File:
"/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
return self.__send(self.__name, args)
File:
"/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
verbose=self.__verbose
File:
"/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
return self.single_request(host, handler,
request_body, verbose)
File:
"/usr/lib64/python2.7/xmlrpclib.py", line 1301, in single_request
self.send_content(h, request_body)
File:
"/usr/lib64/python2.7/xmlrpclib.py", line 1448, in send_content
connection.endheaders(request_body)
File:
"/usr/lib64/python2.7/httplib.py", line 1052, in endheaders
self._send_output(message_body)
File:
"/usr/lib64/python2.7/httplib.py", line 890, in _send_output
self.send(msg)
File:
"/usr/lib64/python2.7/httplib.py", line 852, in send
self.connect()
File:
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line
60, in connect
self.sock.connect(base64.b16decode(self.host))
File:
"/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
Jun 06 08:51:14 ovirt3.os-s.de vdsm[3567]: WARN Worker blocked: <Worker name=periodic/0
running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at
0x7fe16457ba90> at 0x7fe16457bad0> timeout=15, duration=120.00 at 0x7fe1644c3910>
task#=22 at 0x7fe1646e4550>, traceback:
File:
"/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File:
"/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File:
"/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File:
"/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 260, in run
ret = func(*args, **kwargs)
File:
"/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File:
"/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
task()
File:
"/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
self._callable()
File:
"/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 186, in __call__
self._func()
File:
"/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 481, in __call__
stats = hostapi.get_stats(self._cif,
self._samples.stats())
File:
"/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 79, in get_stats
ret['haStats'] = _getHaInfo()
File:
"/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo
stats = instance.get_all_stats(timeout=5)
File:
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line
94, in get_all_stats
stats = broker.get_stats_from_storage()
File:
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
line 146, in get_stats_from_storage
result = self._proxy.get_stats()
File:
"/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
return self.__send(self.__name, args)
File:
"/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
verbose=self.__verbose
File:
"/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
return self.single_request(host, handler,
request_body, verbose)
File:
"/usr/lib64/python2.7/xmlrpclib.py", line 1301, in single_request
self.send_content(h, request_body)
File:
"/usr/lib64/python2.7/xmlrpclib.py", line 1448, in send_content
connection.endheaders(request_body)
File:
"/usr/lib64/python2.7/httplib.py", line 1052, in endheaders
self._send_output(message_body)
File:
"/usr/lib64/python2.7/httplib.py", line 890, in _send_output
self.send(msg)
File:
"/usr/lib64/python2.7/httplib.py", line 852, in send
self.connect()
File:
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line
60, in connect
self.sock.connect(base64.b16decode(self.host))
File:
"/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
vdsmd log:
2022-06-07 05:02:13,717+0000 INFO (jsonrpc/3) [vdsm.api] FINISH multipath_health
return={} from=::1,54704, task_id=169b447f-fd52-4a6f-9277-eff9870ffb81 (api:54)
2022-06-07 05:02:13,725+0000 INFO (jsonrpc/3) [api.host] FINISH getStats
return={'status': {'message': 'Done', 'code': 0},
'info': {'cpuStatistics': {'20': {'cpuUser':
'0.07', 'nodeIndex': 1, 'cpuSys': '0.00',
'cpuIdle': '99.93'}, '21': {'cpuUser': '0.00',
'nodeIndex': 1, 'cpuSys': '0.00', 'cpuIdle':
'100.00'}, '22': {'cpuUser': '0.00', 'nodeIndex':
1, 'cpuSys': '0.00', 'cpuIdle': '100.00'}, '23':
{'cpuUser': '0.00', 'nodeIndex': 1, 'cpuSys':
'0.07', 'cpuIdle': '99.93'}, '1': {'cpuUser':
'0.87', 'nodeIndex': 0, 'cpuSys': '0.60',
'cpuIdle': '98.53'}, '0': {'cpuUser': '0.40',
'nodeIndex': 0, 'cpuSys': '0.27', 'cpuIdle':
'99.33'}, '3': {'cpuUser': '0.27', 'nodeIndex': 0,
'cpuSys': '0.33', 'cpuIdle': '99.40'}, '2':
{'cpuUser': '0.53', 'nodeIndex': 0, 'cpuSys':
'0.40', 'cpuIdle': '99.07'}, '5': {'cpuUser':
'0.33', 'nodeIndex': 0, 'cpuSys': '0.33',
'cpuIdle': '99.34'}, '4': {'cpuUser': '0.33',
'nodeIndex': 0, 'cpuSys': '0.40', 'cpuIdle':
'99.27'}, '7': {'cpuUser': '0.13', 'nodeIndex': 1,
'cpuSys': '0.07', 'cpuIdle': '99.80'}, '6':
{'cpuUser': '0.13', 'nodeIndex': 1, 'cpuSys':
'0.07', 'cpuIdle': '99.80'}, '9': {'cpuUser':
'0.13', 'nodeIndex': 1, 'cpuSys': '0.13',
'cpuIdle': '99.74'}, '8': {'cpuUser': '0.07',
'nodeIndex': 1, 'cpuSys': '0.13', 'cpuIdle':
'99.80'}, '11': {'cpuUser': '0.20', 'nodeIndex':
1, 'cpuSys': '0.13', 'cpuIdle': '99.67'}, '10':
{'cpuUser': '0.20', 'nodeIndex': 1, 'cpuSys':
'0.27', 'cpuIdle': '99.53'}, '13': {'cpuUser':
'0.80', 'nodeIndex': 0, 'cpuSys': '0.47',
'cpuIdle': '98.73'}, '12': {'cpuUser': '0.00',
'nodeIndex': 0, 'cpuSys': '0.00', 'cpuIdle':
'100.00'}, '15': {'cpuUser': '0.07', 'nodeIndex':
0, 'cpuSys': '0.00', 'cpuIdle': '99.93'}, '14':
{'cpuUser': '0.20', 'nodeIndex': 0, 'cpuSys':
'0.27', 'cpuIdle': '99.53'}, '17': {'cpuUser':
'0.07', 'nodeIndex': 0, 'cpuSys': '0.00',
'cpuIdle': '99.93'}, '16': {'cpuUser': '0.07',
'nodeIndex': 0, 'cpuSys': '0.00', 'cpuIdle':
'99.93'}, '19': {'cpuUser': '1.46', 'nodeIndex':
1, 'cpuSys': '0.33', 'cpuIdle'
: '98.21'}, '18': {'cpuUser': '0.27',
'nodeIndex': 1, 'cpuSys': '0.07', 'cpuIdle':
'99.66'}}, 'numaNodeMemFree': {'1': {'memPercent': 18,
'memFree': '52935'}, '0': {'memPercent': 17,
'memFree': '53551'}}, 'memShared': 0, 'haScore': 0,
'thpState': 'always', 'ksmMergeAcrossNodes': True,
'vmCount': 3, 'memUsed': '3', 'storageDomains': {},
'incomingVmMigrations': 0, 'network': {'bond0':
{'rxErrors': '0', 'txErrors': '0', 'speed':
'2000', 'rxDropped': '4', 'name': 'bond0',
'tx': '84766340728', 'txDropped': '0', 'duplex':
'full', 'sampleTime': 1654578133.717023, 'rx':
'102755561238', 'state': 'up'}, 'eno50':
{'rxErrors': '0', 'txErrors': '0', 'speed':
'1000', 'rxDropped': '0', 'name': 'eno50',
'tx': '0', 'txDropped': '0', 'duplex':
'unknown', 'sampleTime': 1654578133.717023, 'rx': '0',
'state': 'down'}, ';vdsmdummy;': {'rxErrors': '0',
'txErrors': '0', 'speed': '1000', 'rxDropped':
'0', 'name': ';vdsmdummy;', 'tx': '0',
'txDropped': '0', 'duplex': 'unknown',
'sampleTime': 1654578133.71702
3, 'rx': '0', 'state': 'down'}, 'ovirtmgmt':
{'rxErrors': '0', 'txErrors': '0', 'speed':
'1000', 'rxDropped': '122', 'name': 'ovirtmgmt',
'tx': '81789466633', 'txDropped': '0', 'duplex':
'unknown', 'sampleTime': 1654578133.717023, 'rx':
'99028291927', 'state': 'up'}, 'lo': {'rxErrors':
'0', 'txErrors': '0', 'speed': '1000',
'rxDropped': '0', 'name': 'lo', 'tx':
'10214215104', 'txDropped': '0', 'duplex':
'unknown', 'sampleTime': 1654578133.717023, 'rx':
'10214215104', 'state': 'up'}, 'ovs-system':
{'rxErrors': '0', 'txErrors': '0', 'speed':
'1000', 'rxDropped': '0', 'name': 'ovs-system',
'tx': '0', 'txDropped': '0', 'duplex':
'unknown', 'sampleTime': 1654578133.717023, 'rx': '0',
'state': 'down'}, 'eno49': {'rxErrors': '0',
'txErrors': '0', 'speed': '1000', 'rxDropped':
'0', 'name': 'eno49', 'tx': '0',
'txDropped': '0', 'duplex': 'unknown',
'sampleTime': 1654578133.717023, 'rx': '0', 'state':
'down'}, 'br-int': {'rxErrors': '0', 'txErrors':
'0', 'speed': '1000', 'rxDropped': '0', 'na
me': 'br-int', 'tx': '0', 'txDropped': '0',
'duplex': 'unknown', 'sampleTime': 1654578133.717023,
'rx': '0', 'state': 'down'}, 'eno1':
{'rxErrors': '0', 'txErrors': '0', 'speed':
'1000', 'rxDropped': '0', 'name': 'eno1',
'tx': '0', 'txDropped': '0', 'duplex':
'unknown', 'sampleTime': 1654578133.717023, 'rx': '0',
'state': 'down'}, 'eno2': {'rxErrors': '0',
'txErrors': '0', 'speed': '1000', 'rxDropped':
'0', 'name': 'eno2', 'tx': '0',
'txDropped': '0', 'duplex': 'unknown',
'sampleTime': 1654578133.717023, 'rx': '0', 'state':
'down'}, 'eno3': {'rxErrors': '0', 'txErrors':
'0', 'speed': '1000', 'rxDropped': '1',
'name': 'eno3', 'tx': '42493183527', 'txDropped':
'0', 'duplex': 'full', 'sampleTime': 1654578133.717023,
'rx': '20793574', 'state': 'up'}, 'eno4':
{'rxErrors': '0', 'txErrors': '0', 'speed':
'1000', 'rxDropped': '1', 'name': 'eno4',
'tx': '42273157201', 'txDropped': '0', 'duplex':
'full', 'sampleTime': 1654578133.717023, 'rx':
'102734767664', 'state': 'up'}, 'genev_sys_6081':
{'rxErrors': '0', 'txErrors': '8', 'speed':
'1000', 'rxDropped': '0', 'name':
'genev_sys_6081', 'tx': '0', 'txDropped': '0',
'duplex': 'unknown', 'sampleTime': 1654578133.717023,
'rx': '0', 'state': 'up'}}, 'txDropped':
'128', 'anonHugePages': '326', 'ksmPages': 100,
'elapsedTime': '391.12', 'cpuLoad': '0.16',
'cpuSys': '0.18', 'diskStats': {'/var/log':
{'free': '6478'}, '/var/run/vdsm/': {'free':
'64294'}, '/tmp': {'free': '906'}},
'cpuUserVdsmd': '0.47', 'netConfigDirty': 'False',
'memCommitted': 28672, 'ksmState': False, 'vmMigrating': 0,
'ksmCpu': 0, 'memAvailable': 126975, 'bootTime':
'1654505219', 'haStats': {'active': True, 'configured':
True, 'score': 0, 'localMaintenance': True, 'globalMaintenance':
False}, 'momStatus': 'active', 'multipathHealth': {},
'rxDropped': '0', 'outgoingVmMigrations': 0, 'swapTotal':
32191, 'swapFree': 32191, 'hugepages': defaultdict(<type
'dict'>, {1048576: {'resv_hugepages': 0, 'free_hugepages': 0,
'nr_overcommit_hugepages': 0, 'surplus_hugepages
': 0, 'vm.free_hugepages': 0, 'nr_hugepages': 0,
'nr_hugepages_mempolicy': 0}, 2048: {'resv_hugepages': 0,
'free_hugepages': 0, 'nr_overcommit_hugepages': 0,
'surplus_hugepages': 0, 'vm.free_hugepages': 0, 'nr_hugepages': 0,
'nr_hugepages_mempolicy': 0}}), 'dateTime': '2022-06-07T05:02:13
GMT', 'cpuUser': '0.27', 'memFree': 126719, 'cpuIdle':
'99.55', 'vmActive': 0, 'v2vJobs': {}, 'cpuSysVdsmd':
'0.13'}} from=::1,54704 (api:54)
2022-06-07 05:02:13,726+0000 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call
Host.getStats succeeded in 0.03 seconds (__init__:312)
2022-06-07 05:02:13,973+0000 INFO (vmrecovery) [vdsm.api] START
getConnectedStoragePoolsList(options=None) from=internal,
task_id=65629949-c74d-4f66-846a-1e2c40be9157 (api:48)
2022-06-07 05:02:13,973+0000 INFO (vmrecovery) [vdsm.api] FINISH
getConnectedStoragePoolsList return={'poollist': []} from=internal,
task_id=65629949-c74d-4f66-846a-1e2c40be9157 (api:54)
2022-06-07 05:02:13,973+0000 INFO (vmrecovery) [vds] recovery: waiting for storage pool
to go up (clientIF:723)
2022-06-07 05:02:14,468+0000 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.ping2
succeeded in 0.00 seconds (__init__:312)
2022-06-07 05:02:14,471+0000 INFO (jsonrpc/5) [vdsm.api] START
repoStats(domains=[u'5d6276cc-08ab-47f6-81e7-4e64aac3d386']) from=::1,54704,
task_id=6918f715-bbf4-4c7f-af00-1764294ab665 (api:48)
2022-06-07 05:02:14,472+0000 INFO (jsonrpc/5) [vdsm.api] FINISH repoStats return={}
from=::1,54704, task_id=6918f715-bbf4-4c7f-af00-1764294ab665 (api:54)
2022-06-07 05:02:14,472+0000 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call
Host.getStorageRepoStats succeeded in 0.00 seconds (__init__:312)
Gluster volumes are all online.
systemctl status ovirt-ha-broker:
Jun 06 08:47:08 ovirt3.os-s.de systemd[1]: Started oVirt Hosted Engine High Availability
Communications Broker.
Jun 06 08:47:59 ovirt3.os-s.de ovirt-ha-broker[1524]: ovirt-ha-broker
mgmt_bridge.MgmtBridge ERROR Failed to getVdsStats: No 'network' in result
Any ideas?
Ralf