We have a number of clusters connected to ovirt-engine. Some of these are single-host
clusters (running ovirt-release43-4.3.5.2-1 on CentOS 7) with local storage. Recently,
ovirt-engine started reporting one of these hosts as NonResponsive. The VMs were still
running on the host, but the engine appears unable to communicate with it. Testing shows
no issues connecting from the engine to VDSM on the host, and likewise the host can
reach the engine on ports 80 and 443.
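For anyone wanting to repeat the connectivity test, it can be sketched roughly as below. This is only a minimal sketch: it assumes VDSM is listening on its default port 54321, and the engine hostname `engine.example.local` is a placeholder, not our real name. A successful TCP connect only shows the port is reachable; it says nothing about whether the TLS or JSON-RPC layer on top of it is healthy.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def report(checks):
    """Print one reachability line per (host, port) pair."""
    for host, port in checks:
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print("{}:{} {}".format(host, port, state))


# Assumption: 54321 is the default VDSM port; the engine name is hypothetical.
FROM_ENGINE = [("compute01.ovirt.local", 54321)]
FROM_HOST = [("engine.example.local", 80), ("engine.example.local", 443)]
```

Run `report(FROM_ENGINE)` on the engine and `report(FROM_HOST)` on the host, since each direction has to be tested from its own side.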
The host in question cannot be managed via IPMI for power management, but we are able to
perform an SSH reboot via the engine interface. We opted to log in to the running virtual
machines, shut them down, and issue the SSH reboot from the engine. The host changes to
Rebooting status for some time and then goes back to the NonResponsive state.
We are unable to put the host into maintenance or to confirm that the host has been
rebooted manually, as we are presented with the following error:
"Error while executing action: Cannot perform confirm 'Host has been
rebooted'. Another power management action is already in progress."
The VDSM logs on the host in question are continually showing:
2020-04-16 08:23:51,478+0000 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:711)
2020-04-16 08:23:52,332+0000 INFO (jsonrpc/7) [vdsm.api] FINISH getStoragePoolInfo error=Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',) from=::ffff:10.10.1.252,33680, task_id=420249a4-55c0-436d-92c7-ea1286a0e287 (api:52)
2020-04-16 08:23:52,332+0000 ERROR (jsonrpc/7) [storage.TaskManager.Task] (Task='420249a4-55c0-436d-92c7-ea1286a0e287') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in getStoragePoolInfo
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2550, in getStoragePoolInfo
    pool = self.getPool(spUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 351, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)
2020-04-16 08:23:52,333+0000 INFO (jsonrpc/7) [storage.TaskManager.Task] (Task='420249a4-55c0-436d-92c7-ea1286a0e287') aborting: Task is aborted: "Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)" - code 309 (task:1181)
During this period, the following is observed in the engine logs:
2020-04-16 08:23:52,307Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM compute01.ovirt.local command SpmStatusVDS failed: Message timeout which can be caused by communication issues
2020-04-16 08:23:52,307Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [] Command 'SpmStatusVDSCommand(HostName = compute01.ovirt.local, SpmStatusVDSCommandParameters:{hostId='67dc53da-d5ee-461e-87de-2ca6dd78637f', storagePoolId='6baea5dc-b049-47c2-a94f-5229c37c62d0'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2020-04-16 08:23:52,346Z ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [] Failed in 'GetStoragePoolInfoVDS' method
2020-04-16 08:23:52,355Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command GetStoragePoolInfoVDS failed: Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)
2020-04-16 08:23:52,356Z ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: Failed to GetStoragePoolInfoVDS, error = Unknown pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',), code = 309
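To check that both logs are complaining about the same storage pool, something like the following could be used. It is a rough sketch, not an oVirt tool: the function name and the regular expression are my own, and the two excerpts are shortened copies of the log lines above.

```python
import re
from collections import Counter

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# \s* lets the match span a line break, since pasted logs are often wrapped.
UNKNOWN_POOL = re.compile(r"pool not connected:\s*\(u'(" + UUID + r")'")


def unknown_pool_ids(log_text):
    """Count pool UUIDs reported as 'pool not connected' in a log excerpt."""
    return Counter(UNKNOWN_POOL.findall(log_text))


vdsm_excerpt = (
    "[vdsm.api] FINISH getStoragePoolInfo\n"
    "error=Unknown pool id, pool not connected:\n"
    "(u'6baea5dc-b049-47c2-a94f-5229c37c62d0',) from=::ffff:10.10.1.252,33680,\n"
)
engine_excerpt = (
    "VDSM command GetStoragePoolInfoVDS failed: Unknown\n"
    "pool id, pool not connected: (u'6baea5dc-b049-47c2-a94f-5229c37c62d0',)\n"
)

same_pool = set(unknown_pool_ids(vdsm_excerpt)) == set(unknown_pool_ids(engine_excerpt))
print("same pool in both logs:", same_pool)
```

In our case both sides name pool 6baea5dc-b049-47c2-a94f-5229c37c62d0, so the engine is at least relaying the error VDSM raises rather than asking about a different pool.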
The metadata file for the local storage domain looks fine, as far as I can tell:
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=compute01_local_storage
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=compute01_local
POOL_DOMAINS=1cc26dea-688c-40cc-bda6-38b00054001e:Active
POOL_SPM_ID=-1
POOL_SPM_LVER=-1
POOL_UUID=6baea5dc-b049-47c2-a94f-5229c37c62d0
REMOTE_PATH=/mnt/ovirt_datastore
ROLE=Master
SDUUID=1cc26dea-688c-40cc-bda6-38b00054001e
TYPE=LOCALFS
VERSION=5
_SHA_CKSUM=24c85256b889d0b3384e7975c660f4a5cbb58d33
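By "looks fine" I mean the checks sketched below: the POOL_UUID matches the storagePoolId the engine is asking about, and the domain's own SDUUID is listed as Active in POOL_DOMAINS. This is a minimal sketch with hypothetical helper names; it assumes POOL_DOMAINS is a comma-separated list of uuid:status pairs, and it does not verify _SHA_CKSUM.

```python
def parse_metadata(text):
    """Parse KEY=VALUE lines into a dict, ignoring lines without '='."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value
    return fields


def parse_pool_domains(value):
    """Split 'uuid:Status,uuid:Status' into a {uuid: status} dict (assumed format)."""
    domains = {}
    for item in value.split(","):
        if item:
            uuid, _, status = item.partition(":")
            domains[uuid] = status
    return domains


def pool_consistency_problems(fields, expected_pool_uuid):
    """Return a list of human-readable mismatches; empty means consistent."""
    problems = []
    if fields.get("POOL_UUID") != expected_pool_uuid:
        problems.append("POOL_UUID does not match the pool the engine expects")
    domains = parse_pool_domains(fields.get("POOL_DOMAINS", ""))
    if domains.get(fields.get("SDUUID")) != "Active":
        problems.append("SDUUID is not listed as Active in POOL_DOMAINS")
    return problems


SAMPLE = """\
POOL_DOMAINS=1cc26dea-688c-40cc-bda6-38b00054001e:Active
POOL_UUID=6baea5dc-b049-47c2-a94f-5229c37c62d0
ROLE=Master
SDUUID=1cc26dea-688c-40cc-bda6-38b00054001e
"""

print(pool_consistency_problems(parse_metadata(SAMPLE),
                                "6baea5dc-b049-47c2-a94f-5229c37c62d0"))
```

An empty list here only says the file is internally consistent with what the engine expects; it does not explain why VDSM reports the pool as not connected.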
I assume this has happened because oVirt was unable to power cycle the machine and
now cannot confirm the SPM state. Is that right? Normally in a case like this we would
confirm that the host has been manually rebooted, but we are unable to do that.
How can I clear the power management action that ovirt-engine thinks is in progress?