In engine.log, at 2020-04-28 23:48:18,378-04, I see:
SetVdsStatusVDSCommandParameters:{ hostId='b34db269-5351-4653-9a0c-90a9154cd687', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}
So when the test tries to put host1 into local maintenance at 2020-04-28 23:59:51, it fails with:
Validation of action 'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons: VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_NO_ALTERNATE_HOST_FOR_HOSTED_ENGINE
vdsm on host0 shows a traceback:
2020-04-28 23:43:04,944-0400 ERROR (jsonrpc/0) [vds] setKsmTune API call failed. (API:1660)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1657, in setKsmTune
    supervdsm.getProxy().ksmTune(tuningParams)
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
    **kwargs)
  File "<string>", line 2, in ksmTune
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
IOError: [Errno 22] Invalid argument
This seems unrelated, but may be worth investigating by the storage team. +Tal Nisan, can you look into this?
Closer to the failure, on host0, I see:
2020-04-28 23:49:58,775-0400 ERROR (vm/b6ca2e94) [virt.vm] (vmId='b6ca2e94-df8b-48e9-b0ee-2bc0f939786a') The vm start process failed (vm:934)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 868, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2895, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1110, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: internal error: qemu unexpectedly closed the monitor: 2020-04-29T03:49:55.484660Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2020-04-29T03:49:55.582536Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-cfb2266f-5d47-4418-b30f-9c1d3fbf512c,id=ua-cfb2266f-5d47-4418-b30f-9c1d3fbf512c,bootindex=1,write-cache=on: Failed to get shared "write" lock
Is another process using the image [/var/run/vdsm/storage/fc1a55d5-deb4-4423-be56-e7313645798b/cfb2266f-5d47-4418-b30f-9c1d3fbf512c/68d04a61-9f34-4a1b-8d6e-bca43a7b9339]?
2020-04-28 23:49:58,775-0400 INFO (vm/b6ca2e94) [virt.vm] (vmId='b6ca2e94-df8b-48e9-b0ee-2bc0f939786a') Changed state to Down: internal error: qemu unexpectedly closed the monitor: 2020-04-29T03:49:55.484660Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2020-04-29T03:49:55.582536Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-cfb2266f-5d47-4418-b30f-9c1d3fbf512c,id=ua-cfb2266f-5d47-4418-b30f-9c1d3fbf512c,bootindex=1,write-cache=on: Failed to get shared "write" lock
Is another process using the image [/var/run/vdsm/storage/fc1a55d5-deb4-4423-be56-e7313645798b/cfb2266f-5d47-4418-b30f-9c1d3fbf512c/68d04a61-9f34-4a1b-8d6e-bca43a7b9339]? (code=1) (vm:1702)
2020-04-28 23:49:58,799-0400 INFO (vm/b6ca2e94) [virt.vm] (vmId='b6ca2e94-df8b-48e9-b0ee-2bc0f939786a') Stopping connection (guestagent:455)
2020-04-28 23:49:58,849-0400 INFO (jsonrpc/1) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:192.168.200.99,49938, vmId=b6ca2e94-df8b-48e9-b0ee-2bc0f939786a (api:48)
2020-04-28 23:49:58,851-0400 INFO (jsonrpc/1) [virt.vm] (vmId='b6ca2e94-df8b-48e9-b0ee-2bc0f939786a') Release VM resources (vm:5186)
2020-04-28 23:49:58,851-0400 WARN (jsonrpc/1) [virt.vm] (vmId='b6ca2e94-df8b-48e9-b0ee-2bc0f939786a') trying to set state to Powering down when already Down (vm:626)
2020-04-28 23:49:58,851-0400 INFO (jsonrpc/1) [virt.vm] (vmId='b6ca2e94-df8b-48e9-b0ee-2bc0f939786a') Stopping connection (guestagent:455)
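Note that the qemu-kvm messages are stamped in UTC while the vdsm and engine lines use a -0400 offset, so they need converting before they can be placed on the same timeline. A minimal sketch of that conversion follows; the helper name qemu_to_vdsm_time and the constant VDSM_TZ are only illustrative and are not part of vdsm or qemu:

```python
from datetime import datetime, timedelta, timezone

# Offset used by the vdsm log lines quoted above (-0400); assumed fixed here.
VDSM_TZ = timezone(timedelta(hours=-4))


def qemu_to_vdsm_time(stamp):
    """Convert a qemu UTC stamp such as 2020-04-29T03:49:55.582536Z
    into an aware datetime in the vdsm log offset."""
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return utc.replace(tzinfo=timezone.utc).astimezone(VDSM_TZ)


# The "Failed to get shared write lock" message from the excerpt above.
local = qemu_to_vdsm_time("2020-04-29T03:49:55.582536Z")

# Print in the same layout vdsm uses (millisecond precision, numeric offset).
print(local.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3] + local.strftime("%z"))
# prints 2020-04-28 23:49:55,582-0400
```

That puts the lock failure about three seconds before vdsm logs "The vm start process failed" at 23:49:58,775-0400.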
+Ryan Barry, can you check the qemu-kvm warning?
Help understanding why the storage domain became unreachable is welcome.
Sandro Bonazzola
MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV