Hello
We have a cluster with 9 oVirt hosts and a storage cluster with 3 storage servers; we run
oVirt 4.3.6.6-1.
The storage is set up as replica 3 with arbiter.
There is another, unmanaged Gluster volume, also replica 3, built on the local disks of the
oVirt hosts; the engine and a few small pfSense VMs reside on it.
Yesterday we experienced a hardware failure of a RAID controller on one of the storage
servers, so that server stopped working.
On that particular server, on its local disks, there was an NFS storage domain used as the
ISO domain.
After the storage server failure, multiple machines got stuck in a non-responding state;
others kept working, but we could not manage them from oVirt. Many machines had an ISO
mounted, so it may be related to this:
https://bugzilla.redhat.com/show_bug.cgi?id=1337073
But machines with an ISO mounted were not the only ones affected. Almost every host in the
oVirt cluster switched between non-responding and OK status, and many machines were in
unknown status even though they had no ISO mounted and did not even reside on any storage
domain of the Gluster cluster, for example the pfSense machines residing on local drives,
like the engine. These machines could not be powered off from inside the engine. In any
case, I could not put the ISO domain into maintenance because ISOs were still mounted on
various virtual machines.
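For reference, a short script along these lines can be used to list which VMs still have an
ISO attached (just a rough sketch with the Python SDK, ovirtsdk4; the engine URL and
credentials below are placeholders, not our real ones):

# Rough sketch: list VMs that still have an ISO in their CD-ROM drive.
# Engine URL and credentials are placeholders.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.org/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='password',
    insecure=True,  # or ca_file='/etc/pki/ovirt-engine/ca.pem'
)
try:
    vms_service = connection.system_service().vms_service()
    for vm in vms_service.list():
        # cdrom.file is None when no ISO is set for that drive.
        for cdrom in vms_service.vm_service(vm.id).cdroms_service().list():
            if cdrom.file is not None:
                print('%s: %s' % (vm.name, cdrom.file.id))
finally:
    connection.close()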
I'm wondering what caused the almost total failure. Was it the NFS domain that could not
be mounted? And of course, how could I prevent such a situation?
I'm sending you the logs from one of the hosts covering the critical time period. Please
let me know if you need more debug info.
2019-10-17 06:01:43,039+0300 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getStats succeeded in 0.25 seconds (__init__:312)
2019-10-17 06:01:46,638+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/g2-car0133.gfs-int.uoc.gr:_nfs-exports_isos/ec67be6b-367f-4e41-a686-695ab50c0034/dom_md/metadata' is blocked for 390.00 seconds (check:282)
2019-10-17 06:01:49,318+0300 WARN (itmap/0) [storage.scanDomains] Could not collect metadata file for domain path /rhev/data-center/mnt/g2-car0133.gfs-int.uoc.gr:_nfs-exports_isos (fileSD:913)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 902, in collectMetaFiles
    metaFiles = oop.getProcessPool(client_name).glob.glob(mdPattern)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 110, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 561, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 442, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
2019-10-17 06:01:49,319+0300 ERROR (monitor/ec67be6) [storage.Monitor] Error checking domain ec67be6b-367f-4e41-a686-695ab50c0034 (monitor:425)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 406, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 145, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 135, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'ec67be6b-367f-4e41-a686-695ab50c0034',)
2019-10-17 06:01:50,253+0300 INFO (periodic/42) [vdsm.api] START repoStats(domains=()) from=internal, task_id=25ab44e7-26f7-4494-a970-f73f01892156 (api:48)
2019-10-17 06:01:50,253+0300 INFO (periodic/42) [vdsm.api] FINISH repoStats return={u'3c973b94-a57d-4462-bcb3-3cef13290327': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.000932761', 'lastCheck': '3.6', 'valid': True}, u'ec67be6b-367f-4e41-a686-695ab50c0034': {'code': 2001, 'actual': True, 'version': -1, 'acquired': False, 'delay': '0', 'lastCheck': '0.9', 'valid': False}, u'a1ecca6f-1487-492d-b9bf-d3fb60f1bc99': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00142361', 'lastCheck': '3.7', 'valid': True}, u'd5c52bc8-1a83-4915-b82f-d98e33db7c99': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00124982', 'lastCheck': '3.6', 'valid': True}, u'453eb3c2-c3a5-4d5f-8c1d-b16588fc5eff': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.001343', 'lastCheck': '3.6', 'valid': True}} from=internal, task_id=25ab44e7-26f7-4494-a970-f73f01892156 (api:54)
2019-10-17 06:01:50,403+0300 WARN (vdsm.Scheduler) [Executor] executor state: count=8 workers=set([<Worker name=qgapoller/58 waiting task#=1 at 0x7fcd61483310>, <Worker name=qgapoller/53 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=30.00 at 0x7fcd615adcd0> discarded task#=11 at 0x7fcd615ad2d0>, <Worker name=qgapoller/56 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=10.00 at 0x7fcd614cd8d0> discarded task#=11 at 0x7fcd610013d0>, <Worker name=qgapoller/60 waiting task#=0 at 0x7fcd612cd910>, <Worker name=qgapoller/57 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=20.00 at 0x7fcd8003f190> discarded task#=0 at 0x7fcd3c355310>, <Worker name=qgapoller/55 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=40.00 at 0x7fcd61483bd0> discarded task#=57 at 0x7fcd611dd150>, <Worker name=qgapoller/59 waiting task#=0 at 0x7fcd614318d0>, <Worker name=qgapoller/54 waiting task#=14 at 0x7fcd3c3f5fd0>]) (executor:213)
2019-10-17 06:01:50,403+0300 INFO (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=qgapoller/56 running <Task discardable <Operation action=<VmDispatcher operation=<function <lambda> at 0x7fcd80036668> at 0x7fcd80492850> at 0x7fcd80492890> timeout=10, duration=10.00 at 0x7fcd614cd8d0> discarded task#=11 at 0x7fcd610013d0> (executor:355)
2019-10-17 06:01:52,094+0300 WARN (vdsm.Scheduler) [Executor] executor state: count=10 workers=set([<Worker name=periodic/42 waiting task#=4 at 0x7fcd3c592850>, <Worker name=periodic/43 waiting task#=0 at 0x7fcd6124a090>, <Worker name=periodic/39 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=37.50 at 0x7fcd61483610> discarded task#=10 at 0x7fcd8003fed0>, <Worker name=periodic/2 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=832.50 at 0x7fcd3c505f10> discarded task#=287729 at 0x7fcd806020d0>, <Worker name=periodic/41 waiting task#=7 at 0x7fcd3c5927d0>, <Worker name=periodic/3 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=802.50 at 0x7fcd6162d550> discarded task#=287772 at 0x7fcd3c592910>, <Worker name=periodic/34 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=22.50 at 0x7fcd3c529d10> discarded task#=43 at 0x7fcd3c4add50>, <Worker name=periodic/40 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.UpdateVolumes'> at 0x7fcd3c5924d0> at 0x7fcd3c592e10> timeout=30.0, duration=2.89 at 0x7fcd3c433710> task#=20 at 0x7fcd612cd890>, <Worker name=periodic/38 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=7.50 at 0x7fcd3c315fd0> discarded task#=18 at 0x7fcd6124a1d0>, <Worker name=periodic/4 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=817.50 at 0x7fcd615d62d0> discarded task#=284811 at 0x7fcd614495d0>]) (executor:213)
2019-10-17 06:01:52,094+0300 INFO (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/38 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7fcd3c5a5090> at 0x7fcd3c5a52d0> timeout=7.5, duration=7.50 at 0x7fcd3c315fd0> discarded task#=18 at 0x7fcd6124a1d0> (executor:355)
2019-10-17 06:01:52,495+0300 INFO (monitor/453eb3c) [IOProcessClient] (/g2-car0133.gfs-int.uoc.gr:_nfs-exports_isos) Closing client (__init__:610)
2019-10-17 06:01:52,495+0300 INFO (monitor/453eb3c) [IOProcessClient] (/glusterSD/10.252.166.128:_volume1) Closing client (__init__:610)
2019-10-17 06:01:52,496+0300 INFO (monitor/453eb3c) [IOProcessClient] (/glusterSD/10.252.166.130:_volume3) Closing client (__init__:610)
2019-10-17 06:01:52,496+0300 INFO (monitor/453eb3c) [IOProcessClient] (/glusterSD/10.252.166.129:_volume2) Closing client (__init__:610)
2019-10-17 06:01:52,496+0300 INFO (monitor/453eb3c) [IOProcessClient] (/glusterSD/o5-car0118.gfs-int.uoc.gr:_engine) Closing client (__init__:610)
2019-10-17 06:01:53,123+0300 INFO (itmap/0) [IOProcessClient] (/glusterSD/o5-car0118.gfs-int.uoc.gr:_engine) Starting client (__init__:308)
2019-10-17 06:01:53,133+0300 INFO (itmap/1) [IOProcessClient] (/glusterSD/10.252.166.130:_volume3) Starting client (__init__:308)
2019-10-17 06:01:53,136+0300 INFO (ioprocess/28525) [IOProcess] (/glusterSD/o5-car0118.gfs-int.uoc.gr:_engine) Starting ioprocess (__init__:434)
2019-10-17 06:01:53,149+0300 INFO (itmap/2) [IOProcessClient] (/glusterSD/10.252.166.128:_volume1) Starting client (__init__:308)
2019-10-17 06:01:53,152+0300 INFO (ioprocess/28534) [IOProcess] (/glusterSD/10.252.166.130:_volume3) Starting ioprocess (__init__:434)
2019-10-17 06:01:53,161+0300 INFO (itmap/3) [IOProcessClient] (/glusterSD/10.252.166.129:_volume2) Starting client (__init__:308)
2019-10-17 06:01:53,165+0300 WARN (monitor/a1ecca6) [storage.PersistentDict] Could not parse line `^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ...` and so on (the line is one long run of NUL bytes, truncated here)
2019-10-17 06:01:53,166+0300 INFO (ioprocess/28540) [IOProcess] (/glusterSD/10.252.166.128:_volume1) Starting ioprocess (__init__:434)
2019-10-17 06:01:53,175+0300 INFO (ioprocess/28547) [IOProcess] (/glusterSD/10.252.166.129:_volume2) Starting ioprocess (__init__:434)
2019-10-17 06:01:54,763+0300 INFO (monitor/a1ecca6) [storage.StorageDomain] Removing remnants of deleted images [] (fileSD:726)
2019-10-17 06:01:55,271+0300 INFO (jsonrpc/4) [api.host] START getAllVmStats() from=::1,32814 (api:48)
2019-10-17 06:01:55,274+0300 WARN (jsonrpc/4) [virt.vm] (vmId='38586008-37a0-4400-83b2-771090c50a9a') monitor became unresponsive (command timeout, age=850.680000001) (vm:6061)
2019-10-17 06:01:55,277+0300 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getAllVmStats', 'id': u'33b72799-68cd-419c-bb6c-d9c9f61664b5'} at 0x7fcd3c329d10> timeout=60, duration=60.01 at 0x7fcd3c3298d0> task#=39322 at 0x7fcd3c581ed0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 195, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 198, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in getAllVmStats
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1395, in getAllVmStats
  statsList = self._cif.getAllVmStats()
File: "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 569, in getAllVmStats
  return [v.getStats() for v in self.getVMs().values()]
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1781, in getStats
  oga_stats = self._getGuestStats()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2000, in _getGuestStats
  self._update_guest_disk_mapping()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2015, in _update_guest_disk_mapping
  self._sync_metadata()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5142, in _sync_metadata
  self._md_desc.dump(self._dom)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/metadata.py", line 515, in dump
  0)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 100, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 94, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2364, in setMetadata
  ret = libvirtmod.virDomainSetMetadata(self._o, type, metadata, key, uri, flags)
(executor:363)
2019-10-17 06:01:55,279+0300 WARN (jsonrpc/3) [virt.vm] (vmId='b14cbec9-7acb-4f80-ad5e-3169e17d5328') Couldn't update metadata: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats) (vm:2018)
2019-10-17 06:01:55,285+0300 INFO (jsonrpc/3) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::1,32814 (api:54)
2019-10-17 06:01:55,290+0300 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 60.02 seconds (__init__:312)
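
For what it's worth, a quick manual check like the one below shows whether the ISO domain
mount itself hangs on a plain stat(). This is only a rough sketch, not what vdsm runs
internally; the path is copied from the log above and the 10-second timeout is arbitrary:

# Rough sketch: stat() the ISO domain metadata path in a child process with a
# timeout, so a hung NFS mount cannot block the calling shell forever.
# The path is copied from the vdsm log above; the timeout value is arbitrary.
import multiprocessing
import os
import sys

PATH = ("/rhev/data-center/mnt/g2-car0133.gfs-int.uoc.gr:_nfs-exports_isos/"
        "ec67be6b-367f-4e41-a686-695ab50c0034/dom_md/metadata")
TIMEOUT = 10  # seconds

def _stat(path):
    os.stat(path)

if __name__ == "__main__":
    proc = multiprocessing.Process(target=_stat, args=(PATH,))
    proc.start()
    proc.join(TIMEOUT)
    if proc.is_alive():
        # The child may stay stuck in uninterruptible sleep if the NFS server
        # is really gone, but at least the parent is free to report it.
        proc.terminate()
        print("BLOCKED: stat() did not return within %s seconds" % TIMEOUT)
        sys.exit(1)
    print("OK: path answered within %s seconds" % TIMEOUT)

The separate process is there because a stat() on a hard NFS mount whose server has
disappeared can block indefinitely, so a plain inline call might never return.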