
Hi oVirt land

Hope you are well. Got a bit of an issue, actually a big issue. We had some sort of outage ("dip"). All the VMs are still running, but some of the hosts show "Unassigned" or "NonResponsive". All the hosts were showing Up and were fine before the dip. So I increased vdsHeartbeatInSecond to 240 - no luck. I still get a timeout in the engine log even though I can connect to that host from the engine, using nc to test port 54321. I also restarted vdsmd and rebooted the host, with no luck.

nc -v someserver 54321
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 172.40.2.172:54321.

2020-08-05 20:20:34,256+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM someserver command Get Host Capabilities failed: Message timeout which can be caused by communication issues

Any troubleshooting ideas will be gladly appreciated.

Regards
Nar
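A quick sanity check on the host side is whether VDSM itself answers locally and is listening - a sketch, assuming the vdsm-client tool that ships with VDSM is installed on the host:

# on the affected host: does VDSM answer a local API call?
vdsm-client Host getCapabilities | head
# and is VDSM actually listening on its port?
ss -tlnp | grep 54321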

After rebooting the node, have you "marked" it as rebooted?

Best Regards,
Strahil Nikolov

Hi Strahil

Hope you are well. I get the following error when I try to confirm the reboot:

Error while executing action: Cannot confirm 'Host has been rebooted' Host. Valid Host statuses are "Non operational", "Maintenance" or "Connecting".

And I can't put it into maintenance; the only options are "restart" or "stop".

Regards
Nar

Can you check for errors on the affected host? Most probably you need the vdsm logs.

Best Regards,
Strahil Nikolov
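For example, something along these lines on the affected host (a sketch; the paths are the VDSM defaults on EL7):

# recent errors/warnings in the VDSM logs
grep -E ' (ERROR|WARN) ' /var/log/vdsm/vdsm.log | tail -n 50
grep -iE 'error|warn' /var/log/vdsm/supervdsm.log | tail -n 50
# and the service itself
journalctl -u vdsmd --since today --no-pager | tail -n 50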

Restarted vdsmd on the host.

mom.log:

2020-08-06 07:21:19,053 - mom.GuestManager - INFO - Guest Manager ending
2020-08-06 07:21:20,483 - mom.HostMonitor - INFO - Host Monitor ending
2020-08-06 07:21:24,795 - mom - INFO - MOM starting
2020-08-06 07:21:24,833 - mom - INFO - hypervisor interface vdsmjsonrpcclient
2020-08-06 07:21:24,833 - mom.HostMonitor - INFO - Host Monitor starting
2020-08-06 07:21:24,880 - mom - ERROR - Failed to initialize MOM threads
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mom/__init__.py", line 29, in run
    hypervisor_iface = self.get_hypervisor_interface()
  File "/usr/lib/python2.7/site-packages/mom/__init__.py", line 217, in get_hypervisor_interface
    return module.instance(self.config)
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcclientInterface.py", line 96, in instance
    return JsonRpcVdsmClientInterface()
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcclientInterface.py", line 31, in __init__
    self._vdsm_api = client.connect(host="localhost")
  File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 157, in connect
    raise ConnectionError(host, port, use_tls, timeout, e)
ConnectionError: Connection to localhost:54321 with use_tls=True, timeout=60 failed: [Errno 111] Connection refused
2020-08-06 07:21:30,085 - mom - INFO - MOM starting
2020-08-06 07:21:30,122 - mom.HostMonitor - INFO - Host Monitor starting
2020-08-06 07:21:30,123 - mom - INFO - hypervisor interface vdsmjsonrpcclient
2020-08-06 07:21:30,217 - mom.HostMonitor - INFO - HostMonitor is ready
2020-08-06 07:21:30,221 - mom.GuestManager - INFO - Guest Manager starting: multi-thread
2020-08-06 07:21:30,226 - mom.Policy - INFO - Loaded policy '00-defines'
2020-08-06 07:21:30,228 - mom.Policy - INFO - Loaded policy '01-parameters'
2020-08-06 07:21:30,241 - mom.Policy - INFO - Loaded policy '02-balloon'
2020-08-06 07:21:30,263 - mom.Policy - INFO - Loaded policy '03-ksm'
2020-08-06 07:21:30,290 - mom.Policy - INFO - Loaded policy '04-cputune'
2020-08-06 07:21:30,321 - mom.Policy - INFO - Loaded policy '05-iotune'
2020-08-06 07:21:30,321 - mom.PolicyEngine - INFO - Policy Engine starting
2020-08-06 07:21:30,322 - mom.RPCServer - INFO - Using unix socket /var/run/vdsm/mom-vdsm.sock
2020-08-06 07:21:30,323 - mom.RPCServer - INFO - RPC Server starting
2020-08-06 07:21:40,692 - mom.RPCServer - INFO - ping()
2020-08-06 07:21:40,692 - mom.RPCServer - INFO - getStatistics()
2020-08-06 07:21:45,356 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:0 merge_across_nodes:1 run:0 sleep_millisecs:0
2020-08-06 07:21:55,838 - mom.RPCServer - INFO - ping()
2020-08-06 07:21:55,839 - mom.RPCServer - INFO - getStatistics()

supervdsm.log:

MainProcess|jsonrpc/3::DEBUG::2020-08-05 20:11:14,139::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call ksmTune with ({u'run': 0, u'merge_across_nodes': 1},) {}
MainProcess|jsonrpc/3::DEBUG::2020-08-05 20:11:14,139::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None
MainProcess::DEBUG::2020-08-06 07:21:25,279::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call multipath_status with (<vdsm.supervdsm_server._SuperVdsm object at 0x7fcbf8dd0450>,) {}
MainProcess::DEBUG::2020-08-06 07:21:25,279::logutils::319::root::(_report_stats) ThreadedHandler is ok in the last 40234 seconds (max pending: 3)
MainProcess::DEBUG::2020-08-06 07:21:25,279::commands::198::storage.Misc.excCmd::(execCmd) /usr/bin/taskset --cpu-list 0-95 /usr/sbin/dmsetup status --target multipath (cwd None)
MainProcess::DEBUG::2020-08-06 07:21:25,283::commands::219::storage.Misc.excCmd::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainProcess::DEBUG::2020-08-06 07:21:25,289::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return multipath_status with {u'T1_58886_2121': [PathStatus(name=u'sdd', status=u'A'), PathStatus(name=u'sdm', status=u'A')], u'T0_someserver_boot_58886_20c2': [PathStatus(name=u'sdi', status=u'A'), PathStatus(name=u'sdr', status=u'A')], u'T0_R4_UCS_MOB1P_DIGIT_58886_20c8': [PathStatus(name=u'sdg', status=u'A'), PathStatus(name=u'sdp', status=u'A')], u'T0_58886_215d': [PathStatus(name=u'sdb', status=u'A'), PathStatus(name=u'sdk', status=u'A')], u'T0_R4_UCS_MOB1P_DIGIT_58886_20c7': [PathStatus(name=u'sdf', status=u'A'), PathStatus(name=u'sdo', status=u'A')], u'T0_58886_20b8': [PathStatus(name=u'sde', status=u'A'), PathStatus(name=u'sdn', status=u'A')], u'T0_58886_208a': [PathStatus(name=u'sdc', status=u'A'), PathStatus(name=u'sdl', status=u'A')], u'T0_58886_2124': [PathStatus(name=u'sdh', status=u'A'), PathStatus(name=u'sdq', status=u'A')], u'T0_58886_215c': [PathStatus(name=u'sda', status=u'A'), PathStatus(name=u'sdj', status=u'A')]}
MainProcess|hsm/init::DEBUG::2020-08-06 07:21:25,383::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call hbaRescan with (<vdsm.supervdsm_server._SuperVdsm object at 0x7fcbf8dd0450>,) {}
MainProcess|hsm/init::DEBUG::2020-08-06 07:21:25,384::commands::198::storage.HBA::(execCmd) /usr/bin/taskset --cpu-list 0-95 /usr/libexec/vdsm/fc-scan (cwd None)
MainProcess|hsm/init::DEBUG::2020-08-06 07:21:25,450::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return hbaRescan with None
MainProcess|jsonrpc/3::DEBUG::2020-08-06 07:21:45,359::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) call ksmTune with ({u'run': 0, u'merge_across_nodes': 1},) {}
MainProcess|jsonrpc/3::DEBUG::2020-08-06 07:21:45,359::supervdsm_server::106::SuperVdsm.ServerCallback::(wrapper) return ksmTune with None

vdsm.log:

2020-08-06 07:21:22,529+0200 INFO (MainThread) [vds] Received signal 15, shutting down (vdsmd:71)
2020-08-06 07:21:22,672+0200 INFO (MainThread) [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:442)
2020-08-06 07:21:22,674+0200 INFO (MainThread) [vds] Stopping http server (http:80)
2020-08-06 07:21:22,674+0200 INFO (http) [vds] Server stopped (http:70)
2020-08-06 07:21:22,674+0200 INFO (MainThread) [root] Unregistering all secrets (secret:93)
2020-08-06 07:21:22,674+0200 INFO (MainThread) [vds] Stopping QEMU-GA poller (qemuguestagent:158)
2020-08-06 07:21:22,675+0200 INFO (MainThread) [vdsm.api] START prepareForShutdown(options=None) from=internal, task_id=251a8da8-ef3c-4740-842a-42bca0541551 (api:48)
2020-08-06 07:21:22,732+0200 INFO (MainThread) [storage.Monitor] Shutting down domain monitors (monitor:223)
2020-08-06 07:21:22,732+0200 INFO (MainThread) [storage.check] Stopping check service (check:104)
2020-08-06 07:21:22,732+0200 INFO (check/loop) [storage.asyncevent] Stopping <EventLoop running=False closed=False at 0x140458524164304> (asyncevent:220)
2020-08-06 07:21:22,732+0200 INFO (MainThread) [storage.udev] Stopping multipath event listener (udev:149)
2020-08-06 07:21:22,733+0200 INFO (MainThread) [vdsm.api] FINISH prepareForShutdown return=None from=internal, task_id=251a8da8-ef3c-4740-842a-42bca0541551 (api:54)
2020-08-06 07:21:22,733+0200 INFO (MainThread) [vds] Stopping threads (vdsmd:163)
2020-08-06 07:21:22,733+0200 INFO (MainThread) [vds] Exiting (vdsmd:174)
2020-08-06 07:21:25,134+0200 INFO (MainThread) [vds] (PID: 217998) I am the actual vdsm 4.30.43.1 someserver (3.10.0-1127.el7.x86_64) (vdsmd:152)
2020-08-06 07:21:25,135+0200 INFO (MainThread) [vds] VDSM will run with cpu affinity: frozenset([1]) (vdsmd:264)
2020-08-06 07:21:25,138+0200 INFO (MainThread) [storage.HSM] START HSM init (hsm:367)
2020-08-06 07:21:25,140+0200 INFO (MainThread) [storage.HSM] Creating data-center mount directory '/rhev/data-center/mnt' (hsm:372)
2020-08-06 07:21:25,140+0200 INFO (MainThread) [storage.fileUtils] Creating directory: /rhev/data-center/mnt mode: None (fileUtils:199)
2020-08-06 07:21:25,269+0200 INFO (MainThread) [storage.udev] Registering multipath event monitor <vdsm.storage.mpathhealth.Monitor object at 0x7fa060286d50> (udev:182)
2020-08-06 07:21:25,269+0200 INFO (MainThread) [storage.udev] Starting multipath event listener (udev:116)
2020-08-06 07:21:25,289+0200 INFO (MainThread) [storage.check] Starting check service (check:91)
2020-08-06 07:21:25,293+0200 INFO (MainThread) [storage.Dispatcher] Starting StorageDispatcher... (dispatcher:47)
2020-08-06 07:21:25,294+0200 INFO (check/loop) [storage.asyncevent] Starting <EventLoop running=True closed=False at 0x140326784999632> (asyncevent:125)
2020-08-06 07:21:25,316+0200 INFO (MainThread) [vdsm.api] START registerDomainStateChangeCallback(callbackFunc=<functools.partial object at 0x7fa06026a628>) from=internal, task_id=c31419c5-486f-4bdd-b206-207e61a64825 (api:48)
2020-08-06 07:21:25,316+0200 INFO (MainThread) [vdsm.api] FINISH registerDomainStateChangeCallback return=None from=internal, task_id=c31419c5-486f-4bdd-b206-207e61a64825 (api:54)
2020-08-06 07:21:25,316+0200 INFO (MainThread) [MOM] Preparing MOM interface (momIF:45)
2020-08-06 07:21:25,316+0200 INFO (MainThread) [MOM] MOM: Using named unix socket: /var/run/vdsm/mom-vdsm.sock (momIF:55)
2020-08-06 07:21:25,317+0200 INFO (MainThread) [root] Unregistering all secrets (secret:93)
2020-08-06 07:21:25,330+0200 INFO (MainThread) [vds] Setting channels' timeout to 30 seconds. (vmchannels:224)
2020-08-06 07:21:25,331+0200 INFO (MainThread) [vds] Starting QEMU-GA poller (qemuguestagent:151)
2020-08-06 07:21:25,333+0200 INFO (MainThread) [vds.MultiProtocolAcceptor] Listening at :::54321 (protocoldetector:183)
2020-08-06 07:21:25,402+0200 INFO (http) [vds] Server running (http:59)
2020-08-06 07:21:25,471+0200 INFO (vmrecovery) [vds] VM recovered: 4c044f9e-afdc-4ee7-bfa4-4b2b94115818 (recovery:97)
2020-08-06 07:21:25,472+0200 INFO (vmrecovery) [vds] recovery [1:1/3]: recovered domain 4c044f9e-afdc-4ee7-bfa4-4b2b94115818 (recovery:123)
2020-08-06 07:21:25,480+0200 INFO (hsm/init) [storage.LVM] Overriding read_only mode current=True override=False (lvm:402)
2020-08-06 07:21:25,507+0200 INFO (vm/4c044f9e) [virt.vm] (vmId='4c044f9e-afdc-4ee7-bfa4-4b2b94115818') VM wrapper has started (vm:2786)
2020-08-06 07:21:25,510+0200 INFO (vm/4c044f9e) [vdsm.api] START getVolumeSize(sdUUID='531b2674-0bfa-4d71-bc58-85e280e867b9', spUUID='a5f4c7eb-3640-49f9-b644-57f10e9f5279', imgUUID='c7bab10b-9721-4713-84d4-57180f8ce2f4', volUUID='1ed27351-77f9-4b69-acbe-a9bc35ed147a', options=None) from=internal, task_id=22cb50a2-fba6-4430-a47e-b1f6da73e7ed (api:48)
2020-08-06 07:21:25,516+0200 INFO (vmrecovery) [vds] VM recovered: 543765d5-daee-41da-a9d4-34f36c99be82 (recovery:97)
2020-08-06 07:21:25,517+0200 INFO (vmrecovery) [vds] recovery [1:2/3]: recovered domain 543765d5-daee-41da-a9d4-34f36c99be82 (recovery:123)
2020-08-06 07:21:25,530+0200 INFO (vm/543765d5) [virt.vm] (vmId='543765d5-daee-41da-a9d4-34f36c99be82') VM wrapper has started (vm:2786)
2020-08-06 07:21:25,532+0200 INFO (vm/543765d5) [vdsm.api] START getVolumeSize(sdUUID='531b2674-0bfa-4d71-bc58-85e280e867b9', spUUID='a5f4c7eb-3640-49f9-b644-57f10e9f5279', imgUUID='e4fcb0fc-3900-4422-a441-2147bf8571f3', volUUID='12b3489b-90f5-4b94-aec1-1725296cd66e', options=None) from=internal, task_id=a1c0460b-4195-4b1c-af8a-d34474c5fa2e (api:48)
2020-08-06 07:21:25,545+0200 INFO (vmrecovery) [vds] VM recovered: 4b472d29-5f40-438f-8bca-9b2affbd3d37 (recovery:97)
2020-08-06 07:21:25,546+0200 INFO (vmrecovery) [vds] recovery [1:3/3]: recovered domain 4b472d29-5f40-438f-8bca-9b2affbd3d37 (recovery:123)
2020-08-06 07:21:25,546+0200 INFO (vmrecovery) [vds] recovery: waiting for 3 domains to go up (clientIF:717)
2020-08-06 07:21:25,546+0200 INFO (vm/4b472d29) [virt.vm] (vmId='4b472d29-5f40-438f-8bca-9b2affbd3d37') VM wrapper has started (vm:2786)
2020-08-06 07:21:25,546+0200 INFO (periodic/1) [vdsm.api] START repoStats(domains=()) from=internal, task_id=d8e1926b-d289-4dc1-a81e-ec1592083751 (api:48)
2020-08-06 07:21:25,547+0200 INFO (periodic/1) [vdsm.api] FINISH repoStats return={} from=internal, task_id=d8e1926b-d289-4dc1-a81e-ec1592083751 (api:54)
2020-08-06 07:22:41,636+0200 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:723)
2020-08-06 07:22:45,317+0200 INFO (jsonrpc/7) [api.host] START getAllVmStats() from=::ffff:127.0.0.1,41540 (api:48)
2020-08-06 07:22:45,317+0200 INFO (jsonrpc/7) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::ffff:127.0.0.1,41540 (api:54)
2020-08-06 07:22:45,318+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:312)
2020-08-06 07:22:46,642+0200 INFO (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList(options=None) from=internal, task_id=ad72b73f-d970-4367-a037-94bef589da57 (api:48)
2020-08-06 07:22:46,642+0200 INFO (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=ad72b73f-d970-4367-a037-94bef589da57 (api:54)
2020-08-06 07:22:46,642+0200 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:723)
2020-08-06 07:22:51,648+0200 INFO (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList(options=None) from=internal, task_id=f14156c0-a819-4fec-aef7-f5f5b1dd0144 (api:48)
2020-08-06 07:22:51,648+0200 INFO (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=f14156c0-a819-4fec-aef7-f5f5b1dd0144 (api:54)
2020-08-06 07:22:51,649+0200 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:723)
2020-08-06 07:22:56,409+0200 INFO (periodic/1) [vdsm.api] START repoStats(domains=()) from=internal, task_id=75381d37-a06d-4174-b956-f5e7f108ea39 (api:48)
2020-08-06 07:22:56,409+0200 INFO (periodic/1) [vdsm.api] FINISH repoStats return={} from=internal, task_id=75381d37-a06d-4174-b956-f5e7f108ea39 (api:54)
2020-08-06 07:22:56,654+0200 INFO (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList(options=None) from=internal, task_id=8f98ae7e-a681-4af3-a4d8-12f0501c8f68 (api:48)
2020-08-06 07:22:56,654+0200 INFO (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=8f98ae7e-a681-4af3-a4d8-12f0501c8f68 (api:54)
2020-08-06 07:22:56,655+0200 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:723)
2020-08-06 07:23:00,337+0200 INFO (jsonrpc/0) [api.host] START getAllVmStats() from=::ffff:127.0.0.1,41540 (api:48)
2020-08-06 07:23:00,337+0200 INFO (jsonrpc/0) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::ffff:127.0.0.1,41540 (api:54)
2020-08-06 07:23:00,337+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:312)
2020-08-06 07:23:01,660+0200 INFO (vmrecovery) [vdsm.api] START getConnectedStoragePoolsList(options=None) from=internal, task_id=f984accc-8e20-48b6-ba24-e00418d28043 (api:48)
2020-08-06 07:23:01,660+0200 INFO (vmrecovery) [vdsm.api] FINISH getConnectedStoragePoolsList return={'poollist': []} from=internal, task_id=f984accc-8e20-48b6-ba24-e00418d28043 (api:54)
2020-08-06 07:23:01,660+0200 INFO (vmrecovery) [vds] recovery: waiting for storage pool to go up (clientIF:723)

I see "[vds] recovery: waiting for storage pool to go up (clientIF:723)" a lot.

Regards
Nardus
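The "waiting for storage pool to go up" loop suggests the host's VDSM is up but has not been reconnected to the storage pool, which is normally triggered by the engine. One way to see whether the engine has any TCP session open to VDSM at all - a sketch, run on the host:

# established connections to the VDSM port; the engine's IP should show up here
ss -tnp | grep ':54321'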

Also seeing this in the engine:

Aug 6, 2020, 7:37:17 AM  VDSM someserver command Get Host Capabilities failed: Message timeout which can be caused by communication issues

Hi Nardus,

You might have hit an issue I have been hunting for some time ([1] and [2]). [1] could not be properly resolved because at the time I was not able to recreate the issue on a dev setup. I suspect [2] is related.

Would you be able to prepare a thread dump from your engine instance? Additionally, please check for potential libvirt errors/warnings. Can you also paste the output of:

sudo yum list installed | grep vdsm
sudo yum list installed | grep ovirt-engine
sudo yum list installed | grep libvirt

Usually, according to previous reports, restarting the engine helps to restore connectivity with hosts ... at least for some time.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1845152
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1846338

regards,
Artur
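For the libvirt errors/warnings, something like this on an affected host might do (a sketch; the libvirtd unit name is the EL7 default):

# recent libvirt errors/warnings from the journal
sudo journalctl -u libvirtd --no-pager | grep -iE 'error|warning' | tail -n 50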
--
Artur Socha
Senior Software Engineer, RHV
Red Hat

Hi

[root@engine-aa-1-01 ovirt-engine]# sudo yum list installed | grep vdsm
vdsm-jsonrpc-java.noarch                          1.4.18-1.el7    @ovirt-4.3
[root@engine-aa-1-01 ovirt-engine]# sudo yum list installed | grep ovirt-engine
ovirt-engine.noarch                               4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-api-explorer.noarch                  0.0.5-1.el7     @ovirt-4.3
ovirt-engine-backend.noarch                       4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-dbscripts.noarch                     4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-dwh.noarch                           4.3.6-1.el7     @ovirt-4.3
ovirt-engine-dwh-setup.noarch                     4.3.6-1.el7     @ovirt-4.3
ovirt-engine-extension-aaa-jdbc.noarch            1.1.10-1.el7    @ovirt-4.3
ovirt-engine-extension-aaa-ldap.noarch            1.3.10-1.el7    @ovirt-4.3
ovirt-engine-extension-aaa-ldap-setup.noarch
ovirt-engine-extensions-api-impl.noarch           4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-metrics.noarch                       1.3.4.1-1.el7   @ovirt-4.3
ovirt-engine-restapi.noarch                       4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-setup.noarch                         4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-setup-base.noarch                    4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-setup-plugin-cinderlib.noarch        4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-setup-plugin-ovirt-engine.noarch
ovirt-engine-setup-plugin-ovirt-engine-common.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper.noarch
ovirt-engine-setup-plugin-websocket-proxy.noarch
ovirt-engine-tools.noarch                         4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-tools-backup.noarch                  4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-ui-extensions.noarch                 1.0.10-1.el7    @ovirt-4.3
ovirt-engine-vmconsole-proxy-helper.noarch        4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-webadmin-portal.noarch               4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-websocket-proxy.noarch               4.3.6.7-1.el7   @ovirt-4.3
ovirt-engine-wildfly.x86_64                       17.0.1-1.el7    @ovirt-4.3
ovirt-engine-wildfly-overlay.noarch               17.0.1-1.el7    @ovirt-4.3
python-ovirt-engine-sdk4.x86_64                   4.3.2-2.el7     @ovirt-4.3
python2-ovirt-engine-lib.noarch                   4.3.6.7-1.el7   @ovirt-4.3
[root@engine-aa-1-01 ovirt-engine]# sudo yum list installed | grep libvirt
[root@engine-aa-1-01 ovirt-engine]#

I can send more info if needed. And yes, it looks like sometimes it helps if you restart the engine.

Regards
Nardus
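Worth noting: the vdsm and libvirt greps above ran on the engine machine, which only carries vdsm-jsonrpc-java and no libvirt at all; the host-side versions would presumably come from an affected host, e.g.:

# on an affected host, not the engine:
rpm -qa | grep -E 'vdsm|libvirt'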

Hi

I can create a thread dump; please send details on how to.

Regards
Nardus

Sure thing. On the engine host, please find the jboss pid. You can use this command:

ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'

or the jps tool from the JDK. Sample output on my dev environment is:

$ jps
64853 jboss-modules.jar
196217 Jps

Then use jstack from the JDK:

jstack <pid> > your_engine_thread_dump.txt

2 or 3 dumps taken at approximately 5-minute intervals would be even more useful. Here you can find even more options: https://www.baeldung.com/java-thread-dump

Artur
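Putting that together, one way to capture three dumps five minutes apart - a sketch using the same pid lookup as above:

PID=$(ps -ef | grep jboss | grep -v grep | awk '{ print $2 }')
for i in 1 2 3; do
    jstack "$PID" > engine_thread_dump_$i.txt   # one dump per iteration
    sleep 300                                   # wait 5 minutes between dumps
done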
--
Artur Socha
Senior Software Engineer, RHV
Red Hat

Hi Artur

Please find attached; also let me know if I need to rerun. They are 5 minutes apart.

[root@engine-aa-1-01 ovirt-engine]# ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
27390
[root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_1.txt
[root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_2.txt
[root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_3.txt

Regards
Nar

Thanks Nardus,

After a quick look I found what I was suspecting - there are way too many threads in the Blocked state. I don't know the reason yet, but this is very helpful. I'll let you know about the findings/investigation. Meanwhile, you may try restarting the engine as a (very brute and ugly) workaround. You may also try to set up a slightly bigger thread pool - that may save you some time until the next hiccup. However, please be aware that this may come at a cost in memory usage and higher CPU usage (due to increased context switching). Here are some docs:

# Specify the thread pool size for jboss managed scheduled executor service used by commands to periodically execute
# methods. It is generally not necessary to increase the number of threads in this thread pool. To change the value
# permanently create a conf file 99-engine-scheduled-thread-pool.conf in /etc/ovirt-engine/engine.conf.d/
ENGINE_SCHEDULED_THREAD_POOL_SIZE=100

A.
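Concretely, that would look something like the following (a sketch; 150 is an arbitrary example value, not a tested recommendation):

# /etc/ovirt-engine/engine.conf.d/99-engine-scheduled-thread-pool.conf
ENGINE_SCHEDULED_THREAD_POOL_SIZE=150

followed by an engine restart (systemctl restart ovirt-engine) for the change to take effect.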
--
Artur Socha
Senior Software Engineer, RHV
Red Hat

Hi Nardus,

I'm assuming that your setup was stable and you were able to run your VMs without problems. If so, then the below is not a solution to your problem; you should really check the engine and VDSM logs for the reasons why your hosts become NonResponsive. Most probably there is an underlying storage or network issue which prevents correct engine <-> host communication and which made your hosts NonResponsive. The solution below will just hide the issues you currently have.

If your problem started suddenly when you significantly increased the number of running VMs or decreased the number of available hosts, then you are suffering from those issues because of not having enough resources.

Regards,
Martin
Thanks Nardus, After a quick look I found what I was suspecting - there are way too many threads in Blocked state. I don't know yet the reason but this is very helpful. I'll let you know about the findings/investigation. Meanwhile, you may try restarting the engine as (a very brute and ugly) workaround). You may try to setup slightly bigger thread pool - may save you some time until the next hiccup. However, please be aware that this may come with the cost in memory usage and higher cpu usage (due to increased context switching) Here are some docs:
# Specify the thread pool size for jboss managed scheduled executor service used by commands to periodically execute # methods. It is generally not necessary to increase the number of threads in this thread pool. To change the value # permanently create a conf file 99-engine-scheduled-thread-pool.conf in /etc/ovirt-engine/engine.conf.d/ ENGINE_SCHEDULED_THREAD_POOL_SIZE=100
A.
On Thu, Aug 6, 2020 at 4:19 PM Nardus Geldenhuys <nardusg@gmail.com> wrote:
Hi Artur
Please find attached, also let me know if I need to rerun. They 5 min apart
[root@engine-aa-1-01 ovirt-engine]# ps -ef | grep jboss | grep -v grep | awk '{ print $2 }' 27390 [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_1.txt [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_2.txt [root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_3.txt
Regards
Nar
On Thu, 6 Aug 2020 at 15:55, Artur Socha <asocha@redhat.com> wrote:
Sure thing. On engine host please find jboss pid. You can use this command:
ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
or jps tool from jdk. Sample output on my dev environment is:
± % jps !2860 64853 jboss-modules.jar 196217 Jps
Then use jstack from jdk: jstack <pid> > your_engine_thread_dump.txt 2 or 3 dumps taken in approximately 5 minutes intervals would be even more useful.
Here you can find even more options https://www.baeldung.com/java-thread-dump
Artur
On Thu, Aug 6, 2020 at 3:15 PM Nardus Geldenhuys <nardusg@gmail.com> wrote:
Hi
Can create thread dump, please send details on howto.
Regards
Nardus
On Thu, 6 Aug 2020 at 14:17, Artur Socha <asocha@redhat.com> wrote:
Hi Nardus, You might have hit an issue I have been hunting for some time ( [1] and [2] ). [1] could not be properly resolved because at a time was not able to recreate an issue on dev setup. I suspect [2] is related.
Would you be able to prepare a thread dump from your engine instance? Additionally, please check for potential libvirt errors/warnings. Can you also paste the output of: sudo yum list installed | grep vdsm sudo yum list installed | grep ovirt-engine sudo yum list installed | grep libvirt
Usually, according to previous reports, restarting the engine helps to restore connectivity with hosts ... at least for some time.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1845152 [2] https://bugzilla.redhat.com/show_bug.cgi?id=1846338
regards, Artur
On Thu, Aug 6, 2020 at 8:01 AM Nardus Geldenhuys <nardusg@gmail.com> wrote:
Also see this in engine:
Aug 6, 2020, 7:37:17 AM VDSM someserver command Get Host Capabilities failed: Message timeout which can be caused by communication issues
On Thu, 6 Aug 2020 at 07:09, Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
> Can you fheck for errors on the affected host. Most probably you > need the vdsm logs. > > Best Regards, > Strahil Nikolov > > На 6 август 2020 г. 7:40:23 GMT+03:00, Nardus Geldenhuys < > nardusg@gmail.com> написа: > >Hi Strahil > > > >Hope you are well. I get the following error when I tried to confirm > >reboot: > > > >Error while executing action: Cannot confirm 'Host has been > rebooted' > >Host. > >Valid Host statuses are "Non operational", "Maintenance" or > >"Connecting". > > > >And I can't put it in maintenance, only option is "restart" or > "stop". > > > >Regards > > > >Nar > > > >On Thu, 6 Aug 2020 at 06:16, Strahil Nikolov <hunter86_bg@yahoo.com > > > >wrote: > > > >> After rebooting the node, have you "marked" it that it was > rebooted ? > >> > >> Best Regards, > >> Strahil Nikolov > >> > >> На 5 август 2020 г. 21:29:04 GMT+03:00, Nardus Geldenhuys < > >> nardusg@gmail.com> написа: > >> >Hi oVirt land > >> > > >> >Hope you are well. Got a bit of an issue, actually a big issue. > We > >had > >> >some > >> >sort of dip of some sort. All the VM's is still running, but > some of > >> >the > >> >hosts is show "Unassigned" or "NonResponsive". So all the hosts > was > >> >showing > >> >UP and was fine before our dip. So I did increase > >vdsHeartbeatInSecond > >> >to > >> >240, no luck. > >> > > >> >I still get a timeout on the engine lock even thou I can connect > to > >> >that > >> >host from the engine using nc to test to port 54321. I also did > >restart > >> >vdsmd and also rebooted the host with no luck. > >> > > >> > nc -v someserver 54321 > >> >Ncat: Version 7.50 ( https://nmap.org/ncat ) > >> >Ncat: Connected to 172.40.2.172:54321. > >> > > >> >2020-08-05 20:20:34,256+02 ERROR > >> > > >>[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > >> >(EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID: > >> >VDS_BROKER_COMMAND_FAILURE(10,802), VDSM someserver command Get > Host > >> >Capabilities failed: Message timeout which can be caused by > >> >communication > >> >issues > >> > > >> >Any troubleshoot ideas will be gladly appreciated. > >> > > >> >Regards > >> > > >> >Nar > >> > _______________________________________________ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-leave@ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/C4HB2J3MH76FI2...
-- Artur Socha Senior Software Engineer, RHV Red Hat

Hi Nardus,
There is one more thing to be checked.

1) Could you check if there are any packets sent from the affected host to the engine?
On the host:

# outgoing traffic
sudo tcpdump -i <interface_name_on_host> -c 1000 -ttttnnvvS dst <engine_host>

2) Same the other way round: check if there are packets received on the engine side from the affected host.
On the engine:

# incoming traffic
sudo tcpdump -i <interface_name_on_engine> -c 1000 -ttttnnvvS src <affected_host>

Artur

On Thu, Aug 6, 2020 at 4:51 PM Artur Socha <asocha@redhat.com> wrote:
Thanks Nardus,
After a quick look I found what I was suspecting - there are way too many threads in Blocked state. I don't know the reason yet, but this is very helpful. I'll let you know about the findings/investigation. Meanwhile, you may try restarting the engine as a (very brute and ugly) workaround. You may also try to set up a slightly bigger thread pool - it may save you some time until the next hiccup. However, please be aware that this may come at the cost of higher memory usage and higher CPU usage (due to increased context switching). Here are some docs:
# Specify the thread pool size for jboss managed scheduled executor service used by commands to periodically execute
# methods. It is generally not necessary to increase the number of threads in this thread pool. To change the value
# permanently create a conf file 99-engine-scheduled-thread-pool.conf in /etc/ovirt-engine/engine.conf.d/
ENGINE_SCHEDULED_THREAD_POOL_SIZE=100
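For reference, a minimal sketch of such an override, assuming the drop-in file described above (150 is only an illustrative value, not a recommendation):

# create the drop-in config, then restart the engine to apply it
cat > /etc/ovirt-engine/engine.conf.d/99-engine-scheduled-thread-pool.conf <<'EOF'
ENGINE_SCHEDULED_THREAD_POOL_SIZE=150
EOF
sudo systemctl restart ovirt-engine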
A.
On Thu, Aug 6, 2020 at 4:19 PM Nardus Geldenhuys <nardusg@gmail.com> wrote:
Hi Artur
Please find attached; also let me know if I need to rerun. They are 5 minutes apart.
[root@engine-aa-1-01 ovirt-engine]# ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
27390
[root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_1.txt
[root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_2.txt
[root@engine-aa-1-01 ovirt-engine]# jstack -F 27390 > your_engine_thread_dump_3.txt
Regards
Nar
On Thu, 6 Aug 2020 at 15:55, Artur Socha <asocha@redhat.com> wrote:
Sure thing. On the engine host, find the jboss pid. You can use this command:
ps -ef | grep jboss | grep -v grep | awk '{ print $2 }'
or the jps tool from the JDK. Sample output on my dev environment is:
± % jps
64853 jboss-modules.jar
196217 Jps
Then use jstack from the JDK:

jstack <pid> > your_engine_thread_dump.txt

2 or 3 dumps taken at approximately 5-minute intervals would be even more useful.
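For reference, a minimal loop for taking those dumps, assuming bash and the pid lookup shown above (the file names are arbitrary):

PID=$(ps -ef | grep jboss | grep -v grep | awk '{ print $2 }')
for i in 1 2 3; do
    jstack "$PID" > "engine_thread_dump_$i.txt"
    sleep 300   # wait 5 minutes between dumps
done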
Here you can find even more options https://www.baeldung.com/java-thread-dump
Artur
On Thu, Aug 6, 2020 at 3:15 PM Nardus Geldenhuys <nardusg@gmail.com> wrote:
Hi
Can create a thread dump; please send details on how to.
Regards
Nardus

Hi Artur

Hope you are well, please see below; this is after I restarted the engine:

host:
[root@ovirt-aa-1-21:~]# tcpdump -i ovirtmgmt -c 1000 -ttttnnvvS dst ovirt-engine-aa-1-01
tcpdump: listening on ovirtmgmt, link-type EN10MB (Ethernet), capture size 262144 bytes
2020-08-07 12:09:32.553543 ARP, Ethernet (len 6), IPv4 (len 4), Reply 172.140.220.111 is-at 00:25:b5:04:00:25, length 28
2020-08-07 12:10:05.584594 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [S.], cksum 0x5cd5 (incorrect -> 0xc8ca), seq 4036072905, ack 3265413231, win 28960, options [mss 1460,sackOK,TS val 3039504636 ecr 341411251,nop,wscale 7], length 0
2020-08-07 12:10:10.589276 ARP, Ethernet (len 6), IPv4 (len 4), Reply 172.140.220.111 is-at 00:25:b5:04:00:25, length 28
2020-08-07 12:10:15.596230 IP (tos 0x0, ttl 64, id 48438, offset 0, flags [DF], proto TCP (6), length 52)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [F.], cksum 0x5ccd (incorrect -> 0x40b8), seq 4036072906, ack 3265413231, win 227, options [nop,nop,TS val 3039514647 ecr 341411251], length 0
2020-08-07 12:10:20.596429 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.140.220.23 tell 172.140.220.111, length 28
2020-08-07 12:10:20.663699 IP (tos 0x0, ttl 64, id 64726, offset 0, flags [DF], proto TCP (6), length 40)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [R], cksum 0x1d20 (correct), seq 4036072907, win 0, length 0

engine:
[root@ovirt-engine-aa-1-01 ~]# tcpdump -i eth0 -c 1000 -ttttnnvvS src ovirt-aa-1-21
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
2020-08-07 12:09:31.891242 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [S.], cksum 0xc8ca (correct), seq 4036072905, ack 3265413231, win 28960, options [mss 1460,sackOK,TS val 3039504636 ecr 341411251,nop,wscale 7], length 0
2020-08-07 12:09:36.895502 ARP, Ethernet (len 6), IPv4 (len 4), Reply 172.140.220.111 is-at 00:25:b5:04:00:25, length 42
2020-08-07 12:09:41.901981 IP (tos 0x0, ttl 64, id 48438, offset 0, flags [DF], proto TCP (6), length 52)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [F.], cksum 0x40b8 (correct), seq 4036072906, ack 3265413231, win 227, options [nop,nop,TS val 3039514647 ecr 341411251], length 0
2020-08-07 12:09:46.901681 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.140.220.23 tell 172.140.220.111, length 42
2020-08-07 12:09:46.968911 IP (tos 0x0, ttl 64, id 64726, offset 0, flags [DF], proto TCP (6), length 40)
    172.140.220.111.54321 > 172.140.220.23.56202: Flags [R], cksum 0x1d20 (correct), seq 4036072907, win 0, length 0

Regards

Nar
participants (4):
- Artur Socha
- Martin Perina
- Nardus Geldenhuys
- Strahil Nikolov