
Bryan,

I checked your logs and I see extensive ioprocess activity at that time, which seems to have caused getAllVmStats to be processed 7 seconds later. The call was invoked on the vdsm side much later than the heartbeat-exceeded exception we saw on the engine side. @Nir, please take a look.

Thanks,
Piotr

2017-09-13 04:07:07,126-0500 INFO (itmap/0) [IOProcessClient] Starting client ioprocess-10031 (__init__:330)
2017-09-13 04:07:07,176-0500 INFO (itmap/1) [IOProcessClient] Starting client ioprocess-10032 (__init__:330)
2017-09-13 04:07:07,222-0500 INFO (ioprocess/9466) [IOProcess] Starting ioprocess (__init__:452)
2017-09-13 04:07:07,226-0500 INFO (itmap/2) [IOProcessClient] Starting client ioprocess-10033 (__init__:330)
2017-09-13 04:07:07,279-0500 INFO (ioprocess/9474) [IOProcess] Starting ioprocess (__init__:452)
2017-09-13 04:07:07,284-0500 INFO (itmap/3) [IOProcessClient] Starting client ioprocess-10034 (__init__:330)
2017-09-13 04:07:07,332-0500 INFO (ioprocess/9482) [IOProcess] Starting ioprocess (__init__:452)
2017-09-13 04:07:07,340-0500 INFO (itmap/4) [IOProcessClient] Starting client ioprocess-10035 (__init__:330)
2017-09-13 04:07:07,395-0500 INFO (ioprocess/9502) [IOProcess] Starting ioprocess (__init__:452)
2017-09-13 04:07:07,401-0500 INFO (itmap/5) [IOProcessClient] Starting client ioprocess-10036 (__init__:330)
2017-09-13 04:07:07,461-0500 INFO (ioprocess/9523) [IOProcess] Starting ioprocess (__init__:452)
2017-09-13 04:07:07,469-0500 INFO (itmap/6) [IOProcessClient] Starting client ioprocess-10037 (__init__:330)
2017-09-13 04:07:07,525-0500 INFO (ioprocess/9538) [IOProcess] Starting ioprocess (__init__:452)
2017-09-13 04:07:07,532-0500 INFO (itmap/7) [IOProcessClient] Starting client ioprocess-10038 (__init__:330)
2017-09-13 04:07:07,579-0500 INFO (ioprocess/9556) [IOProcess] Starting ioprocess (__init__:452)
2017-09-13 04:07:07,583-0500 INFO (itmap/8) [IOProcessClient] Starting client ioprocess-10039 (__init__:330)
2017-09-13 04:07:07,628-0500 INFO (ioprocess/9569) [IOProcess] Starting ioprocess (__init__:452)
2017-09-13 04:07:07,642-0500 INFO (monitor/cacbac0) [storage.StorageDomain] Resource namespace 01_img_cacbac0c-4b08-46cd-ace8-b23266c92291 already registered (sd:727)
2017-09-13 04:07:07,642-0500 INFO (monitor/cacbac0) [storage.StorageDomain] Resource namespace 02_vol_cacbac0c-4b08-46cd-ace8-b23266c92291 already registered (sd:736)
2017-09-13 04:07:07,646-0500 INFO (monitor/e371d38) [storage.StorageDomain] Resource namespace 01_img_e371d380-7194-4950-b901-5f2aed5dfb35 already registered (sd:727)
2017-09-13 04:07:07,646-0500 INFO (monitor/e371d38) [storage.StorageDomain] Resource namespace 02_vol_e371d380-7194-4950-b901-5f2aed5dfb35 already registered (sd:736)
2017-09-13 04:07:07,655-0500 INFO (monitor/ea15319) [storage.StorageDomain] Resource namespace 01_img_ea153191-ac25-4984-aaec-4a59cbf383a5 already registered (sd:727)
2017-09-13 04:07:07,655-0500 INFO (monitor/ea15319) [storage.StorageDomain] Resource namespace 02_vol_ea153191-ac25-4984-aaec-4a59cbf383a5 already registered (sd:736)
2017-09-13 04:07:07,661-0500 INFO (monitor/58a7c5d) [storage.StorageDomain] Resource namespace 01_img_58a7c5dd-0b31-4066-ae05-8f541614dfde already registered (sd:727)
2017-09-13 04:07:07,661-0500 INFO (monitor/58a7c5d) [storage.StorageDomain] Resource namespace 02_vol_58a7c5dd-0b31-4066-ae05-8f541614dfde already registered (sd:736)
2017-09-13 04:07:07,665-0500 INFO (ioprocess/9582) [IOProcess] Starting ioprocess (__init__:452)
2017-09-13 04:07:07,668-0500 INFO (monitor/f9cab69) [storage.StorageDomain] Resource namespace 01_img_f9cab69e-ae3d-4b6d-8a02-389e4b1d8764 already registered (sd:727)
2017-09-13 04:07:07,668-0500 INFO (monitor/f9cab69) [storage.StorageDomain] Resource namespace 02_vol_f9cab69e-ae3d-4b6d-8a02-389e4b1d8764 already registered (sd:736)
2017-09-13 04:07:07,669-0500 INFO (monitor/f927ceb) [storage.StorageDomain] Resource namespace 01_img_f927ceb8-91d2-41bd-ba42-dc795395b6d0 already registered (sd:727)
2017-09-13 04:07:07,669-0500 INFO (monitor/f927ceb) [storage.StorageDomain] Resource namespace 02_vol_f927ceb8-91d2-41bd-ba42-dc795395b6d0 already registered (sd:736)
2017-09-13 04:07:07,688-0500 INFO (monitor/031c06e) [storage.StorageDomain] Resource namespace 01_img_031c06e4-9ba6-4572-8800-39da4a5420f0 already registered (sd:727)
2017-09-13 04:07:07,688-0500 INFO (monitor/031c06e) [storage.StorageDomain] Resource namespace 02_vol_031c06e4-9ba6-4572-8800-39da4a5420f0 already registered (sd:736)
2017-09-13 04:07:07,725-0500 INFO (monitor/804be30) [storage.StorageDomain] Resource namespace 01_img_804be302-3763-403b-96db-febebcbfd778 already registered (sd:727)
2017-09-13 04:07:07,725-0500 INFO (monitor/804be30) [storage.StorageDomain] Resource namespace 02_vol_804be302-3763-403b-96db-febebcbfd778 already registered (sd:736)
2017-09-13 04:07:14,688-0500 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.02 seconds (__init__:539)
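A minimal sketch of how such a stall can be spotted mechanically: it parses the vdsm.log timestamp format shown above and reports any gap between consecutive entries longer than a threshold. The log path and the 5-second threshold are assumptions of the example, not values from the thread.

# Report long gaps between consecutive vdsm.log entries.
import re
from datetime import datetime

LOG_PATH = "/var/log/vdsm/vdsm.log"  # assumed default location
THRESHOLD = 5.0                      # seconds; the stall above is ~7s

# Matches timestamps like "2017-09-13 04:07:07,725" at the start of a line.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

prev = None
with open(LOG_PATH) as log:
    for line in log:
        m = TS_RE.match(line)
        if not m:
            continue  # continuation lines (e.g. tracebacks) carry no timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None and (ts - prev).total_seconds() > THRESHOLD:
            print(f"{(ts - prev).total_seconds():.1f}s gap before: {line.rstrip()}")
        prev = ts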
On Mon, Sep 18, 2017 at 6:43 PM, Bryan Sockel <Bryan.Sockel@altn.com> wrote:
Here are the logs as requested.
Bryan Sockel
Systems Administrator
Alt-N Technologies | Grapevine, TX
Office 817.601.3222 x233
-----Original Message-----
From: "piotr.kliczewski@gmail.com -- Piotr Kliczewski" <piotr.kliczewski@gmail.com>
To: Bryan Sockel <Bryan.Sockel@altn.com>
Cc: "users@ovirt.org" <users@ovirt.org>
Date: Thu, 14 Sep 2017 21:30:19 +0200
Subject: Re: [ovirt-users] Server Not Responding
Bryan,
In your engine logs I see :
2017-09-13 04:07:07,599-05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler3) [] Command 'GetAllVmStatsVDSCommand(HostName = vm-host-colo-1, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='e75d4446-9bfc-47cb-8bf8-a2e681720b66'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded
It would be great to understand what happened on the vdsm side, because the engine was still trying to connect at 2017-09-13 09:30:46,275-05.
The vdsm logs you provided start at 2017-09-13 09:01:08,895-0500 and end at 2017-09-13 09:53:24,760-0500.
Please provide vdsm logs from the time the issue occurred.
Thanks, Piotr
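A minimal sketch of one way to cut a large vdsm.log down to just the window around the incident, assuming the default log location /var/log/vdsm/vdsm.log and an illustrative window around 04:07 (both assumptions of the example):

# Print vdsm.log lines that fall inside a given time window.
import re
from datetime import datetime

LOG_PATH = "/var/log/vdsm/vdsm.log"      # assumed default location
START = datetime(2017, 9, 13, 4, 0, 0)   # illustrative window around the stall
END = datetime(2017, 9, 13, 4, 15, 0)

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")

in_window = False
with open(LOG_PATH) as log:
    for line in log:
        m = TS_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            in_window = START <= ts <= END
        if in_window:  # untimestamped continuation lines keep the last state
            print(line, end="")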
On Wed, Sep 13, 2017 at 5:09 PM, Bryan Sockel <Bryan.Sockel@altn.com> wrote:
Hi
I am having an issue where a server is frequently set to non-responsive. VMs are set to an unknown status but continue to run. The issue is isolated to a single host. My setup is currently a two-data-center configuration with two servers in each data center; the issue is occurring at my remote site.
The primary storage volumes are set up on dedicated hardware, with the arbiter running on the server that is having issues. There is also another gluster replica volume hosted on this box; its replica is on the other dedicated server.
The logs are showing:
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler8) [] Command 'GetCapabilitiesVDSCommand(HostName = vm-host-colo-1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='e75d4446-9bfc-47cb-8bf8-a2e681720b66', vds='Host[vm-host-colo-1,e75d4446-9bfc-47cb-8bf8-a2e681720b66]'})' execution failed: java.rmi.ConnectException: Connection timeout
[org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler8) [] Failure to refresh host 'vm-host-colo-1' runtime info: java.rmi.ConnectException: Connection timeout.
I have attached the vdsm.log from the server with issues and the engine.log.
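A minimal sketch of how the frequency of these failures could be tallied from engine.log, assuming the default location /var/log/ovirt-engine/engine.log and matching the two error strings quoted above (both assumptions of the example):

# Count "Heartbeat exceeded" / "Connection timeout" lines per hour in engine.log.
from collections import Counter

LOG_PATH = "/var/log/ovirt-engine/engine.log"  # assumed default location
PATTERNS = ("Heartbeat exceeded", "Connection timeout")

hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        if any(p in line for p in PATTERNS):
            hits[line[:13]] += 1  # bucket by the "YYYY-MM-DD HH" prefix

for hour, count in sorted(hits.items()):
    print(f"{hour}:00  {count} hit(s)")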
Thanks
Bryan Sockel
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users