Bryan,
I checked your logs and I see extensive activity of ioprocess at that
time which seems to cause getAllVmStats to be processed 7 seconds
later.
The call was invoked on vdsm side much later than we saw heartbeat
exceeded exception on the engine side.
@Nir please take a look
Thanks,
Piotr
2017-09-13 04:07:07,126-0500 INFO (itmap/0) [IOProcessClient]
Starting client ioprocess-10031 (__init__:330)
2017-09-13 04:07:07,176-0500 INFO (itmap/1) [IOProcessClient]
Starting client ioprocess-10032 (__init__:330)
2017-09-13 04:07:07,222-0500 INFO (ioprocess/9466) [IOProcess]
Starting ioprocess (__init__:452)
2017-09-13 04:07:07,226-0500 INFO (itmap/2) [IOProcessClient]
Starting client ioprocess-10033 (__init__:330)
2017-09-13 04:07:07,279-0500 INFO (ioprocess/9474) [IOProcess]
Starting ioprocess (__init__:452)
2017-09-13 04:07:07,284-0500 INFO (itmap/3) [IOProcessClient]
Starting client ioprocess-10034 (__init__:330)
2017-09-13 04:07:07,332-0500 INFO (ioprocess/9482) [IOProcess]
Starting ioprocess (__init__:452)
2017-09-13 04:07:07,340-0500 INFO (itmap/4) [IOProcessClient]
Starting client ioprocess-10035 (__init__:330)
2017-09-13 04:07:07,395-0500 INFO (ioprocess/9502) [IOProcess]
Starting ioprocess (__init__:452)
2017-09-13 04:07:07,401-0500 INFO (itmap/5) [IOProcessClient]
Starting client ioprocess-10036 (__init__:330)
2017-09-13 04:07:07,461-0500 INFO (ioprocess/9523) [IOProcess]
Starting ioprocess (__init__:452)
2017-09-13 04:07:07,469-0500 INFO (itmap/6) [IOProcessClient]
Starting client ioprocess-10037 (__init__:330)
2017-09-13 04:07:07,525-0500 INFO (ioprocess/9538) [IOProcess]
Starting ioprocess (__init__:452)
2017-09-13 04:07:07,532-0500 INFO (itmap/7) [IOProcessClient]
Starting client ioprocess-10038 (__init__:330)
2017-09-13 04:07:07,579-0500 INFO (ioprocess/9556) [IOProcess]
Starting ioprocess (__init__:452)
2017-09-13 04:07:07,583-0500 INFO (itmap/8) [IOProcessClient]
Starting client ioprocess-10039 (__init__:330)
2017-09-13 04:07:07,628-0500 INFO (ioprocess/9569) [IOProcess]
Starting ioprocess (__init__:452)
2017-09-13 04:07:07,642-0500 INFO (monitor/cacbac0)
[storage.StorageDomain] Resource namespace
01_img_cacbac0c-4b08-46cd-ace8-b23266c92291 already registered
(sd:727)
2017-09-13 04:07:07,642-0500 INFO (monitor/cacbac0)
[storage.StorageDomain] Resource namespace
02_vol_cacbac0c-4b08-46cd-ace8-b23266c92291 already registered
(sd:736)
2017-09-13 04:07:07,646-0500 INFO (monitor/e371d38)
[storage.StorageDomain] Resource namespace
01_img_e371d380-7194-4950-b901-5f2aed5dfb35 already registered
(sd:727)
2017-09-13 04:07:07,646-0500 INFO (monitor/e371d38)
[storage.StorageDomain] Resource namespace
02_vol_e371d380-7194-4950-b901-5f2aed5dfb35 already registered
(sd:736)
2017-09-13 04:07:07,655-0500 INFO (monitor/ea15319)
[storage.StorageDomain] Resource namespace
01_img_ea153191-ac25-4984-aaec-4a59cbf383a5 already registered
(sd:727)
2017-09-13 04:07:07,655-0500 INFO (monitor/ea15319)
[storage.StorageDomain] Resource namespace
02_vol_ea153191-ac25-4984-aaec-4a59cbf383a5 already registered
(sd:736)
2017-09-13 04:07:07,661-0500 INFO (monitor/58a7c5d)
[storage.StorageDomain] Resource namespace
01_img_58a7c5dd-0b31-4066-ae05-8f541614dfde already registered
(sd:727)
2017-09-13 04:07:07,661-0500 INFO (monitor/58a7c5d)
[storage.StorageDomain] Resource namespace
02_vol_58a7c5dd-0b31-4066-ae05-8f541614dfde already registered
(sd:736)
2017-09-13 04:07:07,665-0500 INFO (ioprocess/9582) [IOProcess]
Starting ioprocess (__init__:452)
2017-09-13 04:07:07,668-0500 INFO (monitor/f9cab69)
[storage.StorageDomain] Resource namespace
01_img_f9cab69e-ae3d-4b6d-8a02-389e4b1d8764 already registered
(sd:727)
2017-09-13 04:07:07,668-0500 INFO (monitor/f9cab69)
[storage.StorageDomain] Resource namespace
02_vol_f9cab69e-ae3d-4b6d-8a02-389e4b1d8764 already registered
(sd:736)
2017-09-13 04:07:07,669-0500 INFO (monitor/f927ceb)
[storage.StorageDomain] Resource namespace
01_img_f927ceb8-91d2-41bd-ba42-dc795395b6d0 already registered
(sd:727)
2017-09-13 04:07:07,669-0500 INFO (monitor/f927ceb)
[storage.StorageDomain] Resource namespace
02_vol_f927ceb8-91d2-41bd-ba42-dc795395b6d0 already registered
(sd:736)
2017-09-13 04:07:07,688-0500 INFO (monitor/031c06e)
[storage.StorageDomain] Resource namespace
01_img_031c06e4-9ba6-4572-8800-39da4a5420f0 already registered
(sd:727)
2017-09-13 04:07:07,688-0500 INFO (monitor/031c06e)
[storage.StorageDomain] Resource namespace
02_vol_031c06e4-9ba6-4572-8800-39da4a5420f0 already registered
(sd:736)
2017-09-13 04:07:07,725-0500 INFO (monitor/804be30)
[storage.StorageDomain] Resource namespace
01_img_804be302-3763-403b-96db-febebcbfd778 already registered
(sd:727)
2017-09-13 04:07:07,725-0500 INFO (monitor/804be30)
[storage.StorageDomain] Resource namespace
02_vol_804be302-3763-403b-96db-febebcbfd778 already registered
(sd:736)
2017-09-13 04:07:14,688-0500 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer]
RPC call Host.getAllVmStats succeeded in 0.02 seconds (__init__:539)
On Mon, Sep 18, 2017 at 6:43 PM, Bryan Sockel <Bryan.Sockel(a)altn.com> wrote:
Here are the logs as requested.
Bryan Sockel
Systems Administrator
Alt-N Technologies | Grapevine, TX
Office 817.601.3222 x233
Sent using Alt-N's own MDaemon Messaging Server
Get to know the Alt-N family by liking us on Facebook!
-----Original Message-----
From: "piotr.kliczewski(a)gmail.com -- Piotr Kliczewski"
<piotr.kliczewski(a)gmail.com>
To: Bryan Sockel <Bryan.Sockel(a)altn.com>
Cc: "users(a)ovirt.org" <users(a)ovirt.org>
Date: Thu, 14 Sep 2017 21:30:19 +0200
Subject: Re: [ovirt-users] Server Not Responding
Bryan,
In your engine logs I see :
2017-09-13 04:07:07,599-05 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand]
(DefaultQuartzScheduler3) [] Command 'GetAllVmStatsVDSCommand(HostName
= vm-host-colo-1, VdsIdVDSCommandParametersBase:{runAsync='true',
hostId='e75d4446-9bfc-47cb-8bf8-a2e681720b66'})' execution failed:
VDSGenericzException: VDSNetworkException: Heartbeat exceeded
It would br great to understand what happened on vdsm side because the
engine was still trying to connect at: 2017-09-13 09:30:46,275-05
In vdsm logs you provided I see that they start at 2017-09-13
09:01:08,895-0500 and end at 2017-09-13 09:53:24,760-0500.
Please provide vdsm logs from the time the issue occurred.
Thanks,
Piotr
On Wed, Sep 13, 2017 at 5:09 PM, Bryan Sockel <Bryan.Sockel(a)altn.com> wrote:
>
> Hi
>
> Having an issue where i frequently have a server that is set to not
> responsive. VM's are set to unknown status, but still continue to run.
> This issue is isolated to just a single host. My Setup is currently a 2
> Data Center Configuration with 2 servers in each data center. Issue is
> occurring at my remote site.
>
> The primary storage volumes are setup on dedicated hardware, with the
> arbiter running on the server that is having issues. There is also
> another
> gluster replica volume hosted on this box, the replica is the other
> dedicated server.
>
> The logs are showing:
>
> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand]
> (DefaultQuartzScheduler8) [] Command 'GetCapabilitiesVDSCommand(HostName =
> vm-host-colo-1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true',
> hostId='e75d4446-9bfc-47cb-8bf8-a2e681720b66',
> vds='Host[vm-host-colo-1,e75d4446-9bfc-47cb-8bf8-a2e681720b66]'})'
> execution
> failed: java.rmi.ConnectException: Connection timeout
>
> [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring]
> (DefaultQuartzScheduler8) [] Failure to refresh host 'vm-host-colo-1'
> runtime info: java.rmi.ConnectException: Connection timeout.
>
>
> I have attached the vdsm.log from the server with issues and the
> engine.log.
>
> Thanks
>
> Bryan Sockel
>
>
> _______________________________________________
> Users mailing list
> Users(a)ovirt.org
>
http://lists.ovirt.org/mailman/listinfo/users
>