Received the "heartbeat exeeded" continuously this morning (seems to be quiet
again for now).
VM's still continue to work correctly and the storage domains (NFS shares) are still
connected and reachable on the nodes, at the exact time that this issue is happening.
Contacted our network engineer to see if he could see a load increase on our network, or
could see any latency, errors, etc.
Unfortunately he could not detect anything yet (he is still investigating this).
I have attached both the engine and vdsm logs
Engine.log:
2015-03-16 10:10:10,506 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand]
(DefaultQuartzScheduler_Worker-45) [6d40f562] Command ListVDSCommand(HostName =
<HOST>, HostId = 3b87597e-081b-4c89-9b1e-cb04203259f5,
vds=Host[<HOST>,3b87597e-081b-4c89-9b1e-cb04203259f5]) execution failed. Exception:
VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2015-03-16 10:10:10,507 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
(DefaultQuartzScheduler_Worker-35) [2c53103c] Command SpmStatusVDSCommand(HostName =
<HOST>, HostId = 3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId =
124ae76f-8acb-412e-91cc-dff9f6ec665d) execution failed. Exception: VDSNetworkException:
VDSGenericException: VDSNetworkException: Heartbeat exeeded
2015-03-16 10:10:10,506 WARN [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker]
(ResponseWorker) Exception thrown during message processing
2015-03-16 10:10:10,507 WARN [org.ovirt.engine.core.vdsbroker.VdsManager]
(DefaultQuartzScheduler_Worker-45) [6d40f562] Host <HOST> is not responding. It will
stay in Connecting state for a grace period of 88 seconds and after that an attempt to
fence the host will be issued.
2015-03-16 10:10:10,510 INFO
[org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand]
(DefaultQuartzScheduler_Worker-35) [7e61eee] Running command: SetStoragePoolStatusCommand
internal: true. Entities affected : ID: 124ae76f-8acb-412e-91cc-dff9f6ec665d Type:
StoragePool
2015-03-16 10:10:10,512 INFO
[org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper]
(DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool
124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain
bfa86142-6f2e-44fe-8a9c-cf4390f3b8ae status from Active to Unknown, reason : null
2015-03-16 10:10:10,513 INFO
[org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper]
(DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool
124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain
178a38d9-245c-43d3-bff9-6f3a5983bf03 status from Active to Unknown, reason : null
2015-03-16 10:10:10,514 INFO
[org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper]
(DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool
124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain
3b0b4f26-bec9-4730-a8ba-40965a228932 status from Active to Unknown, reason : null
2015-03-16 10:10:10,526 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler_Worker-45) [6d40f562] Correlation ID: null, Call Stack: null,
Custom Event ID: -1, Message: Host <HOST> is not responding. It will stay in
Connecting state for a grace period of 88 seconds and after that an attempt to fence the
host will be issued.
2015-03-16 10:10:10,527 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
(DefaultQuartzScheduler_Worker-45) [6d40f562] Failure to refresh Vds runtime info:
org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException:
VDSNetworkException: Heartbeat exeeded
at
org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand.executeVdsBrokerCommand(ListVDSCommand.java:24)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:96)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56)
[vdsbroker.jar:]
at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31)
[dal.jar:]
at
org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:418)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.fetchRunningVms(VdsUpdateRunTimeInfo.java:991)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVmStats(VdsUpdateRunTimeInfo.java:940)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:658)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494)
[vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236)
[vdsbroker.jar:]
at sun.reflect.GeneratedMethodAccessor70.invoke(Unknown Source) [:1.7.0_75]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[rt.jar:1.7.0_75]
at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75]
at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60)
[scheduler.jar:]
at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
[quartz.jar:]
2015-03-16 10:10:10,544 WARN [org.ovirt.engine.core.vdsbroker.VdsManager]
(DefaultQuartzScheduler_Worker-45) [6d40f562] Failed to refresh VDS , vds =
3b87597e-081b-4c89-9b1e-cb04203259f5 : <HOST>, VDS Network Error, continuing.
VDSGenericException: VDSNetworkException: Heartbeat exeeded
2015-03-16 10:10:10,547 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler_Worker-35) [7e61eee] Correlation ID: 7e61eee, Call Stack: null,
Custom Event ID: -1, Message: Invalid status on Data Center <DC>. Setting Data
Center status to Non Responsive (On host <HOST>, Error: Network error during
communication with the Host.).
2015-03-16 10:10:10,566 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL
Stomp Reactor) Connecting to /<IP>
2015-03-16 10:10:15,804 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand]
(DefaultQuartzScheduler_Worker-44) [55f2c760] START, GetHardwareInfoVDSCommand(HostName =
<HOST>, HostId = 3b87597e-081b-4c89-9b1e-cb04203259f5,
vds=Host[<HOST>,3b87597e-081b-4c89-9b1e-cb04203259f5]), log id: 69516bf
2015-03-16 10:10:15,823 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand]
(DefaultQuartzScheduler_Worker-44) [55f2c760] FINISH, GetHardwareInfoVDSCommand, log id:
69516bf
2015-03-16 10:10:15,866 INFO
[org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand]
(DefaultQuartzScheduler_Worker-44) [16f754a9] Running command:
HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected : ID:
3b87597e-081b-4c89-9b1e-cb04203259f5 Type: VDS
2015-03-16 10:10:16,924 INFO [org.ovirt.engine.core.bll.InitVdsOnUpCommand]
(DefaultQuartzScheduler_Worker-44) [6f19cd0c] Running command: InitVdsOnUpCommand
internal: true. Entities affected : ID: 124ae76f-8acb-412e-91cc-dff9f6ec665d Type:
StoragePool
2015-03-16 10:10:16,931 INFO
[org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand]
(DefaultQuartzScheduler_Worker-44) [64352136] Running command:
ConnectHostToStoragePoolServersCommand internal: true. Entities affected : ID:
124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool
2015-03-16 10:10:17,052 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand]
(DefaultQuartzScheduler_Worker-44) [64352136] START,
ConnectStorageServerVDSCommand(HostName = <HOST>, HostId =
3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId =
124ae76f-8acb-412e-91cc-dff9f6ec665d, storageType = NFS, connectionList = [{ id:
03ea1ab7-e96c-410b-911e-905e988b0dc7, connection: <IP>:/export, iqn: null, vfsType:
null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id:
65744a96-5f4c-4d5f-898b-932eaf97084c, connection: <IP>:/iso, iqn: null, vfsType:
null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id:
6ca291fc-0a20-4047-9aac-9d166a4c5300, connection: <IP>:/mnt/storage, iqn: null,
vfsType: null, mountOptions: null, nfsVersion: AUTO, nfsRetrans: null, nfsTimeo: null
};]), log id: 5369ca8f
2015-03-16 10:10:17,113 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand]
(DefaultQuartzScheduler_Worker-44) [64352136] FINISH, ConnectStorageServerVDSCommand,
return: {6ca291fc-0a20-4047-9aac-9d166a4c5300=0, 65744a96-5f4c-4d5f-898b-932eaf97084c=0,
03ea1ab7-e96c-410b-911e-905e988b0dc7=0}, log id: 5369ca8f
Corresponding vdsm.log: (these are the only lines around the same timeframe):
Thread-52::DEBUG::2015-03-16
10:10:10,977::task::595::Storage.TaskManager.Task::(_updateState)
Task=`89a0021d-9d5a-4563-ad44-d320aacbc551`::moving from state init -> state
preparing
JsonRpc (StompReactor)::DEBUG::2015-03-16
10:10:10,982::stompReactor::98::Broker.StompAdapter::(handle_frame) Handling message
<StompFrame command='SEND'>
Thread-52::INFO::2015-03-16 10:10:10,983::logUtils::44::dispatcher::(wrapper) Run and
protect: getVolumeSize(sdUUID=u'178a38d9-245c-43d3-bff9-6f3a5983bf03',
spUUID=u'124ae76f-8acb-412e-91cc-dff9f6ec665d',
imgUUID=u'fb58d38b-9965-40f3-af45-915a4968a3aa',
volUUID=u'0c28ab0e-b1a0-42b6-8eac-71de1faa6827', options=None)
Thread-27::DEBUG::2015-03-16
10:10:10,985::fileSD::261::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd
if=/rhev/data-center/mnt/<IP>:_mnt_storage/178a38d9-245c-43d3-bff9-6f3a5983bf03/dom_md/metadata
iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
-----Oorspronkelijk bericht-----
Van: users-bounces(a)ovirt.org [mailto:users-bounces@ovirt.org] Namens Piotr Kliczewski
Verzonden: 16 March 2015 08:39
Aan: Michal Skrivanek
CC: users(a)ovirt.org
Onderwerp: Re: [ovirt-users] Communication errors between engine and nodes?
Can you please provide logs from both ends?
On Fri, Mar 13, 2015 at 3:17 PM, Michal Skrivanek <michal.skrivanek(a)redhat.com>
wrote:
On 13 Mar 2015, at 14:39, Chris Adams wrote:
> Once upon a time, Roel de Rooy <RdeRooy(a)motto.nl> said:
>> We are observing the same thing with our oVirt environment.
>> At random moments (could be a couple of times a day , once a day or even once
every couple of days), we receive the "VDSNetworkException" message on one of
our nodes.
>> Haven't seen the "heartbeat exceeded" message, but could be that I
overlooked it within our logs.
>> At some rare occasions, we also do see "Host cannot access the Storage
Domain(s) <UNKNOWN> attached to the Data Center", within the GUI.
>>
>> VM's will continue to run normally and most of the times the nodes will be in
"UP" state again within the same minute.
>>
>> Will still haven't found the root cause of this issue.
>> Our engine is CentOS 6.6 based and it's happing with both Centos 6 and Fedora
20 nodes.
>> We are using a LCAP bond of 1Gbit ports for our management network.
>>
>> As we didn't see any reports about this before, we are currently looking if
something network related is causing this.
>
> I just opened a BZ on it (since it isn't just me):
>
>
https://bugzilla.redhat.com/show_bug.cgi?id=1201779
>
> My cluster went a couple of days without hitting this (as soon as I
> posted to the list of course), but then it happened several times
> overnight. Interestingly, one error logged was communicating with
> the node currently running my hosted engine. That should rule out
> external network (e.g. switch and such) issues, as those packets
> should not have left the physical box.
well, hosted engine complicates things as you'd need to be able to see
the status of the engine guest running a standalone engine installation or at least
running that hosted engine on a single node without any other VM may help….
Thanks,
michal
>
> --
> Chris Adams <cma(a)cmadams.net>
> _______________________________________________
> Users mailing list
> Users(a)ovirt.org
>
http://lists.ovirt.org/mailman/listinfo/users
_______________________________________________
Users mailing list
Users(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
_______________________________________________
Users mailing list
Users(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/users