Hello!

FYI:
Two days ago I updated 3 hypervisors in my setup to the latest 3.5-patternfly, rebooted the nodes and the engine, and the error seems to be gone: I no longer get "heartbeat exeeded".

On Tue, Mar 17, 2015 at 11:58 AM, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
Hi Roel,

You can change this setting in two ways:
- Update it in the db directly, as you stated (not recommended).
- Use engine-config -s vdsHeartbeatInSeconds=20. Before running this
  command you need to add vdsHeartbeatInSeconds.type=Integer to the
  config file /etc/ovirt-engine/engine-config/engine-config.properties,
  because this config value is not exposed by default.
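
The preparation step for the second option could be scripted roughly as follows. This is only a sketch under assumptions: the helper name ensure_property is made up for illustration (it is not part of oVirt), and it assumes the properties file consists of plain key=value lines. Only the file path and the type line come from the message above.

```python
from pathlib import Path

# Path and entry quoted in the message above.
CONFIG_FILE = "/etc/ovirt-engine/engine-config/engine-config.properties"
TYPE_LINE = "vdsHeartbeatInSeconds.type=Integer"


def ensure_property(path, line):
    """Append `line` to the properties file unless it is already there.

    Returns True if the file was modified.
    Hypothetical helper for illustration, not an oVirt API.
    """
    p = Path(path)
    text = p.read_text() if p.exists() else ""
    existing = [entry.strip() for entry in text.splitlines()]
    if line in existing:
        return False
    # Make sure the new entry starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    p.write_text(text + line + "\n")
    return True


# Intended use, run as root on the engine machine:
#   ensure_property(CONFIG_FILE, TYPE_LINE)
```

Once the type line is present, engine-config -s vdsHeartbeatInSeconds=20 can be run as described. Back up the properties file before editing it.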

Thanks,
Piotr

On Mon, Mar 16, 2015 at 11:18 PM, Roel de Rooy <RdeRooy@motto.nl> wrote:
> Hi Piotr,
>
> Thanks for your reply!
>
> If I would like to change the heartbeat value, do I have to update the value in the vdc_options table directly, or should this be done in another way (e.g. a config file)?
>
> Regards,
> Roel
>
> -----Original message-----
> From: Piotr Kliczewski [mailto:piotr.kliczewski@gmail.com]
> Sent: Monday, 16 March 2015 12:16
> To: Roel de Rooy
> CC: Michal Skrivanek; users@ovirt.org
> Subject: Re: [ovirt-users] Communication errors between engine and nodes?
>
> Unfortunately, the log entries that you copied give me almost no information about the nature of your issue.
> There are a few things that we can do to understand what is going on with your setup.
>
> The heartbeat functionality provides a means to detect whether we still have a connection with a host. By default the heartbeat timeout is set to 10 seconds, but it can be modified by setting vdsHeartbeatInSeconds.
>
> In general, whenever there are no incoming responses and no heartbeat frame is received, the engine will invalidate the connection and attempt to recover. If the reconnection is successful, you won't see any other consequences of losing a single heartbeat. I would explore the stability of your network: if the network is busy or you lose network packets from time to time, this kind of entry in the log is expected. You can increase the heartbeat value and see whether it works better for your environment.
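
As a rough illustration of the timeout logic described above, and not the engine's actual code, a connection can be treated as lost once neither a response nor a heartbeat frame has arrived within the configured interval. The class and method names below are invented for this sketch; only the 10 second default comes from the message.

```python
import time


class HeartbeatMonitor:
    """Tracks when the last frame (response or heartbeat) arrived from a host.

    Illustrative only; names are hypothetical and not taken from oVirt.
    """

    def __init__(self, timeout_seconds=10, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_seen = clock()

    def on_frame(self):
        """Call for every incoming response or heartbeat frame."""
        self._last_seen = self._clock()

    def exceeded(self):
        """True when nothing was received for longer than the timeout."""
        return self._clock() - self._last_seen > self.timeout_seconds
```

A caller would poll exceeded() periodically and, when it returns True, drop the connection and try to reconnect. Raising timeout_seconds (for example to 20) makes the check tolerate longer gaps on a busy or lossy network, at the cost of noticing a dead host later.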
>
> If you confirm that your network is stable, we could explore the issue further by setting debug level logging for your engine to understand exactly how the messages are processed by a host and when we receive responses.
>
>
>
> On Mon, Mar 16, 2015 at 11:34 AM, Roel de Rooy <RdeRooy@motto.nl> wrote:
>> Received the "heartbeat exeeded" error continuously this morning (seems to be quiet again for now).
>> VMs still continue to work correctly and the storage domains (NFS shares) are still connected and reachable on the nodes at the exact time that this issue is happening.
>>
>> Contacted our network engineer to see if he could see a load increase on our network, or could see any latency, errors, etc.
>> Unfortunately he could not detect anything yet (he is still investigating this).
>>
>>
>> I have attached both the engine and vdsm logs
>>
>> Engine.log:
>>
>> 2015-03-16 10:10:10,506 ERROR
>> [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand]
>> (DefaultQuartzScheduler_Worker-45) [6d40f562] Command
>> ListVDSCommand(HostName = <HOST>, HostId =
>> 3b87597e-081b-4c89-9b1e-cb04203259f5,
>> vds=Host[<HOST>,3b87597e-081b-4c89-9b1e-cb04203259f5]) execution
>> failed. Exception: VDSNetworkException: VDSGenericException:
>> VDSNetworkException: Heartbeat exeeded
>> 2015-03-16 10:10:10,507 ERROR
>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
>> (DefaultQuartzScheduler_Worker-35) [2c53103c] Command
>> SpmStatusVDSCommand(HostName = <HOST>, HostId =
>> 3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId =
>> 124ae76f-8acb-412e-91cc-dff9f6ec665d) execution failed. Exception:
>> VDSNetworkException: VDSGenericException: VDSNetworkException:
>> Heartbeat exeeded
>> 2015-03-16 10:10:10,506 WARN
>> [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker]
>> (ResponseWorker) Exception thrown during message processing
>> 2015-03-16 10:10:10,507 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-45) [6d40f562] Host <HOST> is not responding. It will stay in Connecting state for a grace period of 88 seconds and after that an attempt to fence the host will be issued.
>> 2015-03-16 10:10:10,510 INFO
>> [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand]
>> (DefaultQuartzScheduler_Worker-35) [7e61eee] Running command:
>> SetStoragePoolStatusCommand internal: true. Entities affected :  ID:
>> 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool
>> 2015-03-16 10:10:10,512 INFO
>> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper]
>> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool
>> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain
>> bfa86142-6f2e-44fe-8a9c-cf4390f3b8ae status from Active to Unknown,
>> reason : null
>> 2015-03-16 10:10:10,513 INFO
>> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper]
>> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool
>> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain
>> 178a38d9-245c-43d3-bff9-6f3a5983bf03 status from Active to Unknown,
>> reason : null
>> 2015-03-16 10:10:10,514 INFO
>> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper]
>> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool
>> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain
>> 3b0b4f26-bec9-4730-a8ba-40965a228932 status from Active to Unknown,
>> reason : null
>> 2015-03-16 10:10:10,526 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-45) [6d40f562] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host <HOST> is not responding. It will stay in Connecting state for a grace period of 88 seconds and after that an attempt to fence the host will be issued.
>> 2015-03-16 10:10:10,527 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-45) [6d40f562] Failure to refresh Vds runtime info: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
>>         at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183) [vdsbroker.jar:]
>>         at org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand.executeVdsBrokerCommand(ListVDSCommand.java:24) [vdsbroker.jar:]
>>         at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:96) [vdsbroker.jar:]
>>         at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) [vdsbroker.jar:]
>>         at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
>>         at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:418) [vdsbroker.jar:]
>>         at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.fetchRunningVms(VdsUpdateRunTimeInfo.java:991) [vdsbroker.jar:]
>>         at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVmStats(VdsUpdateRunTimeInfo.java:940) [vdsbroker.jar:]
>>         at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:658) [vdsbroker.jar:]
>>         at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) [vdsbroker.jar:]
>>         at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) [vdsbroker.jar:]
>>         at sun.reflect.GeneratedMethodAccessor70.invoke(Unknown Source) [:1.7.0_75]
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_75]
>>         at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75]
>>         at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
>>         at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
>>         at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
>>
>> 2015-03-16 10:10:10,544 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-45) [6d40f562] Failed to refresh VDS , vds = 3b87597e-081b-4c89-9b1e-cb04203259f5 : <HOST>, VDS Network Error, continuing.
>> VDSGenericException: VDSNetworkException: Heartbeat exeeded
>> 2015-03-16 10:10:10,547 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-35) [7e61eee] Correlation ID: 7e61eee, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center <DC>. Setting Data Center status to Non Responsive (On host <HOST>, Error: Network error during communication with the Host.).
>> 2015-03-16 10:10:10,566 INFO
>> [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp
>> Reactor) Connecting to /<IP>
>> 2015-03-16 10:10:15,804 INFO
>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand]
>> (DefaultQuartzScheduler_Worker-44) [55f2c760] START,
>> GetHardwareInfoVDSCommand(HostName = <HOST>, HostId =
>> 3b87597e-081b-4c89-9b1e-cb04203259f5,
>> vds=Host[<HOST>,3b87597e-081b-4c89-9b1e-cb04203259f5]), log id:
>> 69516bf
>> 2015-03-16 10:10:15,823 INFO
>> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand]
>> (DefaultQuartzScheduler_Worker-44) [55f2c760] FINISH,
>> GetHardwareInfoVDSCommand, log id: 69516bf
>> 2015-03-16 10:10:15,866 INFO
>> [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand]
>> (DefaultQuartzScheduler_Worker-44) [16f754a9] Running command:
>> HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities
>> affected :  ID: 3b87597e-081b-4c89-9b1e-cb04203259f5 Type: VDS
>> 2015-03-16 10:10:16,924 INFO
>> [org.ovirt.engine.core.bll.InitVdsOnUpCommand]
>> (DefaultQuartzScheduler_Worker-44) [6f19cd0c] Running command:
>> InitVdsOnUpCommand internal: true. Entities affected :  ID:
>> 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool
>> 2015-03-16 10:10:16,931 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (DefaultQuartzScheduler_Worker-44) [64352136] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool
>> 2015-03-16 10:10:17,052 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (DefaultQuartzScheduler_Worker-44) [64352136] START, ConnectStorageServerVDSCommand(HostName = <HOST>, HostId = 3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId = 124ae76f-8acb-412e-91cc-dff9f6ec665d, storageType = NFS, connectionList = [{ id: 03ea1ab7-e96c-410b-911e-905e988b0dc7, connection: <IP>:/export, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 65744a96-5f4c-4d5f-898b-932eaf97084c, connection: <IP>:/iso, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 6ca291fc-0a20-4047-9aac-9d166a4c5300, connection: <IP>:/mnt/storage, iqn: null, vfsType: null, mountOptions: null, nfsVersion: AUTO, nfsRetrans: null, nfsTimeo: null };]), log id: 5369ca8f
>> 2015-03-16 10:10:17,113 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (DefaultQuartzScheduler_Worker-44) [64352136] FINISH, ConnectStorageServerVDSCommand, return: {6ca291fc-0a20-4047-9aac-9d166a4c5300=0, 65744a96-5f4c-4d5f-898b-932eaf97084c=0, 03ea1ab7-e96c-410b-911e-905e988b0dc7=0}, log id: 5369ca8f
>>
>>
>>
>> Corresponding vdsm.log (these are the only lines around the same timeframe):
>>
>> Thread-52::DEBUG::2015-03-16 10:10:10,977::task::595::Storage.TaskManager.Task::(_updateState) Task=`89a0021d-9d5a-4563-ad44-d320aacbc551`::moving from state init -> state preparing
>> JsonRpc (StompReactor)::DEBUG::2015-03-16 10:10:10,982::stompReactor::98::Broker.StompAdapter::(handle_frame) Handling message <StompFrame command='SEND'>
>> Thread-52::INFO::2015-03-16 10:10:10,983::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID=u'178a38d9-245c-43d3-bff9-6f3a5983bf03', spUUID=u'124ae76f-8acb-412e-91cc-dff9f6ec665d', imgUUID=u'fb58d38b-9965-40f3-af45-915a4968a3aa', volUUID=u'0c28ab0e-b1a0-42b6-8eac-71de1faa6827', options=None)
>> Thread-27::DEBUG::2015-03-16 10:10:10,985::fileSD::261::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd if=/rhev/data-center/mnt/<IP>:_mnt_storage/178a38d9-245c-43d3-bff9-6f3a5983bf03/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 (cwd None)
>>
>>
>> -----Original message-----
>> From: users-bounces@ovirt.org [mailto:users-bounces@ovirt.org] On behalf of
>> Piotr Kliczewski
>> Sent: 16 March 2015 08:39
>> To: Michal Skrivanek
>> CC: users@ovirt.org
>> Subject: Re: [ovirt-users] Communication errors between engine and nodes?
>>
>> Can you please provide logs from both ends?
>>
>> On Fri, Mar 13, 2015 at 3:17 PM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
>>>
>>> On 13 Mar 2015, at 14:39, Chris Adams wrote:
>>>
>>>> Once upon a time, Roel de Rooy <RdeRooy@motto.nl> said:
>>>>> We are observing the same thing with our oVirt environment.
>>>>> At random moments (could be a couple of times a day, once a day or even once every couple of days), we receive the "VDSNetworkException" message on one of our nodes.
>>>>> Haven't seen the "heartbeat exceeded" message, but it could be that I overlooked it in our logs.
>>>>> On some rare occasions, we also see "Host cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center" in the GUI.
>>>>>
>>>>> VMs will continue to run normally and most of the time the nodes will be in "UP" state again within the same minute.
>>>>>
>>>>> We still haven't found the root cause of this issue.
>>>>> Our engine is CentOS 6.6 based and it's happening with both CentOS 6 and Fedora 20 nodes.
>>>>> We are using an LACP bond of 1Gbit ports for our management network.
>>>>>
>>>>> As we didn't see any reports about this before, we are currently checking whether something network related is causing this.
>>>>
>>>> I just opened a BZ on it (since it isn't just me):
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1201779
>>>>
>>>> My cluster went a couple of days without hitting this (as soon as I
>>>> posted to the list of course), but then it happened several times
>>>> overnight.  Interestingly, one error logged was communicating with
>>>> the node currently running my hosted engine.  That should rule out
>>>> external network (e.g. switch and such) issues, as those packets
>>>> should not have left the physical box.
>>>
>>> Well, hosted engine complicates things, as you'd need to be able to
>>> see the status of the engine guest. Running a standalone engine installation, or at least running that hosted engine on a single node without any other VM, may help….
>>>
>>> Thanks,
>>> michal
>>>
>>>>
>>>> --
>>>> Chris Adams <cma@cmadams.net>
>>>> _______________________________________________
>>>> Users mailing list
>>>> Users@ovirt.org
>>>> http://lists.ovirt.org/mailman/listinfo/users