[ovirt-users] Communication errors between engine and nodes?

Gabi C gabicr at gmail.com
Thu Mar 19 07:39:57 UTC 2015


Hello!

FYI:
 updated 2 days ago 3 hypervisor from my setup to latest 3.5-patternfly ,
rebooted nodes, and engine and the error seems to be gone: no longer got
"heartbeat exeeded".

On Tue, Mar 17, 2015 at 11:58 AM, Piotr Kliczewski <
piotr.kliczewski at gmail.com> wrote:

> Hi Roel,
>
> You can change this setting in two ways.
> - you can update it in db directly as you stated (not recommended)
> - use engine-config -s vdsHeartbeatInSeconds=20 but prior to running
> this command
>   you need to update config file
> /etc/ovirt-engine/engine-config/engine-config.properties
>   with vdsHeartbeatInSeconds.type=Integer. This config value is not
> exposed by default.
>
> Thanks,
> Piotr
>
> On Mon, Mar 16, 2015 at 11:18 PM, Roel de Rooy <RdeRooy at motto.nl> wrote:
> > HI Piotr,
> >
> > Thanks for your reply!
> >
> > If I would like to change the heartbeat value, do I have to update the
> value within the vdc_options table directly, or should this be done by
> another way (e.g. config file)?
> >
> > Regards,
> > Roel
> >
> > -----Oorspronkelijk bericht-----
> > Van: Piotr Kliczewski [mailto:piotr.kliczewski at gmail.com]
> > Verzonden: maandag 16 maart 2015 12:16
> > Aan: Roel de Rooy
> > CC: Michal Skrivanek; users at ovirt.org
> > Onderwerp: Re: [ovirt-users] Communication errors between engine and
> nodes?
> >
> > Unfortunately log entries that you copied give me almost no information
> about nature of your issue.
> > There are few things that we can do to understand what is going on with
> your setup.
> >
> > Heartbeat functionality provides means to detect whether we still have
> connection with a host. By default heartbeat timeout is set to 10 seconds
> but it can be modified by setting vdsHeartbeatInSeconds.
> >
> > In general whenever there are no incoming responses nor heartbeat frame
> is not received engine will invalidate the connection and will attempt to
> recover. If reconnection was successful you want see any other consequences
> of loosing single heartbeat. I would explore stability of your network so
> if the network is busy or you loose network packets from time to time this
> kind of entries in the log are expected. You can increase heatbeat value
> and see whether it will work better for your env.
> >
> > If you confirm that your network is stable we could explore the issue
> further by setting debug level logging for your engine to understand
> exactly how the messages are processes by a host and when we receive
> responses.
> >
> >
> >
> > On Mon, Mar 16, 2015 at 11:34 AM, Roel de Rooy <RdeRooy at motto.nl> wrote:
> >> Received the "heartbeat exeeded" continuously this morning (seems to be
> quiet again for now).
> >> VM's still continue to work correctly and the storage domains (NFS
> shares) are still connected and reachable on the nodes, at the exact time
> that this issue is happening.
> >>
> >> Contacted our network engineer to see if he could see a load increase
> on our network, or could see any latency, errors, etc.
> >> Unfortunately he could not detect anything yet (he is still
> investigating this).
> >>
> >>
> >> I have attached both the engine and vdsm logs
> >>
> >> Engine.log:
> >>
> >> 2015-03-16 10:10:10,506 ERROR
> >> [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand]
> >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Command
> >> ListVDSCommand(HostName = <HOST>, HostId =
> >> 3b87597e-081b-4c89-9b1e-cb04203259f5,
> >> vds=Host[<HOST>,3b87597e-081b-4c89-9b1e-cb04203259f5]) execution
> >> failed. Exception: VDSNetworkException: VDSGenericException:
> >> VDSNetworkException: Heartbeat exeeded
> >> 2015-03-16 10:10:10,507 ERROR
> >> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand]
> >> (DefaultQuartzScheduler_Worker-35) [2c53103c] Command
> >> SpmStatusVDSCommand(HostName = <HOST>, HostId =
> >> 3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId =
> >> 124ae76f-8acb-412e-91cc-dff9f6ec665d) execution failed. Exception:
> >> VDSNetworkException: VDSGenericException: VDSNetworkException:
> >> Heartbeat exeeded
> >> 2015-03-16 10:10:10,506 WARN
> >> [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker]
> >> (ResponseWorker) Exception thrown during message processing
> >> 2015-03-16 10:10:10,507 WARN
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-45) [6d40f562] Host <HOST> is not
> responding. It will stay in Connecting state for a grace period of 88
> seconds and after that an attempt to fence the host will be issued.
> >> 2015-03-16 10:10:10,510 INFO
> >> [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand]
> >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Running command:
> >> SetStoragePoolStatusCommand internal: true. Entities affected :  ID:
> >> 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool
> >> 2015-03-16 10:10:10,512 INFO
> >> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper]
> >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool
> >> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain
> >> bfa86142-6f2e-44fe-8a9c-cf4390f3b8ae status from Active to Unknown,
> >> reason : null
> >> 2015-03-16 10:10:10,513 INFO
> >> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper]
> >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool
> >> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain
> >> 178a38d9-245c-43d3-bff9-6f3a5983bf03 status from Active to Unknown,
> >> reason : null
> >> 2015-03-16 10:10:10,514 INFO
> >> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper]
> >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool
> >> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain
> >> 3b0b4f26-bec9-4730-a8ba-40965a228932 status from Active to Unknown,
> >> reason : null
> >> 2015-03-16 10:10:10,526 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-45) [6d40f562] Correlation ID: null, Call
> Stack: null, Custom Event ID: -1, Message: Host <HOST> is not responding.
> It will stay in Connecting state for a grace period of 88 seconds and after
> that an attempt to fence the host will be issued.
> >> 2015-03-16 10:10:10,527 ERROR
> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
> (DefaultQuartzScheduler_Worker-45) [6d40f562] Failure to refresh Vds
> runtime info:
> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
> VDSGenericException: VDSNetworkException: Heartbeat exeeded
> >>         at
> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183)
> [vdsbroker.jar:]
> >>         at
> org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand.executeVdsBrokerCommand(ListVDSCommand.java:24)
> [vdsbroker.jar:]
> >>         at
> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:96)
> [vdsbroker.jar:]
> >>         at
> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56)
> [vdsbroker.jar:]
> >>         at
> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31)
> [dal.jar:]
> >>         at
> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:418)
> [vdsbroker.jar:]
> >>         at
> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.fetchRunningVms(VdsUpdateRunTimeInfo.java:991)
> [vdsbroker.jar:]
> >>         at
> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVmStats(VdsUpdateRunTimeInfo.java:940)
> [vdsbroker.jar:]
> >>         at
> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:658)
> [vdsbroker.jar:]
> >>         at
> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494)
> [vdsbroker.jar:]
> >>         at
> org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236)
> [vdsbroker.jar:]
> >>         at sun.reflect.GeneratedMethodAccessor70.invoke(Unknown Source)
> [:1.7.0_75]
> >>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [rt.jar:1.7.0_75]
> >>         at java.lang.reflect.Method.invoke(Method.java:606)
> [rt.jar:1.7.0_75]
> >>         at
> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60)
> [scheduler.jar:]
> >>         at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
> [quartz.jar:]
> >>         at
> >> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.ja
> >> va:557) [quartz.jar:]
> >>
> >> 2015-03-16 10:10:10,544 WARN
> [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-45) [6d40f562] Failed to refresh VDS , vds =
> 3b87597e-081b-4c89-9b1e-cb04203259f5 : <HOST>, VDS Network Error,
> continuing.
> >> VDSGenericException: VDSNetworkException: Heartbeat exeeded
> >> 2015-03-16 10:10:10,547 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-35) [7e61eee] Correlation ID: 7e61eee, Call
> Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center
> <DC>. Setting Data Center status to Non Responsive (On host <HOST>, Error:
> Network error during communication with the Host.).
> >> 2015-03-16 10:10:10,566 INFO
> >> [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp
> >> Reactor) Connecting to /<IP>
> >> 2015-03-16 10:10:15,804 INFO
> >> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand]
> >> (DefaultQuartzScheduler_Worker-44) [55f2c760] START,
> >> GetHardwareInfoVDSCommand(HostName = <HOST>, HostId =
> >> 3b87597e-081b-4c89-9b1e-cb04203259f5,
> >> vds=Host[<HOST>,3b87597e-081b-4c89-9b1e-cb04203259f5]), log id:
> >> 69516bf
> >> 2015-03-16 10:10:15,823 INFO
> >> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand]
> >> (DefaultQuartzScheduler_Worker-44) [55f2c760] FINISH,
> >> GetHardwareInfoVDSCommand, log id: 69516bf
> >> 2015-03-16 10:10:15,866 INFO
> >> [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand]
> >> (DefaultQuartzScheduler_Worker-44) [16f754a9] Running command:
> >> HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities
> >> affected :  ID: 3b87597e-081b-4c89-9b1e-cb04203259f5 Type: VDS
> >> 2015-03-16 10:10:16,924 INFO
> >> [org.ovirt.engine.core.bll.InitVdsOnUpCommand]
> >> (DefaultQuartzScheduler_Worker-44) [6f19cd0c] Running command:
> >> InitVdsOnUpCommand internal: true. Entities affected :  ID:
> >> 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool
> >> 2015-03-16 10:10:16,931 INFO
> >> [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersComm
> >> and] (DefaultQuartzScheduler_Worker-44) [64352136] Running command:
> >> ConnectHostToStoragePoolServersCommand internal: true. Entities
> >> affected :  ID: 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool
> >> 2015-03-16 10:10:17,052 INFO
> >> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSComm
> >> and] (DefaultQuartzScheduler_Worker-44) [64352136] START,
> >> ConnectStorageServerVDSCommand(HostName = <HOST>, HostId =
> >> 3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId =
> >> 124ae76f-8acb-412e-91cc-dff9f6ec665d, storageType = NFS,
> >> connectionList = [{ id: 03ea1ab7-e96c-410b-911e-905e988b0dc7,
> >> connection: <IP>:/export, iqn: null, vfsType: null, mountOptions:
> >> null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id:
> >> 65744a96-5f4c-4d5f-898b-932eaf97084c, connection: <IP>:/iso, iqn:
> >> null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans:
> >> null, nfsTimeo: null };{ id: 6ca291fc-0a20-4047-9aac-9d166a4c5300,
> >> connection: <IP>:/mnt/storage, iqn: null, vfsType: null, mountOptions:
> >> null, nfsVersion: AUTO, nfsRetrans: null, nfsTimeo: null };]), log id:
> >> 5369ca8f
> >> 2015-03-16 10:10:17,113 INFO
> >> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSComm
> >> and] (DefaultQuartzScheduler_Worker-44) [64352136] FINISH,
> >> ConnectStorageServerVDSCommand, return:
> >> {6ca291fc-0a20-4047-9aac-9d166a4c5300=0,
> >> 65744a96-5f4c-4d5f-898b-932eaf97084c=0,
> >> 03ea1ab7-e96c-410b-911e-905e988b0dc7=0}, log id: 5369ca8f
> >>
> >>
> >>
> >> Corresponding vdsm.log: (these are the only lines around the same
> timeframe):
> >>
> >> Thread-52::DEBUG::2015-03-16
> >> 10:10:10,977::task::595::Storage.TaskManager.Task::(_updateState)
> >> Task=`89a0021d-9d5a-4563-ad44-d320aacbc551`::moving from state init ->
> >> state preparing JsonRpc (StompReactor)::DEBUG::2015-03-16
> >> 10:10:10,982::stompReactor::98::Broker.StompAdapter::(handle_frame)
> >> Handling message <StompFrame command='SEND'>
> >> Thread-52::INFO::2015-03-16
> >> 10:10:10,983::logUtils::44::dispatcher::(wrapper) Run and protect:
> >> getVolumeSize(sdUUID=u'178a38d9-245c-43d3-bff9-6f3a5983bf03',
> >> spUUID=u'124ae76f-8acb-412e-91cc-dff9f6ec665d',
> >> imgUUID=u'fb58d38b-9965-40f3-af45-915a4968a3aa',
> >> volUUID=u'0c28ab0e-b1a0-42b6-8eac-71de1faa6827', options=None)
> >> Thread-27::DEBUG::2015-03-16
> >> 10:10:10,985::fileSD::261::Storage.Misc.excCmd::(getReadDelay)
> >> /usr/bin/dd
> >> if=/rhev/data-center/mnt/<IP>:_mnt_storage/178a38d9-245c-43d3-bff9-6f3
> >> a5983bf03/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1
> >> (cwd None)
> >>
> >>
> >> -----Oorspronkelijk bericht-----
> >> Van: users-bounces at ovirt.org [mailto:users-bounces at ovirt.org] Namens
> >> Piotr Kliczewski
> >> Verzonden: 16 March 2015 08:39
> >> Aan: Michal Skrivanek
> >> CC: users at ovirt.org
> >> Onderwerp: Re: [ovirt-users] Communication errors between engine and
> nodes?
> >>
> >> Can you please provide logs from both ends?
> >>
> >> On Fri, Mar 13, 2015 at 3:17 PM, Michal Skrivanek <
> michal.skrivanek at redhat.com> wrote:
> >>>
> >>> On 13 Mar 2015, at 14:39, Chris Adams wrote:
> >>>
> >>>> Once upon a time, Roel de Rooy <RdeRooy at motto.nl> said:
> >>>>> We are observing the same thing with our oVirt environment.
> >>>>> At random moments (could be a couple of times a day , once a day or
> even once every couple of days), we receive the "VDSNetworkException"
> message on one of our nodes.
> >>>>> Haven't seen the "heartbeat exceeded" message, but could be that I
> overlooked it within our logs.
> >>>>> At some rare occasions, we also do see "Host cannot access the
> Storage Domain(s) <UNKNOWN> attached to the Data Center", within the GUI.
> >>>>>
> >>>>> VM's will continue to run normally and most of the times the nodes
> will be in "UP" state again within the same minute.
> >>>>>
> >>>>> Will still haven't found the root cause of this issue.
> >>>>> Our engine is CentOS 6.6 based and it's happing with both Centos 6
> and Fedora 20 nodes.
> >>>>> We are using a LCAP bond of 1Gbit ports for our management network.
> >>>>>
> >>>>> As we didn't see any reports about this before, we are currently
> looking if something network related is causing this.
> >>>>
> >>>> I just opened a BZ on it (since it isn't just me):
> >>>>
> >>>> https://bugzilla.redhat.com/show_bug.cgi?id=1201779
> >>>>
> >>>> My cluster went a couple of days without hitting this (as soon as I
> >>>> posted to the list of course), but then it happened several times
> >>>> overnight.  Interestingly, one error logged was communicating with
> >>>> the node currently running my hosted engine.  That should rule out
> >>>> external network (e.g. switch and such) issues, as those packets
> >>>> should not have left the physical box.
> >>>
> >>> well, hosted engine complicates things as you'd need to be able to
> >>> see the status of the engine guest running a standalone engine
> installation or at least running that hosted engine on a single node
> without any other VM may help….
> >>>
> >>> Thanks,
> >>> michal
> >>>
> >>>>
> >>>> --
> >>>> Chris Adams <cma at cmadams.net>
> >>>> _______________________________________________
> >>>> Users mailing list
> >>>> Users at ovirt.org
> >>>> http://lists.ovirt.org/mailman/listinfo/users
> >>>
> >>> _______________________________________________
> >>> Users mailing list
> >>> Users at ovirt.org
> >>> http://lists.ovirt.org/mailman/listinfo/users
> >> _______________________________________________
> >> Users mailing list
> >> Users at ovirt.org
> >> http://lists.ovirt.org/mailman/listinfo/users
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ovirt.org/pipermail/users/attachments/20150319/751f4df3/attachment-0001.html>


More information about the Users mailing list