[ovirt-users] Communication errors between engine and nodes?

Roel de Rooy RdeRooy at motto.nl
Fri Mar 13 13:11:12 UTC 2015


We are observing the same thing in our oVirt environment.
At random moments (a couple of times a day, once a day, or even once every couple of days), we receive the "VDSNetworkException" message on one of our nodes.
I haven't seen the "heartbeat exceeded" message, but I may have overlooked it in our logs.
On rare occasions we also see "Host cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center" in the GUI.

VMs continue to run normally, and most of the time the nodes are back in the "UP" state within the same minute.

We still haven't found the root cause of this issue.
Our engine is CentOS 6.6 based, and it's happening with both CentOS 6 and Fedora 20 nodes.
We are using an LACP bond of 1 Gbit ports for our management network.

As we hadn't seen any reports about this before, we are currently investigating whether something network related is causing it.
 
-----Original Message-----
From: users-bounces at ovirt.org [mailto:users-bounces at ovirt.org] On Behalf Of Chris Adams
Sent: 12 March 2015 14:23
To: users at ovirt.org
Subject: Re: [ovirt-users] Communication errors between engine and nodes?

Once upon a time, Lior Vernia <lvernia at redhat.com> said:
> If I'm not mistaken, heartbeat intervals are configured to 10 seconds 
> by default.

Okay, thanks.

> The command times out queries for the status of VMs on a host - any 
> reason to suspect why that's taking long? Does it happen on specific hosts?

No idea.  It seemed to happen on node5 a bunch over a week, but then there were errors on other nodes as well.  It isn't always "Heartbeat exceeded"; sometimes it is "VDSNetworkException: Message timeout which can be caused by communication issues".  I haven't been able to find any network issues that could cause this (no errors logged anywhere).

There doesn't seem to be any pattern to when it happens either.  The log entry I posted was from 04:42 local time, and a bunch of the VMs are CentOS 5, which does log rotation at 04:00 by default (which can spike the CPU and disk I/O), but they are all done long before 04:42.  It also happened in the middle of the afternoon a couple of days ago, while I was logged in to the web UI, and I didn't notice any unusual behavior.
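
One way to check for a time-of-day pattern more systematically would be to count the timeout entries per hour. The following is only a rough sketch, not something that was run in this thread: it assumes each engine log entry starts with a "YYYY-MM-DD HH:MM:SS" timestamp, and it matches on the substrings "VDSNetworkException" and "Heartbeat" taken from the messages quoted above. The function name is made up for illustration.

```python
import re
from collections import Counter

# Assumption: each engine log entry starts with "YYYY-MM-DD HH:MM:SS".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} (\d{2}):\d{2}:\d{2}")

# Substrings taken from the error messages quoted in this thread.
MARKERS = ("VDSNetworkException", "Heartbeat")


def count_timeouts_by_hour(lines):
    """Return a Counter mapping hour of day (0-23) to the number of matching entries."""
    hours = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        # Skip continuation lines (no timestamp) and unrelated entries.
        if match and any(marker in line for marker in MARKERS):
            hours[int(match.group(1))] += 1
    return hours
```

Passing an open file handle for the engine log (typically /var/log/ovirt-engine/engine.log, though that path is an assumption about the install) and printing the sorted counter would show whether the errors cluster around particular hours, such as the 04:00 log rotation.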

One other odd thing: I have also been experiencing an issue where I randomly get logged out of the web UI.  Usually nothing else was going on, but a couple of times it seemed to correspond with one of the node errors (hard to tell).  It looked like the same error as BZ 1198493 (I'd see a bunch of "Failed to log User null at N/A out" messages).  I don't know if these issues are related or if that was just a coincidence.

To try to rule out any unseen network issues, I started an fping to all seven nodes and the engine from another physical system on the same VLAN.  It is sending one ping to each of the eight hosts every 0.2 seconds.  That has not shown a dropped packet since I started yesterday afternoon.  However, during that time, I also have not seen any engine/vdsm timeouts.  I was going to say I had not been logged out of the web UI, but that just happened while I was typing the previous sentence.
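
A ping test like this only exercises ICMP. A comparable check at the TCP level could be sketched as below. This is an illustration under stated assumptions, not what was actually run here: the function names are hypothetical, and the default port 54321 assumes vdsm is listening on its usual port on each node.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return None


def monitor(hosts, port=54321, interval=0.2, rounds=None):
    """Probe every host once per round and return the number of failed probes.

    Assumption: port 54321 is the vdsm port on the nodes.
    With rounds=None the loop runs until interrupted.
    """
    failures = 0
    done = 0
    while rounds is None or done < rounds:
        for host in hosts:
            if probe(host, port) is None:
                failures += 1
                print(time.strftime("%Y-%m-%d %H:%M:%S"), host, "no TCP connect")
        time.sleep(interval)
        done += 1
    return failures
```

Calling monitor() with the list of node hostnames from a machine on the same VLAN would log a timestamp for every failed connect, which could then be compared against the times of the engine errors.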

--
Chris Adams <cma at cmadams.net>