Once upon a time, Lior Vernia <lvernia(a)redhat.com> said:
If I'm not mistaken, heartbeat intervals are configured to 10 seconds by
default.
Okay, thanks.
The command times out queries for the status of VMs on a host - any
reason to suspect why that's taking long? Does it happen on specific hosts?
No idea. It seemed to happen on node5 a bunch over a week, but then
there were errors on other nodes as well. It isn't always "Heartbeat
exceeded", sometimes it is "VDSNetworkException: Message timeout which
can be caused by communication issues". I haven't been able to find any
network issues that could cause this (no errors logged anywhere).
There doesn't seem to be any pattern to when it happens either. The log
entry I posted was from 04:42 local time, and a bunch of the VMs are
CentOS 5, which does log rotation at 04:00 by default (which can spike
the CPU and disk I/O), but they are all done long before 04:42. It
happened in the middle of the afternoon a couple of days ago, while I
was logged in to the web UI, and I didn't notice any unusual behavior.
One other odd thing: I have also been experiencing an issue where I
randomly get logged out of the web UI. Usually nothing else was going
on, but a couple of times it seemed to correspond with one of the node
errors (hard to tell). It looked like the same error as BZ 1198493 (I'd
see a bunch of "Failed to log User null@N/A out" messages). I don't
know if these issues are related or if that was just a coincidence.
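
One way to make "hard to tell" less of a guess is to line up the two sets of timestamps mechanically. The following is only a rough sketch: the function name, the 60-second window, and the sample timestamps are all made up for illustration and are not taken from any real engine or vdsm log.

```python
from datetime import datetime, timedelta


def coinciding_events(node_errors, ui_logouts, window=timedelta(seconds=60)):
    """Return (logout, error) pairs whose timestamps lie within `window`."""
    pairs = []
    for logout in ui_logouts:
        for error in node_errors:
            if abs(logout - error) <= window:
                pairs.append((logout, error))
    return pairs


# Hypothetical timestamps, for illustration only.
node_errors = [
    datetime(2000, 1, 1, 4, 42, 0),
    datetime(2000, 1, 2, 15, 10, 0),
]
ui_logouts = [
    datetime(2000, 1, 1, 4, 42, 30),
    datetime(2000, 1, 3, 9, 0, 0),
]

matches = coinciding_events(node_errors, ui_logouts)
for logout, error in matches:
    print(f"logout at {logout} is within 60s of node error at {error}")
```

If most logouts have a node error close by, the two problems are probably related; if only the occasional one does, coincidence is the more likely explanation.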
To try to rule out any unseen network issues, I started an fping to all
seven nodes and the engine from another physical system on the same
VLAN. It is sending one ping to each of the eight hosts every 0.2
seconds. That has not shown a dropped packet since I started yesterday
afternoon. However, during that time, I also have not seen any
engine/vdsm timeouts. I was going to say I had not been logged out of
the web UI, but that just happened while I was typing the previous
sentence.
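
For comparing probe results against the engine log, it helps to have each failure timestamped. Below is a minimal sketch of that idea, with some loud caveats: it uses a TCP connect instead of the ICMP echo that fping sends, so it is not an equivalent test, and the host names and port in the usage comment are placeholders, not the real ones.

```python
import socket
import time


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def watch(hosts, port, rounds, interval=0.2, probe=tcp_probe):
    """Probe every host once per round; return (timestamp, host) failures."""
    failures = []
    for _ in range(rounds):
        for host in hosts:
            if not probe(host, port):
                failures.append((time.time(), host))
        time.sleep(interval)
    return failures


# Example call (placeholder names and port, adjust to the real setup):
# failures = watch(["node1", "node2", "engine"], port=22, rounds=300)
```

The `probe` argument is there so the loop can be exercised with a stand-in function, or swapped for a different kind of check, without touching the rest.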
--
Chris Adams <cma(a)cmadams.net>