Hi,
Today Piotr and I were looking into an issue [1] related to the
'Heartbeat exceeded' exception that I have been seeing around in some
BZs recently.
Summary:
Symptom: Hosts are showing as non-responsive and VMs can't be managed or
they are jumping between hosts.
Reason: network delays and heartbeat timeouts
Solution: 'vdsHeartbeatInSeconds' must be set to a safe value depending
on your network latency (a value that should never be exceeded).
Values that are too low will make Engine show hosts as non-responsive
(storage may show as unavailable, VMs can't be managed and Engine may
start fencing, snapshotting memory and migrating VMs to new hosts), while
values that are too high will delay the migration of VMs when a host is
really down.
Going further, you will probably run into problems when one Engine
controls multiple Data Centers with hosts on different networks, since
apparently the same vdsHeartbeatInSeconds value is used for all of them.
Besides, for a saturated overseas network with 1000 [ms] latency, the
default vdsHeartbeatInSeconds = 10 [s] = 10,000 [ms] was not enough, so it
seems the value currently has to be guessed by trial and error and then
set via a console command (not obvious at all).
Maybe Engine could statistically learn average delays and deviations per
host, or we may want to increase the default value to reduce future
noise on BZ and let users choose their own lower value if they want
"ultra HA" and to gain a few seconds the next time a host fails (at the
cost of risking false positives due to network delays).
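
To make the "statistically learn" idea a bit more concrete, here is a
minimal sketch in Python. It is not anything Engine does today: the
function names, the multiplier k and the 10 [s] floor are assumptions of
mine, and the samples are assumed to be per-host round-trip delays in
milliseconds collected somewhere else.

```python
import statistics


def suggest_heartbeat_seconds(latencies_ms, k=6.0, floor_s=10.0):
    """Suggest a heartbeat timeout in seconds for one host.

    Uses mean + k * standard deviation of the observed delays (in ms),
    and never goes below floor_s. Both k and floor_s are illustrative
    assumptions, not oVirt defaults. Needs at least one sample.
    """
    mean_ms = statistics.fmean(latencies_ms)
    dev_ms = statistics.pstdev(latencies_ms)
    return max(floor_s, (mean_ms + k * dev_ms) / 1000.0)


def suggest_per_host(samples_by_host, k=6.0, floor_s=10.0):
    """Apply suggest_heartbeat_seconds to every host in a dict.

    samples_by_host maps a host name to a list of delays in ms.
    """
    return {
        host: suggest_heartbeat_seconds(samples, k, floor_s)
        for host, samples in samples_by_host.items()
    }
```

With something like this, a host on a quiet LAN would stay at the floor,
while a host behind a jittery overseas link would get a larger timeout
of its own instead of sharing one global value.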
To be honest, I didn't want to file an RFE, because for my particular use
case a host's MTF is so low that 1 or even 5 minutes of downtime is
acceptable before fencing, snapshotting memory and migrating VMs to a new
host, so I guess I'm just going for vdsHeartbeatInSeconds = 300 to avoid
heartbeat problems for the moment.
___
[1] :
https://bugzilla.redhat.com/show_bug.cgi?id=1180079