[ovirt-devel] vdsHeartbeatInSeconds and HA vs stability

Christopher Pereira kripper at imatronix.cl
Fri Jul 24 10:42:25 UTC 2015


Hi,

Today Piotr and I were looking into an issue [1] related to the 
'Heartbeat exCeeded' exception that I have been seeing around in some 
BZs recently.

Summary:

Symptom: Hosts are showing as non-responsive and VMs can't be managed or 
they are jumping between hosts.
Reason: network delays and heartbeat timeouts
Solution: 'vdsHeartbeatInSeconds' must be set to a safe value depending 
on your network latency (a value that should never be exceeded).

Values that are too low will make Engine show hosts as non-responsive 
(storage may show as unavailable, VMs can't be managed and Engine may 
start fencing, snapshotting memory and migrating VMs to new hosts), 
while values that are too high will delay the migration of VMs when a 
host is really down.
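
To make the trade-off concrete, here is a minimal sketch of a timeout 
check. It is a simplified model, not Engine's actual code: the function 
name is_non_responsive and its parameters are made up for illustration.

```python
def is_non_responsive(seconds_since_last_message: float,
                      heartbeat_timeout_s: float) -> bool:
    """Simplified model: a host is flagged once it has been silent
    for longer than the heartbeat timeout."""
    return seconds_since_last_message > heartbeat_timeout_s


# A healthy host whose messages are delayed 12 s by a saturated link:
print(is_non_responsive(12.0, 10))   # True: false positive with a 10 s timeout
print(is_non_responsive(12.0, 300))  # False: tolerated with a 300 s timeout

# The flip side: a host that really died is only flagged after the
# timeout has elapsed, so a 300 s timeout means up to 300 s before
# anything reacts.
```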

Going further, you will probably have problems when having one Engine 
controlling multiple Data Centers with hosts on different networks since 
apparently the same vdsHeartbeatInSeconds value is used for all of them.

Besides, for a saturated overseas network with 1000 [ms] latency, the 
default vdsHeartbeatInSeconds = 10 [s] = 10,000 [ms] was not enough, so 
it seems the value currently has to be guessed by trial and error and 
then set via a console command (not obvious at all).

Maybe Engine could statistically learn average delays and deviations per 
host, or we may want to increase the default value to reduce future 
noise on BZ and let users choose their own lower setting if they want 
"ultra HA" and want to gain a few seconds the next time a host fails (at 
the cost of risking false positives due to network delays).
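
As a rough sketch of the "learn delays per host" idea (nothing like this 
exists in Engine as far as I know), a timeout could be derived from the 
mean and standard deviation of observed delays. The function name, the 
factor k and the 10 s floor are assumptions chosen for illustration.

```python
import statistics


def suggest_heartbeat_seconds(delays_s, k=4.0, floor_s=10.0):
    """Suggest a per-host timeout as mean + k * stddev of observed
    delays (in seconds), never going below floor_s."""
    if len(delays_s) < 2:
        # Not enough samples to estimate a deviation; use the floor.
        return floor_s
    mean = statistics.fmean(delays_s)
    dev = statistics.stdev(delays_s)
    return max(floor_s, mean + k * dev)


# Hypothetical samples: a LAN host and a host behind a saturated link.
lan_host = [0.010, 0.020, 0.015, 0.012]
overseas_host = [1.0, 4.0, 9.0, 12.0, 2.0]

print(suggest_heartbeat_seconds(lan_host))       # stays at the floor
print(suggest_heartbeat_seconds(overseas_host))  # well above the floor
```

With something like this, each host would get its own value instead of 
one global setting, which would also help with the multi Data Center 
case mentioned above.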

To be honest, I didn't want to file an RFE, because for my particular use 
case a host's MTF is so low that 1 or even 5 minutes of downtime is 
acceptable before fencing, snapshotting memory and migrating VMs to a new 
host, so I guess I'm just going with vdsHeartbeatInSeconds = 300 to avoid 
heartbeat problems for the moment.

___
[1] : https://bugzilla.redhat.com/show_bug.cgi?id=1180079



