vdsHeartbeatInSeconds and HA vs stability

Hi,

Today Piotr and I were looking into an issue [1] related to the 'Heartbeat exceeded' exception that I have been seeing in several BZs recently.

Summary:

Symptom: Hosts are shown as non-responsive, and VMs can't be managed or keep jumping between hosts.
Reason: Network delays and heartbeat timeouts.
Solution: 'vdsHeartbeatInSeconds' must be set to a safe value for your network latency (a value that should never be exceeded).

Values that are too low make Engine show hosts as non-responsive (storage may show as unavailable, VMs can't be managed, and Engine may start fencing, snapshotting memory and migrating VMs to new hosts), while values that are too high delay the migration of VMs when a host is really down.

Going further, you will probably have problems when one Engine controls multiple Data Centers with hosts on different networks, since apparently the same vdsHeartbeatInSeconds value is used for all of them.

Besides, for a saturated overseas network with 1000 ms latency, the default vdsHeartbeatInSeconds = 10 s (10,000 ms) was not enough, so it seems the value must currently be found by trial and error and then set via a console command (not obvious at all).

Maybe Engine could statistically learn average delays and deviations per host. Alternatively, we may want to increase the default value to reduce future noise on BZ, and let users who want "ultra HA" choose their own low-latency setting to win a few seconds the next time a host fails (at the cost of risking false positives due to network delays).

To be honest, I didn't want to file an RFE, because in my particular use case a host's failure rate is so low that 1 or even 5 minutes of downtime is acceptable before fencing, snapshotting memory and migrating VMs to a new host. So I guess I'm just going with vdsHeartbeatInSeconds = 300 to avoid heartbeat problems for the moment.

___
[1] : https://bugzilla.redhat.com/show_bug.cgi?id=1180079
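To make the "learn average delays and deviations per host" idea a bit more concrete, here is a minimal sketch in Python. It is not how Engine works today, and everything in it is an assumption for illustration: the function name suggest_heartbeat_seconds, the safety factor k, the 10 s floor (taken from the default mentioned above), and the idea that per-host delay samples in milliseconds are available at all.

```python
import math
import statistics


def suggest_heartbeat_seconds(delays_ms, k=6.0, floor_s=10):
    """Suggest a heartbeat timeout (in whole seconds) from observed delays.

    Hypothetical helper, not part of oVirt.
    delays_ms: observed response delays for one host, in milliseconds.
    k:         how many standard deviations of headroom to add (assumed).
    floor_s:   never suggest less than this many seconds (assumed).
    """
    if len(delays_ms) < 2:
        raise ValueError("need at least two delay samples")

    mean_ms = statistics.fmean(delays_ms)
    dev_ms = statistics.stdev(delays_ms)  # sample standard deviation

    # Budget = typical delay plus generous headroom for jitter.
    budget_ms = mean_ms + k * dev_ms

    # Round up to whole seconds and respect the floor.
    return max(floor_s, math.ceil(budget_ms / 1000))


def suggest_shared_heartbeat(per_host_delays_ms, k=6.0, floor_s=10):
    """One value for all hosts: take the worst (largest) per-host suggestion,
    since a single setting appears to be shared across Data Centers."""
    return max(
        suggest_heartbeat_seconds(samples, k, floor_s)
        for samples in per_host_delays_ms.values()
    )


if __name__ == "__main__":
    # Made-up sample data; host names are hypothetical.
    samples = {
        "local-host": [20, 25, 18, 30],
        "overseas-host": [9000, 15000],
    }
    for host, delays in samples.items():
        print(host, suggest_heartbeat_seconds(delays))
    print("shared:", suggest_shared_heartbeat(samples))
```

Because one value seems to apply to every host, the shared suggestion is driven by the worst network, which is exactly the trade-off described above: a safe value for the overseas hosts delays failure detection for the local ones. As far as I know the value itself is applied with engine-config -s vdsHeartbeatInSeconds=<value> followed by an Engine restart, but please check that against your version.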
Christopher Pereira