Hi,
I know this may sound only loosely related to oVirt, but here is what
happened here:
We have two oVirt datacenters running 3.4.1 (with 3.4 compat. level),
and among many other things, we have a ctdb cluster composed of two
nodes, each of them in a different datacenter.
Some days ago, I raised the compat. level from 3.3 to 3.4 and upgraded
the hosts of datacenter#1 from CentOS 6.4 to 6.5 (datacenter#2 was
already on 6.5). Since then, our two ctdb nodes (CentOS 6.5 VMs) have
been constantly falling into cluster partition.
Almost 3 days of googling led me to something that seems to have been
known for some time: multicast on bridged interfaces is more or less
supported, or randomly supported.
See :
http://lists.corosync.org/pipermail/discuss/2012-November/002208.html
On the affected hosts, I tried the suggested workaround, and yes, that
stabilized the situation.
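For anyone wanting to check their own hosts, here is a minimal sketch
that only reads the current state and changes nothing. It assumes the
relevant setting is the bridge's multicast (IGMP) snooping flag, which
the Linux kernel exposes under
/sys/class/net/<bridge>/bridge/multicast_snooping; this message does not
spell out the workaround, so treat that as my assumption. The function
name bridge_snooping_states is purely illustrative.

```python
import os


def bridge_snooping_states(sysfs_net="/sys/class/net"):
    """Return {bridge_name: snooping_enabled} for each Linux bridge found.

    Assumption: the setting of interest is the kernel's per-bridge
    multicast_snooping attribute. Interfaces without a bridge/
    subdirectory (plain NICs, bonds, ...) are skipped.
    """
    states = {}
    if not os.path.isdir(sysfs_net):
        return states
    for name in sorted(os.listdir(sysfs_net)):
        path = os.path.join(sysfs_net, name, "bridge", "multicast_snooping")
        try:
            with open(path) as f:
                # The kernel reports "1" when snooping is on, "0" when off.
                states[name] = f.read().strip() == "1"
        except OSError:
            # Not a bridge, or the attribute is not readable.
            continue
    return states


if __name__ == "__main__":
    for bridge, enabled in bridge_snooping_states().items():
        print(f"{bridge}: multicast_snooping={'on' if enabled else 'off'}")
```

Run on each hypervisor, this lists every bridge with its snooping state,
which makes it easy to compare the hosts of both datacenters.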
I understand this is not directly oVirt/RHEV related, but I really don't
get why this ctdb cluster worked for months and stopped only recently
(I'll have to dig deep into the release notes of the upgraded packages
and try to find something useful).
I am posting this here to:
- ask if some of you are also running clusters among VMs (not
necessarily across datacenters - communication between VMs on different
hosts may also be an issue)
- leave a trace in case it helps debug some setups
Regards,
--
Nicolas Ecarnot