So, in my case, I'm wondering if maybe there is some kind of weird
network issue happening.
The node that seems to be showing up most for the last day or two is one
of the two nodes running the hosted-engine HA, and is _not_ currently
hosting the engine. It seems that, at the same time the engine has
trouble communicating with that node, the hosted-engine HA running on
that node has trouble seeing the engine.
I still can't find any actual network problem. Using another physical
system, I ran fping to all the nodes and the engine with a 0.2 second
interval, and that didn't show any problem (I ran it until I also saw an
instance of the engine->node communication error). I'm watching ARP
traffic now to see if something is sending bad answers. I'm pretty
stumped at this point as to what to look at next.
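
For the ARP check, a minimal sketch of one way to spot "bad answers":
flag any IP address that gets ARP replies from more than one MAC. It
assumes the capture was saved as text from something like
`tcpdump -n arp`, with reply lines of the form
"Reply <ip> is-at <mac>". The function name and the regex are
illustrative assumptions, not part of any oVirt tooling.

```python
import re
from collections import defaultdict

# Assumed line shape from a text capture, e.g.:
#   "... ARP, Reply 10.0.0.5 is-at aa:bb:cc:dd:ee:ff, length 46"
ARP_REPLY = re.compile(r"Reply (\d+\.\d+\.\d+\.\d+) is-at ([0-9a-fA-F:]{17})")


def conflicting_arp_replies(lines):
    """Return {ip: [macs]} for every IP answered by more than one MAC."""
    seen = defaultdict(set)
    for line in lines:
        match = ARP_REPLY.search(line)
        if match:
            ip, mac = match.group(1), match.group(2).lower()
            seen[ip].add(mac)
    # Keep only IPs that were claimed by two or more distinct MACs.
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}
```

Feeding it the lines of a saved capture (for example
`conflicting_arp_replies(open("arp.txt"))`, where arp.txt is a
hypothetical file name) returns an empty dict when every IP maps to a
single MAC. Note that a legitimate failover or bonded interface can
also show up as a MAC change, so a hit is a lead rather than proof.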
--
Chris Adams <cma(a)cmadams.net>