Hello,
Last week, one of our DCs went through a network crash, and surprisingly,
most of our hosts survived.
Some of them lost their connectivity and were stonithed.
I'd like to be sure I understand which tests are run to declare a host
valid:
- On the storage part, I guess EVERY[1] host is doing a read+write test
(using "dd") towards the storage domain(s), every... say 5 seconds (?)
In case of failure, I guess a countdown is triggered until this host is
shot.
However, the network failure we faced was not on the dedicated storage
network, but purely on the "LAN" network (5 virtual networks).
- What kind of test is done on each host to declare that connectivity is
OK on every virtual network?
I ask that because oVirt has no knowledge of any gateway it could ping,
and in some cases, some virtual networks don't even have a gateway.
Is it a ping towards the SPM?
Towards the engine?
Is it a ping?
I ask that because I found out that some hosts restarted nicely and ran
some VMs whose NICs were OK, but inside those guests we find evidence
that they were not able to communicate with very simple networks usually
provided by the host.
So I'm trying to figure out whether a host could come back to life while
being only partially sound.
[1] Thus, I don't clearly see the benefit of the SPM concept...
--
Nicolas ECARNOT