What network test validates a host?

Hello,

Last week, one of our DCs went through a network crash, and surprisingly, most of our hosts resisted. Some of them lost their connectivity and were stonithed.

I'd like to be sure I understand what tests are made to declare a host valid:

- On the storage part, I guess EVERY[1] host is doing a read+write test (using "dd") towards the storage domain(s), every... say 5 seconds (?) In case of failure, I guess a countdown is triggered until this host is shot. But the network failure we faced was not on the dedicated storage network, but purely on the "LAN" network (5 virtual networks).

- What kind of test is done on each host to declare that connectivity is OK on every virtual network? I ask because oVirt has no knowledge of any gateway it could ping, and in some cases, some virtual networks don't even have a gateway. Is it a ping towards the SPM? Towards the engine? Is it a ping at all?

I ask because I found that some hosts restarted nicely and ran some VMs whose NICs were OK, but inside those guests we found evidence that they were not able to communicate over very simple networks usually provided by the host. So I'm trying to figure out whether a host could come back to life, but only partially sound.

[1] Thus, I don't clearly see the benefit of the SPM concept...

--
Nicolas ECARNOT

Hello Nicolas,

In general, oVirt Engine frequently checks the host state by asking it to send a stats report. As part of that report, NIC state is reported. The Engine will move the host to non-operational if a 'required' network's NIC link is down, or if it cannot reach the host through the management network.

One can also use a VDSM hook to check connectivity against a reference IP and fake the NIC state accordingly.

In case storage domain connectivity fails (attempts to read fail), it is reported back to the Engine through the stats report, and the Engine will move the host to non-operational after a few minutes.

Thanks,
Edy.
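For illustration only, here is a minimal sketch of the kind of reference-IP check such a hook could perform. The network names, reference IPs, and the printed result are assumptions made up for the example; this is not the actual VDSM hook API.

#!/usr/bin/env python3
# Minimal sketch of a reference-IP connectivity check, similar in spirit to
# what a VDSM hook could do. The REFERENCE_IPS mapping and the way a real
# hook would report back to VDSM are assumptions for illustration only.
import subprocess

# Hypothetical: one reference IP per logical network we care about.
REFERENCE_IPS = {
    "ovirtmgmt": "10.0.0.1",
    "vm_lan": "192.168.10.1",
}

def reachable(ip, count=3, timeout=2):
    """Return True if the reference IP answers at least one ping."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def check_networks():
    """Map each network to an 'up'/'down' verdict based on its reference IP."""
    return {net: ("up" if reachable(ip) else "down")
            for net, ip in REFERENCE_IPS.items()}

if __name__ == "__main__":
    for net, state in check_networks().items():
        print(f"{net}: {state}")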

On Wed, Jun 1, 2016 at 2:27 PM, Nicolas Ecarnot <nicolas@ecarnot.net> wrote:
Hello,
Last week, one of our DCs went through a network crash, and surprisingly, most of our hosts resisted. Some of them lost their connectivity and were stonithed.
I'd like to be sure I understand what tests are made to declare a host valid:
- On the storage part, I guess EVERY[1] host is doing a read+write test (using "dd") towards the storage domain(s), every... say 5 seconds (?)
We do:

- every 10 seconds (irs:sd_health_check_delay):
  - read the first block from the metadata volume
  - check if the vg is partial (block storage)
  - perform a statvfs call (file storage)
  - validate the master domain mount
- every 5 minutes (irs:repo_stats_cache_refresh_timeout):
  - run vgck (block storage)

We do not check writes to the storage. I guess we should add this, or monitor sanlock status, which does write to all storage domains every 20 seconds.
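To make the flow above concrete, here is a rough sketch of such a periodic per-domain check. The function names, the domain dict layout, and the way results are reported are illustrative assumptions; this is not VDSM's actual code.

# Rough sketch of a periodic storage-domain health check, loosely following
# the steps listed above (read metadata, statvfs for file storage).
# Paths, names and the interval variable are illustrative assumptions.
import os
import time

SD_HEALTH_CHECK_DELAY = 10  # seconds, like irs:sd_health_check_delay

def check_block_domain(metadata_dev):
    """Read the first block of the metadata volume to prove the path works."""
    with open(metadata_dev, "rb") as f:
        f.read(4096)

def check_file_domain(mount_point):
    """statvfs call on the mount, as done for file-based storage domains."""
    os.statvfs(mount_point)

def report(domain_id, valid, error=None):
    # Placeholder for the stats report sent back to the engine.
    print(domain_id, "valid" if valid else f"invalid ({error})")

def monitor(domain):
    """Loop forever, reporting the domain valid/invalid on each iteration."""
    while True:
        try:
            if domain["type"] == "block":
                check_block_domain(domain["metadata_dev"])
            else:
                check_file_domain(domain["mount_point"])
            report(domain["id"], valid=True)
        except OSError as e:
            report(domain["id"], valid=False, error=e)
        time.sleep(SD_HEALTH_CHECK_DELAY)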
In case of failure, I guess a countdown is triggered until this host is shot.
In case of failure, the domain status is reported as invalid with an error code. On the engine side, we start a 5-minute timer (configurable). If the domain does not recover from the invalid state before the timer expires, we consider the domain as failing. If the domain is failing on only one host, that host will become non-operational. If the domain is failing on all hosts, it will be deactivated. I think we also try to recover the domain, but I don't know the details.
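A condensed sketch of that engine-side decision, assuming a 5-minute grace period and a simple in-memory table of invalid reports. All names are illustrative, not the engine's real code.

# Sketch of the logic described above: a domain that stays invalid on a host
# beyond the grace period is treated as failing; failing on one host makes
# that host non-operational, failing on all hosts deactivates the domain.
import time

GRACE_PERIOD = 5 * 60  # seconds; configurable on the real engine

invalid_since = {}  # (host_id, domain_id) -> time the domain first went invalid

def on_stats_report(host_id, domain_id, valid, all_hosts):
    """Handle one host's report about one storage domain."""
    key = (host_id, domain_id)
    if valid:
        invalid_since.pop(key, None)  # domain recovered, cancel the timer
        return

    first_seen = invalid_since.setdefault(key, time.time())
    if time.time() - first_seen < GRACE_PERIOD:
        return  # still inside the grace period, give it a chance to recover

    failing_hosts = {
        h for (h, d), ts in invalid_since.items()
        if d == domain_id and time.time() - ts >= GRACE_PERIOD
    }
    if failing_hosts >= set(all_hosts):
        deactivate_domain(domain_id)       # no host can reach the domain
    else:
        set_host_non_operational(host_id)  # only this host has a problem

def deactivate_domain(domain_id):
    print(f"deactivating storage domain {domain_id}")

def set_host_non_operational(host_id):
    print(f"moving host {host_id} to non-operational")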
But the network failure we faced was not on the dedicated storage network, but purely on the "LAN" network (5 virtual networks).
- What kind of test is done on each host to declare that connectivity is OK on every virtual network? I ask because oVirt has no knowledge of any gateway it could ping, and in some cases, some virtual networks don't even have a gateway. Is it a ping towards the SPM?
The Engine checks the SPM host status regularly, and if it fails, it will try to stop it and start the SPM on another host.
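As a toy illustration of that failover idea (the function names are hypothetical, not the engine's API):

# Tiny sketch of SPM failover: poll the current SPM, and if it stops
# answering, stop it and start the SPM role on another live host.
def ensure_spm(hosts, current_spm, is_alive, stop_spm, start_spm):
    if is_alive(current_spm):
        return current_spm
    stop_spm(current_spm)  # make sure the old SPM can no longer write metadata
    for candidate in hosts:
        if candidate != current_spm and is_alive(candidate):
            start_spm(candidate)
            return candidate
    raise RuntimeError("no host available to take the SPM role")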
Towards the engine? Is it a ping?
I ask because I found that some hosts restarted nicely and ran some VMs whose NICs were OK, but inside those guests we found evidence that they were not able to communicate over very simple networks usually provided by the host. So I'm trying to figure out whether a host could come back to life, but only partially sound.
[1] Thus, I don't clearly see the benefit of the SPM concept...
The SPM is the only host that can do metadata operations on shared storage. Without it, your data would be corrupted, so there is a benefit. However, there are many issues with the SPM, and we are working on removing it and the master domain, and replacing them with a more fault-tolerant, efficient, and easier-to-maintain solution.

Nir
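To illustrate why a single metadata writer matters, here is a generic single-writer pattern built around an exclusive lock on a file that would live on shared storage. This is only an analogy; oVirt actually uses sanlock leases rather than flock, and the path below is made up.

# Generic illustration of the "single metadata writer" idea behind the SPM:
# only the process holding an exclusive lock on a shared lease file may touch
# the shared metadata. Analogy only; not oVirt's real mechanism.
import fcntl

LEASE_FILE = "/rhev/data-center/example/spm.lease"  # hypothetical path

def with_spm_lease(do_metadata_operation):
    with open(LEASE_FILE, "w") as lease:
        # Blocks until no other process holds the exclusive lock.
        fcntl.flock(lease, fcntl.LOCK_EX)
        try:
            do_metadata_operation()
        finally:
            fcntl.flock(lease, fcntl.LOCK_UN)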
participants (3):
- Edward Haas
- Nicolas Ecarnot
- Nir Soffer