Hi Didi,
Thanks for your comments.
Yes, I do have redundancy for network and storage connections.
I'm testing a catastrophic scenario: losing all communication with a host, and a
crash of the host on which the SHE runs.
I want to understand what to expect from the running VMs and the Engine application.
As you said, all VMs running on other hosts keep running and are not impacted.
I will try to collect more information from the logs and understand the reference codes
and constants you mentioned.
Thanks again for your help.
Marcos Sungaila
-----Original Message-----
From: Yedidyah Bar David <didi(a)redhat.com>
Sent: Wednesday, September 21, 2022 2:46 AM
To: Marcos Sungaila <marcos.sungaila(a)oracle.com>
Cc: users(a)ovirt.org
Subject: [External] : Re: [ovirt-users] Self-hosted-engine timeout and recovering time
On Wed, Sep 21, 2022 at 12:22 AM Marcos Sungaila <marcos.sungaila(a)oracle.com>
wrote:
Hi all,
I have a cluster running the 4.4.10 release with 6 KVM hosts and Self-Hosted-Engine.
What storage?
I'm testing some network outage scenarios, and I faced strange
behavior.
I suppose you have redundancy in your network.
It's important to clarify (for yourself, mainly) what exactly you test, what's
important, what's expected, etc.
After disconnecting the KVM host running the SHE, there was a long
timeout before the Self-Hosted-Engine switched to another host as expected.
I suggest studying the ha-agent logs, /var/log/ovirt-hosted-engine-ha/agent.log.
Much of the relevant code is in ovirt_hosted_engine_ha/agent/states.py
(in the git repo, or under /usr/lib/python3.6/site-packages/ on your machine).
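As a starting point for studying the agent log, here is a minimal Python sketch that pulls state changes, and the time spent in each state, out of agent.log. The timestamp layout and the "Current state <Name>" wording in the two regexes are assumptions about the log format, and the helper names (state_changes, durations, report) are made up for illustration; adjust the patterns to whatever your log actually contains.

```python
import re
from datetime import datetime

# ASSUMPTION: each relevant line carries a timestamp like "2022-09-21 10:00:00,123".
TIMESTAMP_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")
# ASSUMPTION: the agent reports its state as "Current state <Name>".
STATE_RE = re.compile(r"Current state (\w+)")


def state_changes(lines):
    """Return (timestamp, state) pairs, one per change of the reported state."""
    changes = []
    last_state = None
    for line in lines:
        ts = TIMESTAMP_RE.search(line)
        st = STATE_RE.search(line)
        if not (ts and st):
            continue  # line has no timestamp or no state report
        state = st.group(1)
        if state != last_state:
            when = datetime.strptime(ts.group(1), "%Y-%m-%d %H:%M:%S")
            changes.append((when, state))
            last_state = state
    return changes


def durations(changes):
    """Return (state, seconds) for every state that was followed by another one."""
    return [
        (a_state, (b_ts - a_ts).total_seconds())
        for (a_ts, a_state), (b_ts, _) in zip(changes, changes[1:])
    ]


def report(path):
    """Print how long the agent stayed in each state, in log order."""
    with open(path, errors="replace") as fh:
        for state, secs in durations(state_changes(fh)):
            print("{}: {:.0f}s".format(state, secs))
```

Running report("/var/log/ovirt-hosted-engine-ha/agent.log") on both the failed host and the host that took over may help show where the time went.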
Also, it took a relatively long time to take over the HA VMs from
the failing server.
That's a separate issue, about which I personally know very little.
You might want to start a separate thread about it.
I do know, though, that if you keep the storage connected, the host might be able to keep
updating VM leases on the storage. See e.g.:
https://urldefense.com/v3/__https://www.ovirt.org/develop/release-managem...
I didn't check the admin guide, but I suppose it has some material about HA VMs.
Is there a configuration where I can reduce the SHE timeout to make
this recovery process faster?
IIRC there is nothing user-configurable.
You can see most relevant constants in
ovirt_hosted_engine_ha/agent/constants.py{,.in}.
Nothing stops you from changing them, but please note that this is somewhat risky, and I
strongly suggest doing very careful testing with your new settings. It might make sense to
try to methodically go through all the possible state changes in the above state
machine.
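Before changing anything, it may help to list what is there. The following is a minimal Python sketch that reads a constants module as text and prints its upper-case numeric assignments without importing it. The function name numeric_constants is made up for illustration, and the full path is an assumption combining the site-packages directory and module path mentioned above. Point it at the installed constants.py; the .in template may contain build-time placeholders and might not parse as Python.

```python
import ast


def numeric_constants(source):
    """Map UPPER_CASE module-level names to their int/float literal values."""
    found = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        try:
            value = ast.literal_eval(node.value)
        except (ValueError, TypeError, SyntaxError):
            continue  # right-hand side is not a plain literal
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id.isupper():
                found[target.id] = value
    return found


def show(path):
    """Print the numeric constants found in the file at `path`."""
    with open(path) as fh:
        for name, value in sorted(numeric_constants(fh.read()).items()):
            print("{} = {}".format(name, value))
```

For example, show("/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/constants.py") would print one line per numeric constant. Saving that output before and after an edit gives a simple record of which values were changed.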
The general assumption is that network and storage, for critical setups, are redundant,
and that the engine itself is not considered critical, in the sense that if it's dead,
all your VMs are still alive. And also, that it's more important to not corrupt VM
disk images (e.g. by starting the VM concurrently on two hosts) than to keep the VM
alive.
Best regards,
--
Didi