[ovirt-users] Re: Self-hosted-engine timeout and recovering time

21 Sep 2022

On Wed, Sep 21, 2022 at 12:22 AM Marcos Sungaila
<marcos.sungaila@oracle.com> wrote:
...
> Hi all,
> I have a cluster running the 4.4.10 release with 6 KVM hosts and Self-Hosted-Engine.
What storage?
...
> I'm testing some network outage scenarios, and I faced strange behavior.
I suppose you have redundancy in your network.

It's important to clarify (for yourself, mainly) what exactly you are
testing, what matters to you, and what outcome you expect.
...
> After disconnecting the KVM host running the SHE, there was a long timeout before the Self-Hosted-Engine switched to another host as expected.
I suggest studying the ha-agent logs, /var/log/ovirt-hosted-engine-ha/agent.log.

Much of the relevant code is in ovirt_hosted_engine_ha/agent/states.py
(in the git repo, or under /usr/lib/python3.6/site-packages/ on your
machine).
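
As a starting point for reading that log, here is a minimal Python
sketch that prints only the moments when the agent's reported state
changes, together with the time since the previous change. The line
format it matches (a "::YYYY-MM-DD HH:MM:SS,mmm::" timestamp followed
by "Current state <Name>") is an assumption about agent.log, and the
names STATE_RE, state_changes and print_timeline are made up for this
example; check a few lines of your own log and adjust the pattern.

```python
import re
from datetime import datetime

# ASSUMPTION: agent.log lines look roughly like
#   MainThread::INFO::2022-09-21 10:00:00,123::hosted_engine::(start_monitoring) Current state EngineUp (score: 3400)
# Adjust this pattern if your log differs.
STATE_RE = re.compile(
    r"::(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+::.*Current state (?P<state>\w+)"
)


def state_changes(lines):
    """Yield (timestamp, state) whenever the reported state differs from the previous one."""
    previous = None
    for line in lines:
        match = STATE_RE.search(line)
        if match is None:
            continue  # not a state report
        state = match.group("state")
        if state != previous:
            yield datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"), state
            previous = state


def print_timeline(path="/var/log/ovirt-hosted-engine-ha/agent.log"):
    """Print each state change and the seconds elapsed since the previous change."""
    last = None
    with open(path, errors="replace") as log:
        for ts, state in state_changes(log):
            gap = "" if last is None else "+{:.0f}s".format((ts - last).total_seconds())
            print(ts, state, gap)
            last = ts
```

Running print_timeline() on each host and comparing the output against
the time you pulled the cable should show in which state the agent
spent most of the delay.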
...
> Also, it took a relatively long time to take over the HA VMs from the failing server.
That's a separate issue, about which I personally know very little.
You might want to start a separate thread about it.

I do know, though, that if you keep the storage connected, the host
might be able to keep updating VM leases on the storage. See e.g.:

https://www.ovirt.org/develop/release-management/features/storage/vm-leases....

I didn't check the admin guide, but I suppose it has some material about HA VMs.
...
> Is there a configuration where I can reduce the SHE timeout to make this recovery process faster?
IIRC there is nothing user-configurable.

You can see most relevant constants in
ovirt_hosted_engine_ha/agent/constants.py{,.in}.
Nothing stops you from changing them, but please note that this is
somewhat risky, and I strongly suggest testing your new settings very
carefully. It might make sense to methodically go through all the
possible state changes in the above state machine.
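
To get an overview of what is there before touching anything, a small
sketch like the following lists the top-level numeric constants of a
Python file without importing or executing it. It is a generic helper,
not part of ovirt-hosted-engine-ha; the function name numeric_constants
is made up for this example, and it assumes the installed constants.py
is plain Python (the .in template may contain placeholders that do not
parse).

```python
import ast


def numeric_constants(source):
    """Return {NAME: value} for top-level UPPER_CASE = <number> assignments."""
    found = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign) or len(node.targets) != 1:
            continue
        target = node.targets[0]
        if not (isinstance(target, ast.Name) and target.id.isupper()):
            continue
        try:
            value = ast.literal_eval(node.value)
        except (ValueError, TypeError):
            continue  # not a plain literal, e.g. computed from another name
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            found[target.id] = value
    return found


def print_constants(path):
    """Print the numeric constants found in the file at `path`, sorted by name."""
    with open(path) as handle:
        for name, value in sorted(numeric_constants(handle.read()).items()):
            print(name, "=", value)
```

For example, print_constants() could be pointed at
ovirt_hosted_engine_ha/agent/constants.py under your site-packages
directory. Saving the output before and after an edit gives you a
record of exactly which values you changed for each test run.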

The general assumption is that network and storage, for critical
setups, are redundant, and that the engine itself is not considered
critical, in the sense that if it's dead, all your VMs are still
alive. And also, that it's more important to not corrupt VM disk
images (e.g. by starting the VM concurrently on two hosts) than to
keep the VM alive.

Best regards,
-- 
Didi

Yedidyah Bar David