On Wed, Jan 25, 2023 at 8:22 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
On Wed, Jan 25, 2023 at 2:08 AM Peter H <peter(a)hashbang.org> wrote:
> Does a config file exist where this timeout can be set to a lower value?
I intended to provide a short reply just pointing out what value to
change, then realized this might not be helpful, so decided to give up
and not reply. Then I decided to take this opportunity and write the
following.
For background, please see:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/HEKKBM6MZEKB...
Thanks for taking the time to answer me at such length. I was unaware
of the tool engine-config(1) even after maintaining an oVirt cluster
for 3 years... The tool was just what I was looking for.
You do not need to be a developer to search and read source code. One
of the biggest advantages of FOSS is that you can do this, even
without knowing how to write/update it.
I installed my first Linux OS (Slackware) back in '93 so I have
downloaded and compiled my fair share of Free/OpenSource projects. In
the last century I also got a couple of kernel patches accepted. I
actually checked the code a couple of years ago while investigating
the logic behind the dropdown menu regarding VM types. I found out the
UI code was written in some Java framework that was quite hard to
understand for me.
My main work in oVirt was in packaging/setup/backup/restore, not in
the engine itself or vdsm - the two main parts of the project. But I
know enough to guess that the error message you got is from the
engine. I already have the engine source code git cloned on my laptop,
so grepped it for 'for server to finish reboot', and found this in
backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/VdsCommand.java:
I acknowledge that I could have found the error message in the code,
but I'm not sure I would then have made the connection that led me to
the engine-config(1) tool.
    private void sleepOnReboot(final VDSStatus status) {
        int sleepTimeInSec = Config.<Integer> getValue(ConfigValues.ServerRebootTimeout);
        log.info("Waiting {} seconds, for server to finish reboot process.", sleepTimeInSec);
Even without knowing Java, ServerRebootTimeout seems relevant.
Grepping for this finds it also in:
packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:582:select fn_db_add_config_value('ServerRebootTimeout','600','general');
packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1460:-- Increase default ServerRebootTimeout from 5 to 10 minutes
packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql:1461:select fn_db_update_default_config_value('ServerRebootTimeout', '300', '600', 'general', false);
where it's set and then updated, and in:
packaging/etc/engine-config/engine-config.properties:119:ServerRebootTimeout.description="Host Reboot Timeout (in seconds)"
packaging/etc/engine-config/engine-config.properties:120:ServerRebootTimeout.type=Integer
where it's exposed to engine-config. So if all you want is to get this
error message earlier, this should be enough.
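The two searches described above can be sketched as a small shell helper. This is only an illustration: the function name search_tree and the checkout path in the usage comment are made up here, and plain grep is assumed rather than any particular tool used originally.

```shell
# Search a source checkout for a literal string.
# -r recurses into the tree, -n prints line numbers,
# -F treats the pattern as plain text rather than a regex.
# Output has the form path:line:matching text.
search_tree() {
    tree=$1      # directory holding the source checkout
    needle=$2    # literal text to look for
    grep -rnF -- "$needle" "$tree"
}

# Usage (the checkout path is a placeholder):
#   search_tree ~/src/ovirt-engine 'for server to finish reboot'
#   search_tree ~/src/ovirt-engine 'ServerRebootTimeout'
```

Searching first for a fragment of the log message and then for the identifier found next to it is what leads from the error text to the config key.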
I can confirm that the ServerRebootTimeout is set to 300 in our
current 4.4.1 installation.
I have also tested that I can change it in my 4.5.4 test system using:
engine-config -s ServerRebootTimeout=300
systemctl restart ovirt-engine
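As a sketch, the two commands above can be wrapped so that the current value is shown before and after the change. The wrapper name set_reboot_timeout and the digits-only check are additions made here for illustration; engine-config -g (get) and -s (set) are the tool's own options, and the wrapper needs to run on the engine machine with sufficient privileges.

```shell
# Set ServerRebootTimeout (in seconds) and restart the engine so it takes effect.
set_reboot_timeout() {
    seconds=$1
    # Sanity check added for this sketch: accept only a non-empty run of digits.
    case $seconds in
        ''|*[!0-9]*)
            echo "usage: set_reboot_timeout SECONDS" >&2
            return 2
            ;;
    esac

    echo "Current value:"
    engine-config -g ServerRebootTimeout

    # Store the new value, then restart the engine service.
    engine-config -s "ServerRebootTimeout=$seconds" || return 1
    systemctl restart ovirt-engine || return 1

    echo "New value:"
    engine-config -g ServerRebootTimeout
}

# Usage:
#   set_reboot_timeout 300
```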
However, I also checked the git log (or blame, if you want, but I
prefer the log) for the former file, trying to understand when and why
it was changed from 5 to 10 minutes. 'git log -u
packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql' and then
searching for 'ServerRebootTimeout' finds
https://github.com/oVirt/ovirt-engine/commit/d324bbdd . This links to
https://bugzilla.redhat.com/1947403 . That one sadly does not provide
many more details. It does show that it was done in 4.4.6. So I can
only guess that one of two things happened:
1. Someone complained that hosts become non-operational e.g. because
their boot sequence/POST/whatever takes more than 5 minutes. Perhaps
this was rare enough to be reported and handled only recently (two
years ago, and not, say, 10). (Although I personally managed machines
that needed more than 5 minutes to reboot, or even just test the RAM -
but that's indeed rare).
We have some HP servers in our 4.4.1 installation that take around 6-7
minutes to reboot from selecting Restart through the SSH Management
(dropdown menu). Even though this is longer than 5 minutes (300 secs)
the installation never fails. The state is Reboot for 5 minutes then
it goes into NonResponsive for some minutes until the state is set to
Up.
2. Something else changed, and made this less comfortable. E.g.
perhaps the engine didn't move them in the past to non-operational and
now does, or something like that.
Not sure which of these, if at all.
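The history search mentioned earlier ('git log -u' on 0000_config.sql, then searching the output for 'ServerRebootTimeout') can also be narrowed with git's -S option, which lists only commits where the number of occurrences of a string changed. A minimal sketch, with a made-up function name and a placeholder checkout path:

```shell
# Show commits, with their diffs, that added or removed occurrences
# of a string in one file. -p prints the patch for each commit;
# -S selects commits where the occurrence count of the string changed.
history_of_string() {
    repo=$1      # path to the git checkout
    needle=$2    # string to track
    file=$3      # file to limit the search to, relative to the repo root
    git -C "$repo" log -p -S"$needle" -- "$file"
}

# Usage (the checkout path is a placeholder):
#   history_of_string ~/src/ovirt-engine ServerRebootTimeout \
#       packaging/dbscripts/upgrade/pre_upgrade/0000_config.sql
```

This skips the unrelated commits that a plain log of the file would show, at the cost of missing commits that only move or reword a line without changing how often the string occurs.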
I'm seeing some differences between 4.4.1 and 4.5.4.
In 4.4.1 with default timeout of 10 minutes a host would be set to Up
just after reboot. That could be within 3, 5 or 7 minutes.
In 4.5.4 this seems to have changed.
With a 10 minute timeout a host that reboots in 5 minutes will not be
set to "Up" before the 10 minutes have gone. It's a long time to sit
and wait when you know the host is up.
If I lower the timeout to 1 minute the host state will go from Reboot
to NonResponsive and finally Up very shortly after the host has
started up again.
So in 4.5.4 it seems that hosts can't connect until
ServerRebootTimeout seconds after the reboot was initiated.
You are welcome to change it to some low value using engine-config and
see if it helps. If it's "just enough", you should notice no
difference from previous versions.
Setting the timeout to 1 minute seems to work fine. We will probably
just need to work on our alerting rules because the hosts will be in
NonResponsive state for a while before coming "Up".
3. Try to find out why the above patch was needed and think if it's
important enough for you to dive deeper and provide, or at least
suggest, a fix/change/whatever that will make it unnecessary.
That said, we (the RHV team) are still interested, and if you think
it's a significant issue/bug, we might decide to spend the time and
come up with a fix/change ourselves. In the past, I might have tried
to guess who might know more details and Cced them. But at this point
in time, I decided it's better to explain what users can do by
themselves. I hope this is helpful, and that enough people will take
the time to read, understand, and apply my explanation as applicable.
Best regards,
--
Didi
The issue is not a blocker for us as things work. It's just a matter
of how long we have to wait or deal with the NonResponsive state. We
can install hosts and add them to the cluster without failures.
For versions 4.5.1 - 4.5.3 it was another matter. There the reboot
timeout caused the installations to fail. I'm wondering how this was
not noticed by others or caught by some regression test.
Once again, thanks for your answer.
BR
Peter